Name sorted bam of different length

freekvh commented 5 years ago

Hi,

I'm using the C version of disambiguate (installed with conda as ngs_disambiguate). I mapped my reads with STAR against the human and mouse reference genomes, then name-sorted the BAM files and fed them to disambiguate. I noticed that the two BAM files that go into disambiguate differ in length (number of lines). I guess this is because of multimapping reads: some reads map to multiple locations in human but not in mouse, and each alignment of a multimapping read is represented on its own line in the BAM file.

My question is: are my assumptions correct, and if so, can disambiguate deal with this, or is it really comparing line by line (in which case it would go wrong)?

And while I am here: can I also disambiguate BAM files that were aligned against the transcriptome (using STAR's option --quantMode TranscriptomeSAM)? Then I could feed the result into RSEM immediately. Or do I have to disambiguate a genome BAM file, convert it back to FASTQ files, and then feed those to RSEM?

Highest regards,

Freek.

mjafin commented 5 years ago

Hi Freek, Thanks for the question. The algorithm compares by read name, regardless of how many alignments (primary or secondary) the read has (none to arbitrarily many). Therefore there is no requirement for there to be the same number of lines in the files.
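To illustrate why differing line counts are harmless when the comparison is keyed on read name, here is a minimal Python sketch. It is not the disambiguate implementation: the `(read_name, score)` tuples, the "higher score wins" rule, and the function names `best_score_by_read` and `assign_reads` are all assumptions made for illustration only.

```python
def best_score_by_read(alignments):
    """Collapse (read_name, score) records into the best score per read name.

    A read may appear on any number of lines (multimappers); only its
    name matters for grouping, not its position in the file.
    """
    best = {}
    for name, score in alignments:
        if name not in best or score > best[name]:
            best[name] = score
    return best


def assign_reads(human, mouse):
    """Assign each read name to a species using a hypothetical score.

    Assumption for this sketch: a higher score means a better alignment.
    Reads seen in only one input are assigned to that species.
    """
    h = best_score_by_read(human)
    m = best_score_by_read(mouse)
    calls = {}
    for name in h.keys() | m.keys():
        hs, ms = h.get(name), m.get(name)
        if ms is None or (hs is not None and hs > ms):
            calls[name] = "human"
        elif hs is None or ms > hs:
            calls[name] = "mouse"
        else:
            calls[name] = "ambiguous"
    return calls


# Toy inputs of different length: r1 is a multimapper in human only.
human = [("r1", 90), ("r1", 85), ("r1", 80), ("r2", 70), ("r3", 60)]
mouse = [("r1", 88), ("r2", 70), ("r4", 50)]

calls = assign_reads(human, mouse)
for name, species in sorted(calls.items()):
    print(name, species)
```

In this toy example the human input has five records and the mouse input has three, yet every read name gets exactly one call, because records are grouped by name before anything is compared.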

Aligning to the transcriptome shouldn't matter, as the comparison is based on read name only. Let me know how it goes and if you come across any problems.

The bcbio pipeline has a different approach to transcriptome disambiguation using sailfish if you're interested in a very fast approach.

Best wishes, Miika

freekvh commented 5 years ago

Hi Miika,

Thanks for answering, I got it working as expected, thank you.

Freek.

AstraZeneca-NGS / disambiguate