ExaScience / elprep

elPrep: a high-performance tool for analyzing sequence alignment/map files in sequencing pipelines.
Other
287 stars 40 forks source link

Behaviour of markduplicates wrt pairs where one read is mapped to another c'some #6

Closed keithj closed 9 years ago

keithj commented 9 years ago

I've used elprep split to split a BAM file by chromosome. The test chunk for one chromosome has these counts:

samtools flagstat in.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I've run elprep on this chunk:

elprep --mark-duplicates --sorting-order keep in.bam test1.sam

On completion, I run samtools flagstat on the elprep result:

samtools flagstat test1.sam 
136919 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
136919 + 0 mapped (100.00%:-nan%)
136919 + 0 paired in sequencing
64918 + 0 read1
72001 + 0 read2
115896 + 0 properly paired (84.65%:-nan%)
116826 + 0 with itself and mate mapped
20093 + 0 singletons (14.68%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As a comparison I've run bammarkduplicates from https://github.com/gt1/biobambam. The result of samtools flagstat for this is:

samtools flagstat test2.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I didn't ask elprep to remove any reads, so I was surprised to find the number in the output was reduced compared to the input. Comparing the three results, I can see that elprep and bammarkduplicates agree on the duplicate count. Subtracting the number of reads after elprep from the initial value (198799 - 136919 = 61880) yields the number of reads with mate mapped to a different chr.

From this I understand that it removes reads with mates mapped to a different chr. There was no second file written, so are they discarded? Am I using elprep correctly in this case? Thanks.

caherzee commented 9 years ago

Hi Keith,

I think I need to explain better what the elprep split operation does.

When you call the elprep split command, it splits up the file per chromosome. It also creates a file with the unmapped reads (e.g. your-file-name-unmapped.bam), and a file with the pairs where the reads map to different chromosomes (e.g. your-file-name-spread.bam).

The files split per chromosome contain all the reads that map to that chromosome. However, if those reads are part of a pair where the mate maps to a different chromosome, these reads are marked specially with a user-defined tag that tells elPrep this read is “optional information”. These reads are namely also stored in the file for reads that are part of pairs where the the reads map to different chromosomes (cf. your-file-name-spread.bam). So these reads are “duplicated” in two files. The reason for this “duplicate” storing of the reads is to guarantee that elPrep has all information to do duplicate marking correctly per file.

When elPrep processes a file and sees reads that are marked as “optional information”, it will silently remove them from the output. The reason for this, is that we assume that the file with pairs where reads map to different chromosomes will also be processed separately, and that the results will be merged with the elprep merge command. We want to avoid that the same read ends up multiple times in the output.

So, I think what you see is the following. You process a single split file with elPrep. elPrep sees reads that are marked as “optional information” and does not include them in the output. When your process the split file with Biobambam, Biobambam is not aware of the reads that are marked as “optional information”, and simply includes them in the output.

We have always assumed that the elprep split command would be used in conjunction with the elprep merge command. We also always assumed that a user would process all the reads, rather than individual chromosomes.

Do you have a different use case? Would you like to be able to use the elprep split command in combination with other tools than elPrep? Would you like to have control over whether or not the reads we consider “optional information” are included in the output?

I will definitely update the documentation so that it is more clear what the elprep split command currently does.

Thanks for reporting the problem.

Best,

Charlotte

On 18 Mar 2015, at 18:50, Keith James notifications@github.com<mailto:notifications@github.com> wrote:

I've used elprep split to split a BAM file by chromosome. The test chunk for one chromosome has these counts:

samtools flagstat in.bam 198799 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 198799 + 0 mapped (100.00%:-nan%) 198799 + 0 paired in sequencing 119421 + 0 read1 79378 + 0 read2 115896 + 0 properly paired (58.30%:-nan%) 178706 + 0 with itself and mate mapped 20093 + 0 singletons (10.11%:-nan%) 61880 + 0 with mate mapped to a different chr 21845 + 0 with mate mapped to a different chr (mapQ>=5)

I've run elprep on this chunk:

elprep --mark-duplicates --sorting-order keep in.bam test1.sam

On completion, I run samtools flagstat on the elprep result:

samtools flagstat test1.sam 136919 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 84287 + 0 duplicates 136919 + 0 mapped (100.00%:-nan%) 136919 + 0 paired in sequencing 64918 + 0 read1 72001 + 0 read2 115896 + 0 properly paired (84.65%:-nan%) 116826 + 0 with itself and mate mapped 20093 + 0 singletons (14.68%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

As a comparison I've run bammarkduplicates from https://github.com/gt1/biobambam. The result of samtools flagstat for this is:

samtools flagstat test2.bam 198799 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 84287 + 0 duplicates 198799 + 0 mapped (100.00%:-nan%) 198799 + 0 paired in sequencing 119421 + 0 read1 79378 + 0 read2 115896 + 0 properly paired (58.30%:-nan%) 178706 + 0 with itself and mate mapped 20093 + 0 singletons (10.11%:-nan%) 61880 + 0 with mate mapped to a different chr 21845 + 0 with mate mapped to a different chr (mapQ>=5)

I didn't ask elprep to remove any reads, so I was surprised to find the number in the output was reduced compared to the input. Comparing the three results, I can see that elprep and bammarkduplicates agree on the duplicate count. Subtracting the number of reads after elprep from the initial value (198799 - 136919 = 61880) yields the number of reads with mate mapped to a different chr.

From this I understand that it removes reads with mates mapped to a different chr. There was no second file written, so are they discarded? Am I using elprep correctly in this case? Thanks.

— Reply to this email directly or view it on GitHubhttps://github.com/ExaScience/elprep/issues/6.

keithj commented 9 years ago

Hi Charlotte,

Thanks for your very complete explanation. Re-reading the documentation today I can understand where those reads went. I am now using your Python split-merge wrapper, so those reads should be handled transparently.

In my earlier experiments I was trying to break down the process into the smallest steps and run counts at each point to make sure that I understood what was happening.