biocore-ntnu / epic2

Ultraperformant reimplementation of SICER
https://doi.org/10.1093/bioinformatics/btz232
MIT License
55 stars 9 forks source link

Unexpected behavior keep-duplicates 0.0.41 #34

Closed mpaya closed 4 years ago

mpaya commented 4 years ago

Hi. While using the option --keep-duplicates I noticed that it is not working as expected.

The process used on the analysis is:

Usually it's okay to use marked or filtered files on other software with distinct outputs, but here the behavior is weird when using the filtered or marked files. I'll indicate some relevant outputs below:

## initial file, 36366253 reads
# epic2 remove dupes
epic2 -t 1-filt.bam --chromsizes chromsizes --effective-genome-fraction ${egf} \
   --gaps-allowed $gaps --bin-size 50 > /dev/null
Valid ChIP reads: 27116487 (27116487 before out of bounds removal)
 Number of islands found: 27152

# keep dupes
epic2 -t 1-filt.bam --chromsizes chromsizes --effective-genome-fraction ${egf} \
   --gaps-allowed $gaps --bin-size 50 --keep-duplicates > /dev/null
Valid ChIP reads: 31353061 (31353061 before out of bounds removal)
Number of islands found: 29271

## marking dupes with picard, 36366253 reads
# remove dupes
epic2 -t 2-MarkDupes.bam --chromsizes chromsizes --effective-genome-fraction ${egf} \
   --gaps-allowed $gaps --bin-size 50 > /dev/null
Valid ChIP reads: 27048450 (27048450 before out of bounds removal)
 Number of islands found: 27061

# keep dupes
epic2 -t 2-MarkDupes.bam --chromsizes chromsizes --effective-genome-fraction ${egf} \
   --gaps-allowed $gaps --bin-size 50 --keep-duplicates > /dev/null
Valid ChIP reads: 27048450 (27048450 before out of bounds removal)
Number of islands found: 27061

## after removal of dupes, 28956311 reads
epic2 -t 3-Filt.bam --chromsizes chromsizes --effective-genome-fraction ${egf} \
   --gaps-allowed $gaps --bin-size 50 > /dev/null
Valid ChIP reads: 27048450 (27048450 before out of bounds removal)
Number of islands found: 27061

I'd expect:

I get that the process used to remove duplicates may be different, but these shouldn't be dropped out if not asked for even if they are marked. Whenever you have time, please take a look at why this is happening. Thanks

endrebak commented 4 years ago

Thanks for reporting :) .

I am guessing this is happening because I do some filtering by default on bam files. The flag --filter-flag is by default set to 1540. The https://broadinstitute.github.io/picard/explain-flags.html explains this as

    read unmapped (0x4)
    read fails platform/vendor quality checks (0x200)
    read is PCR or optical duplicate (0x400)

So your problem is likely that the reads are dropped due to this. Could you try with filter-flag set to 0 and see if this solves your problem?

endrebak commented 4 years ago

Anyways, I should really print what the --filter-flags were when using bam-files :) Otherwise, I expect this will confuse many people in the future too :)

endrebak commented 4 years ago

@mpaya Congrats on the paper btw!

mpaya commented 4 years ago

Thanks! ^^ After reading your comments, it explains what happened. If filter-flags by default removes duplicates, keep-duplicates may have none to keep. When I have some moment I'll test this option. Best.

mpaya commented 4 years ago

Closing issue, using keep-duplicates and filter-flag set to 0 gives the expected result. Thanks