BQSR filters out reads 'not in target region'

pcostanza commented 3 years ago

I noticed in the logs that when using the BQSR option, reads not in target region are filtered out. In our case this is unwanted behavior, since we intend to use the output bam as our only archived copy of the data. As such we would like to keep a maximum amount of data. The read removal doesn't seem to happen when omitting BQSR, so I'm not sure if this is a feature or a bug.

Matthias

Originally posted by @matthdsm in https://github.com/ExaScience/elprep/issues/44#issuecomment-798879126

pcostanza commented 3 years ago

@matthdsm This is actually a "feature" in the sense that it reproduces exactly the same behavior as in GATK: When you pass the -L option to ApplyBQSR, it also produces a BAM file that filters out reads that are not in the specified regions. This also has an impact on HaplotypeCaller: When you pass the -L option to HaplotypeCaller, it will work only on the specified regions, but adds some padding around those. If we wouldn't filter out reads in our BQSR step, our HaplotypeCaller would therefore take more reads into account than the original HaplotypeCaller. (GATK's ApplyBQSR with -L option effectively cancels out the padding in GATK's HaplotypeCaller with -L option.)

If you have a better idea how we should handle this case while remaining compatible, we would be very interested to hear suggestions. :)

Pascal

matthdsm commented 3 years ago

Hi Pascal,

Thanks for the clarification. I get the need for feature parity between Elprep and the GATK, it's a big selling point. With this explanation, I think it's handled as it should be.

Thanks for clearing up, feel free to close. Cheers Matthias

ExaScience / elprep

BQSR filters out reads 'not in target region' #46