MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License

Should the .rmdup.collapsed.bam and .rmdup.normal.bam be merged? #43

Closed vitalirazumov closed 1 year ago

vitalirazumov commented 3 years ago

Hi Mikkel,

I used your pipeline in my Master's thesis to trim and map my target capture reads. The pipeline ran without errors, and as a result I got .rmdup.collapsed.bam and .rmdup.normal.bam files for each individual. My next goal is to calculate coverage per gene. I have a question concerning the further treatment of these files: would you recommend merging the two files somehow or treating them separately, or is this not an appropriate way to deal with PCR duplicates? Our data is of very high quality, with 150 bp PE reads. So, if you think it is more appropriate, we could also skip collapsing overlapping reads and use them as normal PE reads.

Best regards, Vitali Razumov

MikkelSchubert commented 3 years ago

Hi Vitali,

The two files you mention (.rmdup.collapsed.bam and .rmdup.normal.bam) are intermediate files generated by the pipeline, prior to it filtering PCR duplicates. You should not be using these files without good reason. The BAM file you should be using is the ${Sample}.${Genome}.bam file located in the root of your output directory (by default, the same directory as the YAML file). That file will have been appropriately filtered for PCR duplicates, unless you specifically turned that off.

See here for a detailed description of the output files of the pipeline: https://paleomix.readthedocs.io/en/stable/bam_pipeline/filestructure.html
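As a rough illustration of the naming convention described above, the following Python sketch builds the expected path of the final BAM and flags the two intermediate file types by their suffix. The function names and the example values ("project", "ind01", "hg19") are hypothetical and not part of PALEOMIX; only the ${Sample}.${Genome}.bam pattern and the two suffixes come from this thread.

```python
from pathlib import Path

# Suffixes of the intermediate files mentioned in this thread.
INTERMEDIATE_SUFFIXES = (".rmdup.collapsed.bam", ".rmdup.normal.bam")


def final_bam_path(output_root, sample, genome):
    """Expected location of the final BAM: <output_root>/<sample>.<genome>.bam.

    Hypothetical helper; it only formats a path and does not check
    that the file exists.
    """
    return Path(output_root) / f"{sample}.{genome}.bam"


def is_intermediate_bam(path):
    """True if the filename ends in one of the intermediate suffixes."""
    return Path(path).name.endswith(INTERMEDIATE_SUFFIXES)


if __name__ == "__main__":
    # Example values are placeholders, not real sample or genome names.
    print(final_bam_path("project", "ind01", "hg19"))
    print(is_intermediate_bam("lib1.rmdup.collapsed.bam"))
```

A helper like this can be used to make sure a downstream coverage script is pointed at the final BAM rather than at one of the intermediate files.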

I would generally recommend merging PE reads, regardless of read quality. The only exception is if the DNA fragment size is much greater than about 300 bp (i.e. twice your read length), since false positive merged reads will likely dominate in that case.
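The "twice your read length" rule of thumb can be made concrete with a small Python sketch. It assumes an idealised case: both mates have the same length, adapters are already trimmed, and there are no indels. The function name is hypothetical. Two mates of length L sequenced from opposite ends of a fragment of length F share roughly 2L - F bases when F is at least L, and the whole fragment when F is shorter than L.

```python
def expected_overlap(fragment_bp, read_bp):
    """Approximate number of bases shared by the two mates of a pair.

    Idealised assumption: equal read lengths, adapters trimmed, no indels.
    """
    if fragment_bp <= 0 or read_bp <= 0:
        raise ValueError("lengths must be positive")
    # Each mate covers at most the whole fragment.
    covered = min(fragment_bp, read_bp)
    # Bases sequenced twice = total bases read minus fragment length.
    return max(0, 2 * covered - fragment_bp)


if __name__ == "__main__":
    # With 150 bp reads, true overlaps vanish once fragments reach 300 bp.
    for fragment in (100, 200, 290, 300, 400):
        print(fragment, expected_overlap(fragment, 150))
```

When most fragments are longer than twice the read length, the expected overlap is zero, so any overlap a read merger reports for such pairs is spurious. This is consistent with the exception described above.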

Best regards, Mikkel