antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

DeDup and merged reads #1

Closed · ivelsko closed this issue 3 years ago

ivelsko commented 3 years ago

Hey, I got this question and I don't know how to answer it, so I'm passing it on to you. Since it's code-related, I've made it an issue: "Do you know why they are including --merged in the DeDup command? The DeDup algorithm only works on merged paired-end reads. If you somehow include orphaned unmerged reads, you will end up incorrectly deduplicating reads, because the start/end position of a merged read can be the same as the start/end position of a molecule that wasn't completely sequenced, so you don't know its true end. If they propose this tool for pathogen detection (which I think is what they are doing, although this isn't clear from the manuscript, where all the tests are basically taxonomic profiling), they risk underestimating read counts/coverage estimates, because DeDup would falsely deduplicate unrelated reads."

Is there a specific reason for this flag, or should it maybe be removed to prevent underestimating read counts?

antonisdim commented 3 years ago

Hello Irina, hope you are doing great! We do not deduplicate before the abundance estimation. If we deduplicated before the abundance estimation, we would not be reporting the relative molecular abundance, but instead the relative abundance of unique molecules. We only deduplicate before the mapDamage2 analysis. As the version of mapDamage we use has not been optimised for PE reads, our assumption is that the reads provided as input are either SE or PE collapsed, in which case the --merged flag stands. If PE reads are provided as input, an error is raised explaining that mapDamage cannot be run for that input type. Please let us know if anything is still unclear!
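
For reference, the step in question boils down to a DeDup call on the collapsed reads before mapDamage2; a minimal sketch (file and directory names are placeholders, not the actual pipeline paths) would be:

```bash
# Sketch of the deduplication step run before mapDamage2 (not the exact
# HAYSTAC rule); the input is assumed to be SE or collapsed PE reads.
dedup --input sample_collapsed.bam --merged --output dedup_out/
```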

jfy133 commented 3 years ago

Hi @antonisdim,

I was the person who originally asked @ivelsko above. Thanks for clarifying, and sorry that I got the abundance calculation wrong (I should've looked at the code a bit more in depth).

If it's just mapDamage2 that would be affected, I guess it's more minor in this case. By 'not optimised for PE reads' I'm assuming you mean unmerged reads?

Regardless, if you also want to produce more precise damage profiles, please note the following:

""DeDup was designed for ultra short molecules (where all reads fitting within a sequencing kit's number of cycles) sequenced on paired-end sequencing machines. Therefore it should technically be only used on merged paired-end reads.

While you can supply single-end data and pass --merged for the tool to run, it will assume the SE reads are merged reads and will consider both ends of the reads for deduplication. However, note this is NOT the correct procedure, as you do not necessarily know the true 'end' of a single-ended molecule, and could therefore accidentally be retaining duplicates and falsely increasing your coverage.

Furthermore, while you can mix unmerged reads with merged reads (e.g. if they don't overlap) with the read prefixing system above, please note the following behaviour: reads with M_ with the same start/end coordinates will be deduplicated. Two reads with M_ with the same start but a different end will not be deduplicated. When combining M_ reads with F_ or R_ reads, DeDup will only compare the position of the 5' and 3' ends respectively (ignoring the end/start position respectively). If an F_ and an M_ read (for example) share a start position, the merged read will be selected and the F_ read discarded, with the assumption that the M_ read is of better quality. This therefore takes a fall-back approach similar to that of MarkDuplicates."
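
To make that prefix behaviour concrete, here is a hypothetical set of reads and the decisions described above (read names and coordinates are invented for illustration):

```bash
# M_read1  start=100  end=150   } same start and end -> duplicates,
# M_read2  start=100  end=150   } one copy kept
# M_read3  start=100  end=160   same start, different end -> both kept
# F_read4  start=100  (true end unknown)  shares the 5' start with M_read1
#                                         -> F_ read discarded, M_ read kept
```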

The TL;DR: DeDup is not valid for SE reads in any case and may accidentally retain actual duplicates and skew the profiles. It's recommended in this case to use picard MarkDuplicates instead (which should be straightforward for you, given you are using conda etc.).
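
For SE data that might look something like the following (a sketch; file names are placeholders):

```bash
# Mark and remove duplicates from a single-end BAM with Picard;
# file names are placeholders, not the pipeline's actual paths.
picard MarkDuplicates \
    I=sample_se.bam \
    O=sample_se.dedup.bam \
    M=sample_se.dedup_metrics.txt \
    REMOVE_DUPLICATES=true
```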

I wrote this after extensive discussion with @apeltzer (the original developer of DeDup), and we are going to add it to the DeDup repository now, as this confusion has come up many times.

antonisdim commented 3 years ago

Hello @jfy133,

I hope you are doing great, and apologies for the super late reply!

Thank you for spotting this issue and for your suggestion.

I have updated the code to use picard MarkDuplicates for SE reads and DeDup for collapsed PE reads. This fix will be included in the next version of HAYSTAC, which will be released shortly.
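
For anyone following along, the updated logic is roughly the following (a sketch, not the actual HAYSTAC code; $READ_MODE and the file names are hypothetical placeholders):

```bash
# Choose the deduplication tool based on the library layout;
# variable and file names are hypothetical placeholders.
if [ "$READ_MODE" = "SE" ]; then
    picard MarkDuplicates I=sample.bam O=sample.dedup.bam \
        M=sample.metrics.txt REMOVE_DUPLICATES=true
else  # collapsed PE reads
    dedup --input sample.bam --merged --output dedup_out/
fi
```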

Please do not hesitate to contact us if you have any further feedback regarding HAYSTAC.

Thank you for your help!
Antony