arshajii / ema

Fast & accurate alignment of barcoded short-reads
http://ema.csail.mit.edu
MIT License

Trimming 1bp of R2 #14

Closed: vladsavelyev closed this issue 6 years ago

vladsavelyev commented 6 years ago

As far as I understand, you trim 16+7 bases from the first read only and leave the mate read alone:

In summary, in the barcode extraction stage, we remove the 16bp barcode from the first mate of each read pair, and trim an additional 7bp to account for potential ligation artifacts resulting from the barcode ligation process during sequencing (the second mate shares the same barcode as the first mate).

The Longranger authors actually recommend additionally trimming the first base of R2: https://community.10xgenomics.com/t5/Genome-Exome-Forum/Best-practices-for-trimming-adapters-when-variant-calling/td-p/470

In terms of trimming, we recommend trimming the first 16+7bp of R1, and the first 1bp of R2. R1 contains the 16bp 10x barcode + 7bp of low accuracy sequence from an N-mer oligo. The first bp of R2 empirically has about a 5x higher mismatch rate. Given the stats you're showing, I don't expect the trimming to have a huge influence -- my guess is that you'll get the biggest win from filtering poor variants.

Have you guys considered that by any chance? Wondering if it would improve speed or quality at all.

vladsavelyev commented 6 years ago

Hmm, actually, looking manually at Longranger's BAM, they do not seem to be trimming that base :)

inumanag commented 6 years ago

AFAIK second mates are not supposed to be trimmed (at least when we were evaluating Long Ranger that was not the case).

vladsavelyev commented 6 years ago

Well, I was just pointing to their recommendation: "The first bp of R2 empirically has about a 5x higher mismatch rate." It shouldn't make much difference, but I was wondering whether you had explored that by any chance.

arshajii commented 6 years ago

We haven't actually looked at that, but thanks for the reference. In general, we've been trying to be as close to Long Ranger as possible when it comes to stuff like this, which is why we basically just followed their preprocessing formula exactly. In terms of trimming that base, I'd be very surprised if it made any sort of noticeable difference, since it's very unlikely to change any of the candidate alignments (maybe this is why the Long Ranger authors aren't following their own recommendation 😄). The only other possibility would be that it may change the initial alignment scores, but even then it seems unlikely that it'd produce any substantial changes (especially since 10x reads intrinsically have higher error rates towards their ends anyway, compared to e.g. standard Illumina reads -- we take this into account in EMA by having a lower clipping penalty). In any case, it's definitely something to keep in mind; maybe we can make the trimming parameters user-specified with Long Ranger's as the default.
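For illustration only, here is a minimal Python sketch of what user-specified trimming parameters might look like. It is not EMA's or Long Ranger's actual code: the names `TrimParams` and `trim_pair` are hypothetical, and the defaults simply mirror the preprocessing described in this thread (16bp barcode plus 7bp removed from R1, R2 left untouched).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrimParams:
    """Hypothetical trimming settings; defaults follow the thread's description."""
    barcode_len: int = 16    # 10x barcode at the start of R1
    r1_extra_trim: int = 7   # low-accuracy bases following the barcode
    r2_trim: int = 0         # bases removed from the start of R2 (0 = untouched)


def trim_pair(
    r1_seq: str,
    r1_qual: str,
    r2_seq: str,
    r2_qual: str,
    params: TrimParams = TrimParams(),
) -> Tuple[str, str, str, str, str]:
    """Return (barcode, trimmed R1 seq, trimmed R1 qual, trimmed R2 seq, trimmed R2 qual)."""
    # The barcode is kept separately so both mates can be tagged with it.
    barcode = r1_seq[:params.barcode_len]
    # R1 loses the barcode plus the extra low-accuracy bases.
    r1_start = params.barcode_len + params.r1_extra_trim
    return (
        barcode,
        r1_seq[r1_start:],
        r1_qual[r1_start:],
        r2_seq[params.r2_trim:],
        r2_qual[params.r2_trim:],
    )
```

Under these assumptions, calling `trim_pair(...)` with the defaults leaves R2 as is, while passing `TrimParams(r2_trim=1)` would apply the forum recommendation of also dropping the first base of R2.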

vladsavelyev commented 6 years ago

Thanks a lot for the detailed answer! That totally makes sense. I agree that there is no reason why this extra base would affect anything, so I'd leave it as is. Unless someone else requests this in the future, I don't think there is a need to make the trimming parameters customizable now, especially since the LongRanger guys don't seem to stick to their own recommendation 😃