Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International

How are R1 reads filtered #88

Closed seb-mueller closed 4 years ago

seb-mueller commented 5 years ago

I've just been reviewing the code for R1 read filtering. In my understanding, an R1 read should be filtered if an adapter reaches into the barcode and/or UMI region, right?

According to the code, filtering is done using the code below for rule cutadapt_R1:

https://github.com/Hoohm/dropSeqPipe/blob/9099c5995db8054ec11ca3e5492c268bca308805/rules/filter.smk#L30-L37

--overlap is described as follows in the cutadapt manual:

Minimum overlap (reducing random matches)

Since Cutadapt allows partial matches between the read and the adapter sequence, short matches can occur by chance, leading to erroneously trimmed bases. For example, roughly 25% of all reads end with a base that is identical to the first base of the adapter. To reduce the number of falsely trimmed bases, the alignment algorithm requires that, by default, at least three bases match between adapter and read.

This minimum overlap length can be changed globally (for all adapters) with the parameter --overlap (or its short version -O). Alternatively, use the adapter-specific parameter min_overlap to change it for a single adapter only. Example: -a "ADAPTER;min_overlap=5" (the quotation marks are necessary).

If a read contains a partial adapter sequence shorter than the minimum overlap length, no match will be found (and therefore no bases are trimmed).

The way I read this, a read is trimmed only if the adapter overlaps by at least the length of the barcode. I don't think this is sensible. Shouldn't the adapter be trimmed with any overlap (e.g. `--overlap=1`) and the filtering be done using the minimum length, e.g. a read being filtered if its trimmed length < barcode+UMI?
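To make the suggestion concrete, here is a minimal Python sketch of the logic I have in mind. It is not cutadapt's algorithm and not the pipeline's code: it only does exact matching (no mismatches or indels), and the function names, the toy adapter, and the barcode/UMI lengths of 12 and 8 are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 1) -> str:
    """Toy 3' adapter trimming with exact matching only (hypothetical helper)."""
    # Case 1: the full adapter occurs in the read; cut at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: a prefix of the adapter runs off the end of the read.
    # Try the longest prefix first, down to min_overlap bases.
    start = min(len(adapter) - 1, len(read))
    for k in range(start, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[: len(read) - k]
    # No match of at least min_overlap bases: nothing is trimmed.
    return read


def keep_r1(read: str, adapter: str, barcode_len: int = 12,
            umi_len: int = 8, min_overlap: int = 1) -> bool:
    """Keep R1 only if barcode + UMI survive trimming (assumed lengths)."""
    trimmed = trim_3prime_adapter(read, adapter, min_overlap)
    return len(trimmed) >= barcode_len + umi_len


ADAPTER = "GGAAGG"                      # toy adapter, not a real sequence
clean = "T" * 12 + "C" * 8              # 12 nt barcode + 8 nt UMI, no adapter
contaminated = "T" * 12 + "CCCCC" + "GGA"  # adapter prefix reaches into the UMI

print(keep_r1(clean, ADAPTER))                          # True
print(keep_r1(contaminated, ADAPTER))                   # False
print(keep_r1(contaminated, ADAPTER, min_overlap=12))   # True
```

With `min_overlap=12` the contaminated read is never trimmed and therefore passes the length filter, which is the behaviour I am worried about. The flip side of `min_overlap=1` is that a read whose last base happens to equal the first adapter base loses that base and is dropped, which is the random-match effect the manual excerpt above describes, so a slightly larger overlap may be a reasonable compromise.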

Hoohm commented 4 years ago

This was changed in 0.5. Filtering is now based on the index of the last UMI base.