LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis

UMI counts with N bases #133

Closed by ljgearing 3 years ago

ljgearing commented 3 years ago

Dear LuyiTian et al.,

Thank you for developing this really useful package.

I recently had some data in which about 25% of reads had N bases in the UMI sequence. Generally this was due to a drop in quality at the end of the reads: some UMI sequences ended in N, although a small proportion ended in NN (or longer runs). Including such reads was not particularly important for this dataset, but I tried setting rmN = FALSE in sc_trim_barcode() so that they would be kept in subsequent steps. At the final sc_gene_counting() step with UMI_cor = 1, UMIs with a single N base and UMIs with multiple N bases are handled differently.

These sequences result in one UMI count, because there is only a single mismatch:

ENSMUSG00000102666,ACCCAACGCC,0
ENSMUSG00000102666,ACCCAACGCN,0

These sequences result in two UMI counts, because there are two mismatches, although the rest of the sequences and the positions are identical:

ENSMUSG00000102666,ATTATCCACT,32
ENSMUSG00000102666,ATTATCCANN,32
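To make the mismatch arithmetic explicit, here is a minimal Python sketch of a plain Hamming distance applied to the two pairs above. It assumes that N is compared like any other character, which is consistent with the counts observed; the function name `hamming` is hypothetical and this is not scPipe's actual code.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ.

    Assumption: 'N' is treated as an ordinary character, so N vs. C
    counts as a mismatch. This is an illustration, not scPipe code.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# First pair: only the last base differs, so a threshold of 1 merges them.
print(hamming("ACCCAACGCC", "ACCCAACGCN"))  # 1

# Second pair: the last two bases differ, so a threshold of 1 keeps them apart.
print(hamming("ATTATCCACT", "ATTATCCANN"))  # 2
```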

I was wondering whether there could be an option in sc_trim_barcode() to just permit a single N in the UMI sequence and remove any reads with two or more N bases. Alternatively, perhaps the distance measure used to compare sequences could take the presence of N bases into account, when calculating the counts in sc_gene_counting().
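Both suggestions can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not an existing scPipe option: the names `keep_umi`, `max_n` and `hamming_n_aware` are hypothetical.

```python
def keep_umi(umi: str, max_n: int = 1) -> bool:
    """Option 1 (hypothetical filter): keep a read only if its UMI
    contains at most `max_n` N bases."""
    return umi.upper().count("N") <= max_n


def hamming_n_aware(a: str, b: str) -> int:
    """Option 2 (hypothetical distance): count mismatches, but treat N
    as an unknown base that matches anything."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y and "N" not in (x, y) for x, y in zip(a, b))


# Option 1: a UMI with one N passes, a UMI with two N bases is dropped.
print(keep_umi("ACCCAACGCN"))  # True
print(keep_umi("ATTATCCANN"))  # False

# Option 2: positions holding an N no longer count as mismatches.
print(hamming_n_aware("ATTATCCACT", "ATTATCCANN"))  # 0
```

One caveat with the second option: a wildcard N makes matching non-transitive (ATTATCCANN matches both ATTATCCACT and ATTATCCAGG, which differ from each other at two positions), so the result of collapsing could depend on the order in which UMIs are compared.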

Thank you for your time. Best regards,

Jamie.

PS: Originally, I was using scPipe v. 1.8.0, but I have tested the sc_gene_counting() step using v. 1.14.0 and observed the same behaviour.

LuyiTian commented 3 years ago

Hi Jamie, sorry, I have not been following the issue page recently. Currently the distance is hard-coded in the program, so you cannot change it, and N is not handled correctly in the matching step. As a simple fix, you could trim the last base of the UMI sequence. The large number of N bases is likely caused by sequencing issues, especially when a poly(A) sequence follows the UMI, which is hard for Illumina to sequence, so the base quality might drop significantly.
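As a rough illustration of why trimming helps, the Python sketch below drops the last base from the example UMIs in the report and recounts mismatches. The helper names `trim_umi` and `mismatches` are hypothetical, and the comparison is a plain character-by-character count, assumed here only for illustration.

```python
def trim_umi(umi: str, n_bases: int = 1) -> str:
    """Drop `n_bases` from the 3' end of a UMI (hypothetical helper)."""
    if not 0 <= n_bases < len(umi):
        raise ValueError("n_bases must be >= 0 and shorter than the UMI")
    return umi[: len(umi) - n_bases]


def mismatches(a: str, b: str) -> int:
    """Plain mismatch count; N is compared like any other character."""
    return sum(x != y for x, y in zip(a, b))


pair = ("ATTATCCACT", "ATTATCCANN")
trimmed = tuple(trim_umi(u) for u in pair)

print(trimmed)               # ('ATTATCCAC', 'ATTATCCAN')
print(mismatches(*pair))     # 2: kept apart when one mismatch is allowed
print(mismatches(*trimmed))  # 1: merged when one mismatch is allowed
```

The trade-off is that a 9-mer UMI has a quarter of the possible sequences of a 10-mer, so distinct molecules are somewhat more likely to collide.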

ljgearing commented 3 years ago

Dear Luyi, Thank you for your reply. I did try trimming the last base, as you suggested, and found that, for my data set at least, there was not much difference between the final UMI counts if I kept most sequences and used a 9-mer UMI or if I discarded the sequences with N bases and used the full 10-mer UMI. I think you are right that it was a sequencing issue, because the sequencing facility I have been working with recently changed their protocol slightly and the base calling in the UMI does not seem to be much of a problem anymore. Best regards, Jamie.