Closed: ljgearing closed this issue 3 years ago
Hi Jamie, sorry, I have not checked the issue page recently. Currently the distance is hard-coded in the program, so you cannot change it, and N bases are not counted correctly in the matching step. As a simple fix, you could trim the last base of the UMI sequence. The large number of N bases is likely caused by sequencing issues, especially when a polyA sequence follows the UMI, which is hard for Illumina to sequence; the base quality might drop significantly.
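The trimming workaround suggested above can be sketched outside scPipe as follows. This is only an illustration in Python, not scPipe code: the helper `trim_last_base` and the example UMIs are hypothetical, and in practice the trimming would happen when the reads are prepared, before counting.

```python
def trim_last_base(umi: str) -> str:
    """Drop the final base of a UMI, so a 10-mer becomes a 9-mer."""
    return umi[:-1]


# Hypothetical 10-mer UMIs; the second ends in N at the low-quality last position.
umis = ["ACGTACGTAC", "ACGTACGTAN", "TTGCATGCAA"]

# After trimming, the first two UMIs become identical 9-mers.
trimmed = [trim_last_base(u) for u in umis]
unique_trimmed = sorted(set(trimmed))
print(unique_trimmed)
```

Trimming removes a trailing N without discarding the read, at the cost of one base of UMI complexity.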
Dear Luyi, Thank you for your reply. I did try trimming the last base, as you suggested, and found that, for my data set at least, there was not much difference between the final UMI counts if I kept most sequences and used a 9-mer UMI or if I discarded the sequences with N bases and used the full 10-mer UMI. I think you are right that it was a sequencing issue, because the sequencing facility I have been working with recently changed their protocol slightly and the base calling in the UMI does not seem to be much of a problem anymore. Best regards, Jamie.
Dear LuyiTian et al.,
Thank you for developing this really useful package.
I recently had some data in which about 25% of reads had N bases in the UMI sequence. Generally this was due to a drop in quality at the end of the reads: some UMI sequences ended in N, although a small proportion also ended in NN (etc.). Although, for this dataset, including such reads was not particularly important, I tried setting the `rmN = FALSE` option in the `sc_trim_barcode()` function, so that they were included in subsequent steps. At the final `sc_gene_counting()` step with `UMI_cor = 1`, reads with multiple N bases behave differently.

These sequences result in one UMI count, because there is only a single mismatch:

These sequences result in two UMI counts, because there are two mismatches, although the rest of the sequences and the positions are identical:
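As a rough illustration of the behaviour described above, using made-up 10-mer UMIs (hypothetical, not taken from the dataset), a plain Hamming distance that treats N like any other base can be sketched in Python. This assumes that `UMI_cor = 1` means UMIs within one mismatch are merged; the names `hamming` and `MAX_DIST` are illustrative and not part of scPipe.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions; N is treated as an ordinary character."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


MAX_DIST = 1  # assumed to mirror UMI_cor = 1

ref = "ACGTACGTAC"    # hypothetical UMI without N
one_n = "ACGTACGTAN"  # same UMI with one trailing N
two_n = "ACGTACGTNN"  # same UMI with two trailing N bases

dist_one = hamming(ref, one_n)
dist_two = hamming(ref, two_n)

# One mismatch is within the threshold, so the UMIs would be merged;
# two mismatches exceed it, so they would be counted separately.
print(dist_one, dist_one <= MAX_DIST)
print(dist_two, dist_two <= MAX_DIST)
```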
I was wondering whether there could be an option in `sc_trim_barcode()` to just permit a single N in the UMI sequence and remove any reads with two or more N bases. Alternatively, perhaps the distance measure used to compare sequences could take the presence of N bases into account when calculating the counts in `sc_gene_counting()`.

Thank you for your time. Best regards,
Jamie.
PS: Originally, I was using scPipe v. 1.8.0, but I have tested the `sc_gene_counting()` step using v. 1.14.0 and observed the same behaviour.
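The two suggestions in the post above (allowing at most one N per UMI, or using a distance that accounts for N bases) could look roughly like the following Python sketch. Neither function exists in scPipe; `keep_umi`, `max_n`, and `n_aware_distance` are hypothetical names, and the example UMIs are made up.

```python
def count_n(umi: str) -> int:
    """Number of N bases in a UMI."""
    return umi.upper().count("N")


def keep_umi(umi: str, max_n: int = 1) -> bool:
    """Keep a read only if its UMI has at most max_n N bases (hypothetical option)."""
    return count_n(umi) <= max_n


def n_aware_distance(a: str, b: str) -> int:
    """Hamming distance that skips positions where either base is N."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b) if x != "N" and y != "N")


# Hypothetical UMIs
umis = ["ACGTACGTAC", "ACGTACGTAN", "ACGTACGTNN"]

# Option 1: filter out UMIs with two or more N bases.
kept = [u for u in umis if keep_umi(u)]
print(kept)

# Option 2: ignore N positions when comparing UMIs.
print(n_aware_distance("ACGTACGTAC", "ACGTACGTNN"))
```

Treating N as a wildcard keeps more reads, but it can merge UMIs that genuinely differ at the masked positions, so the filter is the more conservative of the two options.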