LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis

Meaning of `UMI_cor = 2` in sc_gene_counting() #93

Open PeteHaitch opened 5 years ago

PeteHaitch commented 5 years ago

I'm having a hard time understanding the documentation of UMI_cor = 2 in sc_gene_counting():

correct UMI sequencing error: 0 means no correction, 1 means simple correction and merge UMI with distance 1. 2 means merge on both UMI alignment position match.

Could you please clarify?

LuyiTian commented 5 years ago

For UMI_cor = 1, all UMIs that map to the same gene are grouped together and duplicate UMIs are removed.

For UMI_cor = 2, all UMIs that map to the same gene and the same position are grouped together and duplicate UMIs are removed. So UMI_cor = 2 assumes that one molecule, after amplification, would only generate the same fragment. Later on, I realized this is rarely the case: most protocols, including 10X and CEL-seq2, involve pre-amplification before the full-length cDNAs are cut into fragments, so there can be more than one fragment per mRNA molecule. I did not delete the option in case it is useful in some special situation, but most of the time it should not be used.
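To make the difference between the two groupings concrete, here is a minimal Python sketch. It is not scPipe's implementation: the function `count_molecules`, the `(gene, umi, position)` read layout, and the toy reads are all hypothetical, and the sketch only shows exact-duplicate removal, ignoring the distance-1 merging mentioned in the documentation.

```python
from collections import defaultdict

def count_molecules(reads, use_position):
    """Count distinct UMIs per gene (illustrative only, not scPipe code).

    reads: iterable of (gene, umi, position) tuples (hypothetical layout).
    use_position=False groups by gene only, as described for UMI_cor = 1.
    use_position=True groups by gene and position, as described for UMI_cor = 2.
    """
    groups = defaultdict(set)
    for gene, umi, pos in reads:
        key = (gene, pos) if use_position else (gene,)
        groups[key].add(umi)  # the set drops duplicate UMIs within a group

    counts = defaultdict(int)
    for key, umis in groups.items():
        counts[key[0]] += len(umis)  # sum the groups back up per gene
    return dict(counts)

# Toy data: one molecule (UMI AACG) yields two fragments at different positions.
reads = [
    ("GeneA", "AACG", 100),
    ("GeneA", "AACG", 100),  # PCR duplicate of the read above
    ("GeneA", "AACG", 250),  # same molecule, different fragment
    ("GeneA", "TTGC", 100),  # a second molecule
]

by_gene = count_molecules(reads, use_position=False)
by_gene_and_pos = count_molecules(reads, use_position=True)
print(by_gene)          # {'GeneA': 2}
print(by_gene_and_pos)  # {'GeneA': 3}
```

With fragmentation after pre-amplification, grouping by position counts the AACG molecule twice, which is the overcounting described above.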

PeteHaitch commented 5 years ago

Thanks, Luyi. So UMI_cor = 1 is the recommended value? I think it would be useful to update the documentation with those extra details and ensure the default value matches the most common protocol(s). Aside: a description of UMI_cor = 2 is missing from create_report(): https://github.com/LuyiTian/scPipe/blob/02e97841332bb616bab76e9d3ff34d0000e6bb21/R/sc_workflow.R#L203-L211