epi2me-labs / wf-single-cell

Other
69 stars 35 forks source link

Questions about the Barcode/UMI correction #136

Open Francais0620 opened 4 weeks ago

Francais0620 commented 4 weeks ago

Ask away!

To whom it may concern,

I am a user of wf-single-cell and I have some questions regarding the internal workings of the pipeline, specifically concerning the barcode and UMI correction steps. I would greatly appreciate your assistance in clarifying these matters.

  1. Barcode Correction The barcode sequence extracted from adapter1 based on its position and length is referred to as the "uncorrected barcode." This sequence is then filtered based on its sequencing quality and whether it appears in the known barcode file for 10X. Sequences meeting these criteria are termed "corrected barcodes" and are stored in the shortlist file. Is this correct?

    I noticed your documentation mentions: “This threshold is determined by ranking the cells by read count and taking the top n cells (n = expected_cells). The read count 95th percentile / 20 is the threshold used. This threshold can be visualized in the knee plots generated by the workflow.” I am not entirely clear on this explanation.

    There are two columns in the barcode file: one for barcode and one for count. Here, count refers to the number of reads associated with that barcode, i.e., the number of reads per cell. Suppose I set the parameter expected_cells to 100, which means I sort counts in descending order and select the top 100 cells. Using these 100 cells as a whole, the threshold for filtering is based on the 95th percentile of their counts divided by 20. Are the barcodes that meet this threshold then used for further filtering?

    Additionally, you consider barcodes not in the whitelist that may have been affected by minor indels during sequencing. These are compared to the whitelist barcodes, and the closest matching barcode is included. Is my understanding correct?

  2. UMI Correction For UMI correction, since UMIs are random sequences without a reference sequence, you rely on minimap alignment results. Based on these results, you cluster reads mapped to the same gene and extract the shared UMI information. Are there specific criteria used for filtering to determine which UMIs are considered corrected? Furthermore, does the UMI assignment depend on the genes included in the alignment? For example, if I use only a small subset of genes' GTF files as the index, does the number of identified UMIs depend on the number of genes that can be aligned?

  3. Metrics for Correction Proportions Is there a way to calculate the proportion of corrected UMIs relative to all incorrect UMIs during the filtering process? The same question applies to barcodes.

Thank you very much for your assistance!

Best regards,
Anne

nrhorner commented 3 weeks ago

Hi Anne,

Thanks for your questions

1: Barcode Correction

"The barcode sequence extracted from adapter1 based on its position and length is referred to as the "uncorrected barcode." This sequence is then filtered based on its sequencing quality and whether it appears in the known barcode file for 10X. Sequences meeting these criteria are termed "corrected barcodes" and are stored in the shortlist file. Is this correct?"

A: Sequences that match the quality criteria and have 100% match in the 10x whitelist are classed as high quality barcodes; barcodes that we can be confident of existing in the dataset. These are used to build the shortlist of all barcodes we can expect to see.

B: The threshold calculation is defined here: https://github.com/epi2me-labs/wf-single-cell/blob/b6ceb93c413f059b20d87eed07ec21046c201814/bin/workflow_glue/create_shortlist.py#L86 The threshold calculated can be visualised in the kneeplot generated in the report as the vertical line, where cells to the right of this are filtered out

C: Additionally, you consider barcodes not in the whitelist that may have been affected by minor indels during sequencing. These are compared to the whitelist barcodes, and the closest matching barcode is included. Is my understanding correct ? Yes but there are some extra criteria, see this from the workflow overview.


    the query and closest-matched shortlist barcode have an edit distance <= 2
    The edit distance difference between the top and hit and second top hit in the shortlist >= 2

2: UMI Correction Please see section 7 in https://github.com/epi2me-labs/wf-single-cell/blob/master/docs/08_pipeline_overview.md

  1. Metrics for Correction Proportions You can look at read_tags.tsv (will be renamed to read_summary.tsv soon) at count where uncorrected_umi and corrected_umi are different and corrected_barcode and uncorrected_barcode are different.

I hope this helps,

Neil

MenglinC commented 2 weeks ago

Dear Neil,

Thank you very much for your response; it has been very helpful to me.

I now need to address another issue. In my final HTML results, I noticed a significant drop in the ratio of full-length to tagged reads, as well as a steep decline from gene to transcript ratios. bokeh_plot

  1. For full-length to tagged: I understand that among these full-length reads (I have a 3' end library, meaning full-length reads are those that have both complete read1 and TSO sequences), only 40% of the reads have a corrected barcode and corrected UMI, which is what "tagged" refers to?Is my understanding correct?

  2. For gene to transcript: My understanding is that if a read belongs to a gene, it must also belong to a specific transcript. However, when I checked the read_tags.tsv file, I found that this is not always the case. I discovered that there are still a large number of reads that only belong to a gene but not to any transcript of that gene, and the same situation applies to transcripts. Therefore, I would like to understand the situation of these reads that only align to a gene or only align to a transcript. Specifically, why does my data show such a significant drop from gene to transcript? What characteristics does this data indicate?

Attached below is a portion of my read_tags.tsv file for your review.

a gene but not to any transcript of that gene 3a03c306-9f94-43c0-a4b8-92bec0f2a0d8_0 GTCAAGTGCTGAGGACT GTCAAGTGCTGAGGACT 6222277:7774((('' ATGAACCCGCTG ATGAACCCGCTG ,-44/0//5335 PRRC2B - 131394112 13146760 chr9 ab6a6f1c-9259-4969-8f7a-da73c2dd61be_0 TGTCCTGTACAACTAC TGTCCTGTACAACTACG ;KKHGEDSKHGEBADD GTGGAGCCATTC GTGGAGCCATTC B@DDHFEJIIJH AGTPBP1 - 85677502 8574192chr9

a transcript but not to any gene of that transcript dbc0e48f-63d6-4364-99cc-0ee69882a6da_0 TCAGCAAGCTCATCGCG TCAGCAAGCTCATCGCG 4987653410//..001 CTTCGGGTCCAG GCTTGGGTCCAG ///045667544 - ENST00000265381.7 694421669442902 chr9 62271aea-81ac-4e7c-b4dc-585a7c05bfce_0 TGTTCTACGATGCGTGC TGTTCTACGATGCGTGC KBBB@?BABDCBBCFDE GATCCCTAACCG GATCCCTAACCG EHCDDBBCDDEK - ENST00000374874.8 10102876 101270061 chr9 7c3f60a7-4fa2-46d0-8ed5-9f55c43094a0_0 AAGCGTTCGAGATACCT AAGCGTTCGAGATACCT DEDCBEC96557>?664 CTGCAGTCCTCT CTGCAGTCCTCT 8?CBDBCCBHDG - ENST00000381401.11_PAR_Y 1386165 1392113 chrY 2d242fe8-cbe0-4c6a-bab0-a3cae9e985fb_0 GCTCAAACGATACGTAC GCTCAAACGATACGTAC KF:11MSSFIDCFBG>5 TTCCCGTCATGG TTCCCGTCATGG 22163,ABCKIG - ENST00000381401.11_PAR_Y 1386572 1391998 chrY e404cc17-24a0-4bac-a1ed-9fbd09ce14e1_0 ATTCACTATGTCATAG ATTCACTATGTCATAGG SLHGABBEIDFFEGCB GTGTTCCCCGGG GTGTTCCCCGGG BBIKSKGCCBEB - ENST00000381401.11_PAR_Y 1386573 1392113 chrY 1d25b29d-5da3-476c-a9d9-ab03e57fe48b_0 GATCGTATACTTCCAG GATCGTATACTTCCAGT 4331211486956611 TTGTTACCGTGC TTGTTACCGTGC 004433448733 - ENST00000331035.10_PAR_Y 1419819 1420852 chrY e1a7475c-38aa-44a0-b9e1-ef2ef4386bcc_0 GGCTGTGTACAATCTAG GGCTGTGTACAATCTAG LEJDGEGEDMFEHGSFS CGTCATCACGAG CGTCATCACGAG SSNMEFCDCHGG - ENST00000381401.11_PAR_Y 1387395 1392113 chrY