UMI_tools for C1-STRT scRNAseq data?

auesro commented 1 year ago

Dear all,

I am trying to analyze a public dataset, created with the C1-STRT protocol, using umi_tools in order to replicate a previous study analysis.

I got one fastq for each cell, lacking the cell barcode since they were already deposited this way. I got, according to the paper: 6 nucleotides UMI + 3-5 G nucleotides + cDNA (total read length 46 nucleotides). Since I lack the cell barcode, how could I approach the analysis with UMI_tools?

Thanks a lot!

A

TomSmithCGAT commented 1 year ago

Could you post the header of the fastq. It should be straightforward but I just want to confirm what you have

auesro commented 1 year ago

Sure, here you have one of them:

$ samtools view GSM2782726.fastq.gz | head -n 20
Run0258_BC56P9ACXX_L5_T1101_C39 4   *   0   0   *   *   0   0   ATCGATAGGCTGATCGAATTTCCTGGAGGAAAAGTTATTGTAGCTGTTTCA CCCFFF!2AFHHGIJJIIIJJJGIJJIDIFHIJIHHAGHGHGGGHIHIIIC
Run0258_BC56P9ACXX_L5_T1101_C159    4   *   0   0   *   *   0   0   GTTTATAGGGCTGTCTCTTATACACATCTGACGCGAGGCGTATCGTATGCC @@@FFF!2ADAHAHEEFHCHI@HCGCFEII>F<?D?DDAFEHCEHEA@BDF
Run0258_BC56P9ACXX_L5_T1101_C199    4   *   0   0   *   *   0   0   AGTTTTAGGCAGCGTCACGCTGAGGGACTCATCCTGGGCTACAATCCTATT @@?DFF!2ACDD=@GHIGIDFHEIIGEAGGIIGGI<9?BD=CGA<8@@AA>
Run0258_BC56P9ACXX_L5_T1101_C208    4   *   0   0   *   *   0   0   TGGGCTAGGGGGAATTCTGGACATTAATTAGGGCTGAAAGCCCTAACTTAA @@@?D;!2<AAFCFEGE<?>HFF399CGE>CEDFFFFFEG>GGIAFIFIIF
Run0258_BC56P9ACXX_L5_T1101_C248    4   *   0   0   *   *   0   0   GCAGGAAACTCAGCCCGAGGAAATCGCAGATAAGTTTTTAATTAAAAAGAT CCCFFF!2ADFHHJIJJIIIJJHHIJJJJGIIJIFIGBFHGIJJJIIJJJH
Run0258_BC56P9ACXX_L5_T1101_C249    4   *   0   0   *   *   0   0   TGTATCAGGGGGGAATTCTGGACATTAATTAGGGCTGAAAGCCCTAACTTA ?@:AB>!22+A=B8;?>ADD3>?BBB>AADBAB???68>A8A??<8?BBD>
Run0258_BC56P9ACXX_L5_T1101_C267    4   *   0   0   *   *   0   0   GGGCTGAGGACTTGGCTGGTATACTTTCTCTTTGGGTTGGGGTCAGAGTTG CC@FFF!4AFFFHJIJGHIEHBFHHIIIIJGGHIID;DBGHIFHCFBC8CC
Run0258_BC56P9ACXX_L5_T1101_C352    4   *   0   0   *   *   0   0   GGGCATAATCTGTCCTTATAGCTCATTAGGAAGAGAAACAGTGTTGTCAGG CCCFFF!2AFHHHJJJJJJJJJJJJJJJJJJJIHGIIIJJJEHJJJJIJJJ
Run0258_BC56P9ACXX_L5_T1101_C370    4   *   0   0   *   *   0   0   TTATACAGGGGGGTTGGGGATTTAGCTCAGTGGTAGAGCTGTCTCTTATAC CCCFFF!2ADFHD7BDDDDDBDDDDDDDDD@CDCCDDDDDDDCDDDDDDDD
Run0258_BC56P9ACXX_L5_T1101_C458    4   *   0   0   *   *   0   0   TGGACAAGATCCAGCAACTTATCATTTAAGTAAATAGTAAAATTAAAATCT ???DD;!42ADDDIIDBEIIDEF@FEIICEEEEEFIIF<DBDFICE4?DED
Run0258_BC56P9ACXX_L5_T1101_C471    4   *   0   0   *   *   0   0   TTAAAAAGGGGGGTTGGGGATTTAGCTCAGTGGTAGAGCGGTTTCCTAGGA @@@ADB!2ADFF?-6:16B@BCCA@CCCC?8>CC@88+()&&)&(++(((+
Run0258_BC56P9ACXX_L5_T1101_C496    4   *   0   0   *   *   0   0   AACGTAAGGATATACCGCCATCTTCAGCAAACCCTAAAAAGGTATTAAAGT CCCFFF!2ADHFHGJIJJJJJJJJJGJJJEHHIJJIJGGIIJDHGIBHIIB
Run0258_BC56P9ACXX_L5_T1101_C537    4   *   0   0   *   *   0   0   CATTTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGGGGGGGACCAA @BCFFF!2AFHHHJJJJJJJJHFDDDDDDDDD60&(&&))&0-6&&&&(((
Run0258_BC56P9ACXX_L5_T1101_C546    4   *   0   0   *   *   0   0   TTGTGCAGGGGACCACTAACCCATAACCATTTCATTTATCCCCATTATGCT CCCFFF!4ADHHHIJJGIJJIIGJIIIDGGIJIIIGIIJIDEHGHIGJJIF
Run0258_BC56P9ACXX_L5_T1101_C603    4   *   0   0   *   *   0   0   GCCTCTAGGCACTTAGATATTTTAAAGAGGCATCTATCACATAAGGCATCA CCCFFF!2AFHHHJJJJJJJJJJJIIJJJIIJIJJJJJJJJIJJIIJIJJG
Run0258_BC56P9ACXX_L5_T1101_C633    4   *   0   0   *   *   0   0   GGCTGAAGGGGGAATTCTGGACATTAATTAGGGCTGAAAGCCCTAACTTAA CCCFFD!2AFHHDIJJJJEIHIJJJJJJJJJIJJJJJ@HIFGIJEIJIJJC
Run0258_BC56P9ACXX_L5_T1101_C658    4   *   0   0   *   *   0   0   TTGGCGAGGAGGAGCTATAGAACTAGTACCGCAAGGGAAAGATGAAAGACT @C@FFF!2AADH1C=FGHGIGIIIIIHECHFDFHIIIDFHICEHEEHHFDG
Run0258_BC56P9ACXX_L5_T1101_C729    4   *   0   0   *   *   0   0   AGTATCAGGGGGGTTGGGGATTTAGCTCAGTGGTCTGTCTCTTATACACAT @@?BDD!2ACFAB5<@BBB?BCCC@8A@3>8@C(8AC3>C@AC388>>@C@
Run0258_BC56P9ACXX_L5_T1101_C795    4   *   0   0   *   *   0   0   AGCCTAAGGGTGTCTGTTTAGGAGTGAATTAGAACACTTAGTGATGTGGGA CCCFFF!2AFCFFHIJHIIJJJJJFGEIIJJIEGIHHIJJJHIJIIIJDI9
Run0258_BC56P9ACXX_L5_T1101_C799    4   *   0   0   *   *   0   0   CGGTTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTAATTTTATTTTTTTTT ?:8BDD!2AFHHHIIIIIIIIHEBBBBBBBBBB6&((+(+3((+8ACBBBB

Thanks!

IanSudbery commented 1 year ago

If you have one fastq per cell, then its not necessary to have a cell barcode, because you can tell which cell a read comes from by which file it is in!

You should run extract on each fastq file with either --bc-pattern=NNNNNN or --bc-pattern='^(<umi_1>.{6})(<discard_1>G{3,5})' --extract-method=regex which would check to make sure the correct number of Gs was present.

Map the reads with your aligner of choice.

Run dedup on each of the aligned BAM files, and then run featureCounts on each of the deduplicated BAMs.

auesro commented 1 year ago

Hi @IanSudbery ,

Thanks for the explanation, it makes sense, I think I can make it work.

Question: why do I need to run dedup? I thought that was only necessary for non-single cell applications (I might be very well wrong, im more a biologist that bioinformatician)

And finally, how do I approach exon+intron counting with feautureCounts? I dont need to keep both exonic and intronic counts separated, just count both and get the total result (ala cellranger count --include-intros).

Thanks!

A

IanSudbery commented 1 year ago

What you are doing here looks far more like a large number of bulk RNA-seq samples than it does a single cell sample.

Normally when you do single cell, you would used umi_tools count which demultiplexes, deduplicates and counts all in one, taking in a fastq that has reads for all cells in one file and outputting a count matrix.

Because the demultiplexing step has already been carried out (mechanically - C1 cells are captured in different wells and independent libraries are created for each cell), you can't use count and will have to do the deduplicating and counting steps yourself on each sample seperately and combine the results.

In 3' tagging experiments, the UMI is added to the 5' end of the downstream (oligodT) primer during RT (i.e. the 3' end of the transcript). PCR then occurs and the other ends are generated by random fragmentation after PCR. That means that two fragments can have the same UMI, but different mapping locations and still be from the same original molecule. In this situation, you need --per-gene, and need to assign reads to genes before deduplicating. I can't quite follow the STRT-C1 protocol in the time i've got available, to see if this applies here as well, or if position is informative.

To do the exon and intron counting together you can pass -t transcript to featureCounts, as long as your gtf file has transcript lines (the ensembl one definitely does).

auesro commented 1 year ago

Hi @IanSudbery

Very detailed response, thanks a lot!

I was under the impression that I could run count without specifying --per-cell which would assume all reads as coming from the same "cell". I was probably wrong?

Yea, I am using the 10x Genomics reference, and it contains the transcript lines. That should do the trick.

TomSmithCGAT commented 1 year ago

Hi @auesro.

You can run count without specifying --per-cell. Indeed, in the original UMI-tools paper, that's what we did for the scRNA-Seq, since there were only 384 cells.

You can assign genes to the reads as detailed in our scRNA-Seq guide and follow @IanSudbery's suggestion to add -t transcript to count the reads in the introns. Then pass to umi_tools count without the --per-cell as you intended.

auesro commented 1 year ago

Thanks a lot, @TomSmithCGAT !!

CGATOxford / UMI-tools

UMI_tools for C1-STRT scRNAseq data? #589