CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
494 stars 190 forks source link

Can you use UMI-tools for paired end sequences with a plate index? #200

Closed justincg327 closed 4 years ago

justincg327 commented 7 years ago

I'm trying to use UMI-tools to analyze a paired end scRNA seq and could use a little insight. Two, 96-well plates were used with an 8 base pair cell barcode and an 8 base pair UMI added to the 5' end of the read1 sequences and a plate index was added to the to the 5' end of the read2 sequences. The plate index was added so that all the sequencing data could be pooled but which plate each cell came from could be determined. The cells in the first plate received the plate index 5' ATTACTCG 3' and the second plate received 5' TCCGGAGA 3'. I successfully created the whitelist.txt with cell barcodes and UMIs from hgmm_100_R1.fastq but am not sure how to get the plate indices out. Is there anyway that UMI-tools can help me extract this data in addition to the UMIs and cell barcodes? If not, does anyone have any recommendations?

Thanks so much, I'm a novice at single-cell sequencing and could use some help!

TomSmithCGAT commented 7 years ago

Hi @justincg327, just to make sure I understand you correctly, your barcodes take the form

read1 - CCCCCCCCNNNNNNNNXXX.... read2 - PPPPPPPPXXX...

where C = cell barcode, N = umi, P = plate, X=read sequence

I'm also assuming if you have used two 96-well plates that your whitelist should be a pre-defined set of 96 possible cell barcodes.

If this is all correct, the easiest thing to do would be to treat the plate barcode as part of the cell barcode. Since your plate barcode is in the second read, if you take each of the 8bp 96 possible cell barcodes and append the two plate barcodes, this will give you 192 possible 16bp cell-plate barcode combinations you expect to see. You can then run the command below to extract the barcodes and filter against your whitelist, replacing the parts in "[ ]".

umi_tools extract --stdin=[in_fastq.1] --read2-in=[in_fastq.2] --bc-pattern=CCCCCCCCNNNNNNNN --bc-pattern2=CCCCCCCC -L [out.log] --read2-out=[out_fastq.2] --filter-cell-barcode --whitelist=[whitelist.tsv] > [out_fastq.1]

Unfortunately, I discovered a bug in the released version of UMI-tools when using bc-pattern2. To get the above command to work, you will need to re-install from the master branch, or if you've installed via pip or conda, wait until the release of version 0.5.2 (before the end of the week hopefully).

The example below shows the expected output from the above command, where the barcodes are now encoded in the read identifiers in the format: [@read id]_CCCCCCCCPPPPPPPP_NNNNNNNN

read1 input

@SRR1784314.1 1 length=54
GAGGCGTTAGGAGTGATTGCTTGTGACGCCTTCTTCAGGCGATGACGTACTCCG
+SRR1784314.1 1 length=54
---,A+BF7,,,,C,,CE,CFE9C,,C,BFFFFFFF9,,C++@,,;,B,,9,6,

read2 input

@SRR1784314.1 2 length=80
TTTCTCCATCTTCTCCACACCCTTCCCCACCCCCTTCCCCCTACTTTCCCCTCCCTTCCCCCTCCCCCCCTCCTCCTCTT
+SRR1784314.2 2 length=80
666@,CC,,C,,C,6<,,,,,7,6<,,;,+,8+866B,;,+7,,,5;,,,,;,,7;,,;;,8>,,,,+786,6,,,,,94

read1 output:

@SRR1784314.1_GAGGCGTTTCTTCTCC_AGGAGTGA 1 length=54
TTGCTTGTGACGCCTTCTTCAGGCGATGACGTACTCCG
+
CE,CFE9C,,C,BFFFFFFF9,,C++@,,;,B,,9,6,

read2 output:

@SRR1784314.1_GAGGCGTTTCTTCTCC_AGGAGTGA 2 length=80
TCTTCTCCACACCCTTCCCCACCCCCTTCCCCCTACTTTCCCCTCCCTTCCCCCTCCCCCCCTCCTCCTCTT
+
,,,,,7,6<,,;,+,8+866B,;,+7,,,5;,,,,;,,7;,,;;,8>,,,,+786,6,,,,,94
justincg327 commented 7 years ago

Hi, so I attempted to redownload from the master branch and ran the whitelist and extract functions with the following inputs and got the consequent error:

umi_tools whitelist --stdin /Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R1.fastq.gz --bc-pattern=CCCCCCCCNNNNNNNN --bc-pattern2=CCCCCCCC --read2-in=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R2.fastq.gz --set-cell-number 96 --log2stderr > whitelist_real.tsv

umi_tools extract --stdin=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R1.fastq --read2-in=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R2.fastq --bc-pattern=CCCCCCCCNNNNNNNN -bc-pattern2=CCCCCCCC -L out.log --read2-out=hgmm_100_R2_extracted.fastq.gz --filter-cell-barcode --whitelist=/Volumes/PromisePegasus/Justin/UMI/tool_results/whitelist_real.tsv > hgmm_100_R1_extracted.fastq.gz

extract: error: no such option: -b

Have you seen this before?

Also, this may be off-topic but can I use umi-tools to generate FPKM values after generating the cell counts?

Thanks so much!

IanSudbery commented 7 years ago

you've missed out a - as indicated below.

pattern=CCCCCCCCNNNNNNNN -bc-pattern2=CCCCCCCC -L out.log --read2-
                         ^

You can you the count command to produce per-gene read counts of mapped reads. We don't provide an option to produce FPKMs, but the conversion shouldn't be hard to do your self. To convert these to FPKM you would need to know the length of each gene. A rough and ready FPKM would then be

tex2img

justincg327 commented 7 years ago

Thank you so much for your help, its working now!

dbrg77 commented 6 years ago

Hi,

On the same question, we are doing plate based methods, and our reads look similarly like these:

read1 - CCCCCCCCNNNNNNNNXXX....
read2 - PPPPPPPPXXX...

The exact sequences of CCCCCCCC and PPPPPPPP are all known. What does the whitelist.tsv in your exampple should look like? Should I still run umi_tools whitelist first to generate the whitelist.tsv first?

If I want to provide the whitelist.tsv myself (since we know the exact sequence of CCCCCCCC and PPPPPPPP), what should be included?

Thank you very much.

TomSmithCGAT commented 4 years ago

I'm closing this issue due to inactivity. For documentation on how to provide a whitelist to umi_tools extract, see https://umi-tools.readthedocs.io/en/latest/reference/extract.html