Closed justincg327 closed 4 years ago
Hi @justincg327, just to make sure I understand you correctly, your barcodes take the form
read1 - CCCCCCCCNNNNNNNNXXX.... read2 - PPPPPPPPXXX...
where C = cell barcode, N = umi, P = plate, X=read sequence
I'm also assuming if you have used two 96-well plates that your whitelist should be a pre-defined set of 96 possible cell barcodes.
If this is all correct, the easiest thing to do would be to treat the plate barcode as part of the cell barcode. Since your plate barcode is in the second read, if you take each of the 8bp 96 possible cell barcodes and append the two plate barcodes, this will give you 192 possible 16bp cell-plate barcode combinations you expect to see. You can then run the command below to extract the barcodes and filter against your whitelist, replacing the parts in "[ ]".
umi_tools extract --stdin=[in_fastq.1] --read2-in=[in_fastq.2] --bc-pattern=CCCCCCCCNNNNNNNN --bc-pattern2=CCCCCCCC -L [out.log] --read2-out=[out_fastq.2] --filter-cell-barcode --whitelist=[whitelist.tsv] > [out_fastq.1]
Unfortunately, I discovered a bug in the released version of UMI-tools when using bc-pattern2. To get the above command to work, you will need to re-install from the master branch, or if you've installed via pip
or conda
, wait until the release of version 0.5.2 (before the end of the week hopefully).
The example below shows the expected output from the above command, where the barcodes are now encoded in the read identifiers in the format: [@read id]_CCCCCCCCPPPPPPPP_NNNNNNNN
read1 input
@SRR1784314.1 1 length=54
GAGGCGTTAGGAGTGATTGCTTGTGACGCCTTCTTCAGGCGATGACGTACTCCG
+SRR1784314.1 1 length=54
---,A+BF7,,,,C,,CE,CFE9C,,C,BFFFFFFF9,,C++@,,;,B,,9,6,
read2 input
@SRR1784314.1 2 length=80
TTTCTCCATCTTCTCCACACCCTTCCCCACCCCCTTCCCCCTACTTTCCCCTCCCTTCCCCCTCCCCCCCTCCTCCTCTT
+SRR1784314.2 2 length=80
666@,CC,,C,,C,6<,,,,,7,6<,,;,+,8+866B,;,+7,,,5;,,,,;,,7;,,;;,8>,,,,+786,6,,,,,94
read1 output:
@SRR1784314.1_GAGGCGTTTCTTCTCC_AGGAGTGA 1 length=54
TTGCTTGTGACGCCTTCTTCAGGCGATGACGTACTCCG
+
CE,CFE9C,,C,BFFFFFFF9,,C++@,,;,B,,9,6,
read2 output:
@SRR1784314.1_GAGGCGTTTCTTCTCC_AGGAGTGA 2 length=80
TCTTCTCCACACCCTTCCCCACCCCCTTCCCCCTACTTTCCCCTCCCTTCCCCCTCCCCCCCTCCTCCTCTT
+
,,,,,7,6<,,;,+,8+866B,;,+7,,,5;,,,,;,,7;,,;;,8>,,,,+786,6,,,,,94
Hi, so I attempted to redownload from the master branch and ran the whitelist and extract functions with the following inputs and got the consequent error:
umi_tools whitelist --stdin /Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R1.fastq.gz --bc-pattern=CCCCCCCCNNNNNNNN --bc-pattern2=CCCCCCCC --read2-in=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R2.fastq.gz --set-cell-number 96 --log2stderr > whitelist_real.tsv
umi_tools extract --stdin=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R1.fastq --read2-in=/Volumes/PromisePegasus/Justin/170530-singleCellSeqData/hgmm_100_R2.fastq --bc-pattern=CCCCCCCCNNNNNNNN -bc-pattern2=CCCCCCCC -L out.log --read2-out=hgmm_100_R2_extracted.fastq.gz --filter-cell-barcode --whitelist=/Volumes/PromisePegasus/Justin/UMI/tool_results/whitelist_real.tsv > hgmm_100_R1_extracted.fastq.gz
extract: error: no such option: -b
Have you seen this before?
Also, this may be off-topic but can I use umi-tools to generate FPKM values after generating the cell counts?
Thanks so much!
you've missed out a -
as indicated below.
pattern=CCCCCCCCNNNNNNNN -bc-pattern2=CCCCCCCC -L out.log --read2-
^
You can you the count
command to produce per-gene read counts of mapped reads. We don't provide an option to produce FPKMs, but the conversion shouldn't be hard to do your self. To convert these to FPKM you would need to know the length of each gene. A rough and ready FPKM would then be
Thank you so much for your help, its working now!
Hi,
On the same question, we are doing plate based methods, and our reads look similarly like these:
read1 - CCCCCCCCNNNNNNNNXXX....
read2 - PPPPPPPPXXX...
The exact sequences of CCCCCCCC
and PPPPPPPP
are all known. What does the whitelist.tsv
in your exampple should look like? Should I still run umi_tools whitelist
first to generate the whitelist.tsv
first?
If I want to provide the whitelist.tsv
myself (since we know the exact sequence of CCCCCCCC
and PPPPPPPP
), what should be included?
Thank you very much.
I'm closing this issue due to inactivity. For documentation on how to provide a whitelist to umi_tools extract
, see https://umi-tools.readthedocs.io/en/latest/reference/extract.html
I'm trying to use UMI-tools to analyze a paired end scRNA seq and could use a little insight. Two, 96-well plates were used with an 8 base pair cell barcode and an 8 base pair UMI added to the 5' end of the read1 sequences and a plate index was added to the to the 5' end of the read2 sequences. The plate index was added so that all the sequencing data could be pooled but which plate each cell came from could be determined. The cells in the first plate received the plate index 5' ATTACTCG 3' and the second plate received 5' TCCGGAGA 3'. I successfully created the whitelist.txt with cell barcodes and UMIs from hgmm_100_R1.fastq but am not sure how to get the plate indices out. Is there anyway that UMI-tools can help me extract this data in addition to the UMIs and cell barcodes? If not, does anyone have any recommendations?
Thanks so much, I'm a novice at single-cell sequencing and could use some help!