Closed crazyhottommy closed 4 months ago
Thanks for the report, @crazyhottommy! Pinging @DongzeHE here to help work toward a quick resolution :).
Hello @crazyhottommy,
Thank you very much for choosing alevin-fry.
Yes, processing 10X Feature barcoding is a little bit complicated because it uses a different set of barcodes than GEX (see here). Therefore, you could not directly use the 10x whitelist, in your case, passing --unfiltered-pl
without specifying a path, when calling alevin-fry/simpleaf. There are two ways to overcome this:
CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding conjugates_RevA.txt
on this page, pass this to --unfiltered-pl
, and later map feature barcodes (the second column) to 10x barcodes (the first column), or vise versa, using this "translation" file (https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz). By doing so, you will get two lists of unfiltered barcodes representing the two modalities. You can then take the overlap and then call high-quality cells using emptyDrops or any methods you like.Hope the above can help you to solve the problem. Please let us know if there are any other questions!
Best, Dongze
Thanks @rob-p and @DongzeHE for the quick reply! I am aware that the feature barcode is different from the RNA. I previously wrote a series of posts:
https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/
For this PBMC dataset, I did not provide the --unfiltered-pl with a path for the ADT data, and it worked fine and later I translated the barcode as you suggested shown below. That dataset is 3' 10xV3 though (note, I specified --expected-ori fw)
Thanks for your answer, and I will try it again with your suggestion.
PS, I did read the Solution 2 tutorial before, and from "only" my experience, I still prefer to use explicit commands rather than a pre-configured workflow:)
Best, Tommy
Oh Ok, if it is CITE-seq instead of 10X feature barcode, then the problem might not come from the barcodes.
@rob-p : can simpleaf process 10x5' v2 data now?
Hi @crazyhottommy, after going through your command again, there are two things we can check:
--expected-ori rc
when processing ADT because based on my experience, the orientation of feature barcodes is forward. Could you try --expected-ori fw
?--chemistry 10xv2
, I guess simpleaf will try to map the entire read2 against the index. Also, I noticed that almost all read2s end with AGATCTGAAGAGAGTCG
or similar sequences with a very small hamming distance. If the data is from an assay that uses short feature barcodes, for example, the one described at here, as the read2s include the feature barcode and other oligonucleotides, I think we need to make sure that only the "barcode" part is used for mapping. Let me know if there are any doubts. Thank you very much!
Best, Dongze
Right! So we can process 5β data, but currently only read 1 is used for biological mapping. Incorporating paired end constraints from what remains of read 2 is an imminent feature and should land soon, but right now the thing to do is treat the barcode read in 5β as essentially fully technical. We should have an FAQ about this (and update it as new features land).
Thanks, @rob-p and @DongzeHE Things I have tried:
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori rc --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_quant
--expected-ori fw
Btw, at least for the RNA data, I read https://github.com/COMBINE-lab/alevin-fry/issues/118 for 10x5'v2, specify -d rc. https://github.com/COMBINE-lab/salmon/discussions/674without whitelist
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
This gives me 70019 unfiltered cells, which seems to be right.
and with whitelist:
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_fw_quant
This gives me 0 cells too.
@DongzeHE "I saw that the read length of your read2 file is 90. As you specified --chemistry 10xv2, I guess simpleaf will try to map the entire read2 against the index."
Read2 is 90 bases, and the feature barcode is short:
cat adt.tsv
CCR7 AGTTCAGTCAACCGA
CD45RA TCAATCCTTCCGCTT
CD45RO CTCCGAATCATGTTG
CD161 GTACGCAGTCCTTCT
CD8A GCTGCGCTTTCCATT
IgG CTGGAGCGATTAGAA
PD1 ACAGCGCCGTATTTA
How should I specify the command to restrict it to only the feature barcode length?
@rob-p "the thing to do is treat the barcode read in 5β as essentially fully technical. " How should one specify it in the command?
It is a bit confusing for me to specify the right arguments for different technologies, 10xv2, 10xv3, it will be great to have a FAQ on this!
Thanks again for this awesome tool!
Tommy
Hi @crazyhottommy, cool! looks like we are very close to the answer! Based on my understanding of this excellent post, simpleaf can handle 10x Chromium V5 VDJ because it is very similar to the 3' kits. It could be the case that we need to customize it, for example specifying rc
as the orientation when processing the GEX modality. You can confirm this by specifying orientation as fw
rc
and both
and comparing them. I will also do an investigation by myself once I get some free cycles. (Job hunting is exhausting ;P)
use the white list, but now it gives me no cells...
This means that the feature barcode file only works for 10X barcode assays, not CITE-seq. I apologize for the wrong information I provided. We should not use the feature barcode file provided by 10X.
This gives me 70019 unfiltered cells, which seems to be right.
The increased number of unfiltered cells when specifying the orientation as forward suggests that the problem actually comes from the orientation. However, I am still unsure if mapping the whole read2 to the index makes sense and why the reads can be mapped if they are longer than the indexed sequences. Could you please check the mapping rate? you can find this in the log (json) file exported by simpleaf quant
.
@rob-p What do you think?
How should I specify the command to restrict it to only the feature barcode length?
Unfortunately, I think we need to do this using a custom command. To be specific, according to the structure of the feature library, feature barcode starts at the 11th base and of length 15. I confirmed with the read2 file you shared.
Therefore, what we can do is write a small awk command to take the part we want. Here I show an example.
zcat read2.fastq.gz | awk '{if (NR%4==2 || NR%4 == 0) {print substr($0,11,15)} else {print $0}}' > read2_fb_only.fastq
gzip read1_fb_only.fastq
By doing so, you will get a new file named read2_fb_only.fastq.gz
containing only the feature barcode sequences.
The processed read2 will look like this
I am sure that this file can be correctly processed by simpleaf now!
Best, Dongze
@DongzeHE : why can't we use a custom geometry flag, and specify the remainder of the read as x:
. Your solution will likely work, but we can handle this already with custom geometry I think and without the need for external processing with awk.
Great, we are making progress! @DongzeHE good luck with your job hunting and thanks for helping out!
Let me clarify:
This is the 10xv3 pbmc protein ADT fastq R2 reads downloaded from https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_fastqs.tar as in the alevin tutorial
@A00228:290:H3FVWDRXX:1:1101:22923:1047 2:N:0:CAGTACTG
TTTCTATCAAGAAAGTCAAAGCACTGCGTTGGTTGCTTTAAGGCCGGTCCTAGCAATCAAAGTATTATGCTTTCGACCCAATACCTGTCTC
+
FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:9661:1063 2:N:0:CAGTACTG
GGGTGTTCACCCCTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFF:::,,,FFFFFFFFFF
zless -S pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz | head -2 | tail | wc -L
91
It is 91 bases.
cat adt.tsv
CD3 AACAAGACCCTTGAG
CD4 TACCCGTAATAGCGT
CD8a ATTGGCACTCAGATG
CD14 GAAAGTCAAAGCACT
CD15 ACGAATCAATCTGTG
CD16 GTCTTTGTCAGTGCA
CD56 GTTGTCCGACAATAC
CD19 TCAACGCTTGGCTAG
CD25 GTGCATTCAACAGTA
CD45RA GATGAGAACAGGTTT
CD45RO TGCATGTCATCGGTG
PD-1 AAGTCGTGAGGCATG
TIGIT TGAAGGCTCATTTGT
CD127 ACATTGACGCAACTA
IgG2a CTCTATTCAGACCAG
IgG1 ACTCACTGGAGTCTC
IgG2b ATCACATCGTTGCCA
The CD3 feature barcode is in the middle of R2 reads
and I used
simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 16 \
--index $AF_SAMPLE_DIR/data/adt_index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $AF_SAMPLE_DIR/data/t2g_adt.tsv \
--output alevin_adt
in https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/ without any problems.
Why in 10x5' v2 (my case), one needs to substr
the feature barcode as you showed using awk
?
The output by specifying --expected-or fw
without using the whitelist looks good to me, I checked the mtx file and I get many non-zero counts.
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
btw, which log should I check for the mapping rate?
ls *
simpleaf_quant_log.json
af_map:
alevin aux_info cmd_info.json libParams logs map.rad unmapped_bc_count.bin
af_quant:
alevin featureDump.txt map.collated.rad permit_map.bin unmapped_bc_count_collated.bin
collate.json generate_permit_list.json permit_freq.bin quant.json
Thanks again!
my current conclusions:
for 10x5' v2, for the RNA, you need to specify --expected-ori rc
for 10x5' v2, for the protein, you need to specify --expected-ori fw
for 10x3' v3 for the RNA and protein, you need to specify --expected-ori fw
.
Although I still need to understand why it is the case...
Hi @rob-p, here the problem is that the actual feature barcodes that should be used for mapping are contained within the 90 bases' Read2. can we use customer geometry flag to process that?
just updated my answer in this thread:)
Hi @crazyhottommy,
Why in 10x5' v2 (my case), one needs to substr the feature barcode as you showed using awk?
Yes, the results look good to me. The confusion is from me about how salmon maps sequences that are longer than the indexed sequences. So, @rob-p : say we indexed some feature barcodes of length 15. In a later stage, we need to map Read2s (of length 90) against the index, how does salmon handle this?
btw, which log should I check for the mapping rate?
To check the mapping rate, in your case, you should check this file:
/mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant/af_map/salmon_quant.log
If, in the future, you use piscem
, which is the latest (and faster) mapper we provide in our ecosystem, you can find the log in simpleaf_quant_output_dir/af_map/map_info.json
my current conclusions:
I agree with your conclusions and have the same confusion. From their chemistry specification (v3 and v5), I cannot tell why we need to use rc
for the 5' assay. I will also investigate this once I have free cycles.
Thank you so much for bearing with us. We will work on providing FAQs soon.
Best, Dongze
So, @rob-p : say we indexed some feature barcodes of length 15. In a later stage, we need to map Read2s (of length 90) against the index, how does salmon handle this?
@DongzeHE: Yes; we can do this. If I understand properly, you're saying the read is longer than it needs to be (90bp), when we need to map only 15bp of it? In that case we can use a custom geometry that specifies that the read consists of only the length 15 sequence of interest. Something like 2{x[10]r[15]x:}
to designate we should skip the first 10 bases, use the next 15 as the read, and discard whatever remains. Does that make sense?
Ahhh, right! I lost my mindπ
. So in this case, we can do 1{b[16]u[10]x:}2{x[10]r[15]x:}
and pass this to --chemistry
. We do not need to use the awk command.
Then, from the example provided by @crazyhottommy, when the indexed feature barcodes are 15bp and read2s are 91bp, providing --chemistry 10xv3
without saying the configuration of read2 was enough for salmon to map reads. What's the magic happening there?
So I'm not sure if the underlying mapper being used here is salmon or piscem. But either way, the mapping process is pretty robust to "aberrant" sequence. So as long as the rest of the read isn't spuriously mapping against the (feature barcode) index, those reads are probably still getting mapped. Regardless, we should probably ignore the parts of the reads we know aren't meaningful anyway ;P.
Yup yup! Thank you very much for the explanations! π
okay, this is very helpful! so the correct command should be:
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:} --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
somehow specifying --chemistry 10xv2 works too.
UPDATE Feb 20, 2024. This is not right. The counts are very big for many of the ADTs.
specify the chemistry using the geometry:
simpleaf quant \
--reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz \
--reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:} --resolution cr-like --expected-ori fw \
--unfiltered-pl /mnt/disks/tommy/af_test_workdir/737K-august-2016.txt \
--t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv \
--output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
One has to specify the 10x5' v2 whitelist explicitly. https://[kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist](https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist)
Also, I am curious how the parameters for ADT quantification are specified at https://combine-lab.github.io/alevin-fry-tutorials/2023/running-simpleaf-workflow/ I do not see how one specifies the --chemistry
Hi Alevin-fry developers,
I am using simpleaf (simpleaf 0.14.1) to quantify the protein ADT data. somehow, only 109 cells in the final quantification. The corresponding RNA data gives me 10K cells. Anything I am doing wrong here?
It is 10x5' v2 data
This is the command:
version:
This is the log
Files to reproduce the results
Fastq files can be found at: https://drive.google.com/drive/folders/1diN0ybVo1mha27mvARYrAzAOqUoaM6KP?usp=sharing
Let me if you need any other information. Thanks a lot for looking into this!
Best, Tommy