Open curtisd0886 opened 1 year ago
Dear curtisd0886,
We deal with the diversity insert of BD Rhapsody Enhanced beads by demultiplexing reads by read1 linker by Cutadapt 4.0. Sample script is attached.
Rhapsody_cutadapt_EnhancedBeads.txt (updated at 08.08.2022, demultiplexing and polyT checking are simultaneously performed)
Workflow:: de-multiplexing raw reads by four parts by linker position (with polyT existence check) - cut fixed (1-3) bases from the front of read1 - concatenate trimmed reads.
Then, by using STARsolo, --soloCBmatchWLtype EditDist_2 --soloCBposition 0_0_0_8 0_13_0_21 0_26_0_34 and specify CLS1-CLS3 sequence by --soloCBwhitelist works fine (Cell barcode calling rate is similar to BD Rhapsody python script (pulled from their docker image)).
Best,
Shichino
Hi Shichino,
this is a good approach!
Hi, @s-shichino1989 Thanks a lot. Can I use your code directly? Do I need to modify anything?
Hi, @wajm
To use above code, you change ".txt" extension to ".sh", Install cutadapt v4.0 by using pip or conda. https://cutadapt.readthedocs.io/en/stable/installation.html
You copy your raw fastq.gz data at same directory of the code saved, move on to the directory by command-line (I tested the code only on Ubuntu 20.04 LTS) and then type sh Rhapsody_cutadapt_EnhancedBeads.sh Read1.fastq.gz Read2.fastq.gz available_thread_number_of_your_PC
When analysis finished, diversity insert-removed fastqs (Read1_trim.fastq.gz and Read2_trim.fastq.gz) might be produced in the same directory (might take some time, depending fastq data size) These fastqs could use as STARsolo input.
Best,
Hi, @s-shichino1989, Thank you so much. best,
Thank you guys for the feedback. I have tried the cutadapt trimming option along with using STAR with the below settings and they both seem to work equally well. I have decided to stick with using STAR along since it tends to be much faster than Cutadapt. Thanks again for the help!
--soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACA \
--soloCBmatchWLtype 1MM multi \
--soloCBposition 2_0_2_8 2_13_2_21 3_1_3_9 \
--soloUMIposition 3_10_3_17 \
--soloCBwhitelist /Data/Genomes/BD_CLS1.txt /Data/Genomes/BD_CLS2.txt /Data/Genomes/BD_CLS3.txt \
bead | universal oligo(20bp)+none/A/GT/TCA | CLS1(9bp) | Linker1(4bp) | CLS2(9bp) | Linker2(4bp) | CLS3 (9bp) | UMI (8bp) | 25dT
Hi, @curtisd0886 and @alexdobin Thank you for another option. Is there any issue about frame shift with your setting?
best, JungMo
Hi all,
I think I vote for @curtisd0886 's solution. I did not know that --soloAdapterSequence
supports N
. This is really great, as you don't need to rely on external programs. In future, some sort of regular expression might be helpful.
In terms of @alexdobin 's concern that the anchor sequence is short and the 8 bp combination may happen randomly. I don't think this is a big issue. Because when looking for the adaptor sequence, you only look for the sequence in the cell barcode reads, which is not random.
I'm not familiar with the enhanced beads, but if I assume the read starts with none/A/GT/TCA + CLS1(9bp) + Linker1(4bp) + CLS2(9bp) + Linker2(4bp) + CLS3 (9bp), and assume the the CLS sequences are the same as the 2019 version, we can generate all possible cell barcode reads with the following script:
for i in '' A GT TCA; do
for cls1 in $(cat BD_CLS1.txt); do
for cls2 in $(cat BD_CLS2.txt); do
for cls3 in $(cat BD_CLS3.txt); do
echo "${i}${cls1}GTGA${cls2}GACA${cls3}"
done
done
done
done
This generates every all possible (4 97 97 * 97) bases before UMI. The pattern GTGA[ACGT]{9}GACA
seem to appear only at the expected position, no other occurrences are observer. Of course, if adding 8bp UMI at the end, there might a chance it will appear at the unintended position, but I guess the chance is really small.
I hope I made my point clear :-)
What do you guys think ?
Thanks @wajm for sharing.
I'm not familiar with the BCR/TCR assay, but looking at the dT oligos, the universal sequence is TruSeq Read 1. Therefore, Read 1 should be like that mentioned in your previous post: none/A/GT/TCA + CLS1(9bp) + Linker1(4bp) + CLS2(9bp) + Linker2(4bp) + CLS3 (9bp) + UMI (8bp)
The variable starts of the reads do not affect the options @curtisd0886 use, because those options use the adaptor as the reference point, not the read. You can check the star manual to have a detailed look of the CB_UMI_Complex
. I made this picture to help myself to understand it. I hope the drawing is correct:
Thanks all for sharing informative information and solutions.
I also tested both Cuadapt approach and @curtisd0886 solution in our own data. When changed --soloCBmatchWLtype 1MM multi to --soloCBmatchWLtype Editdist_2, final usable read amounts of both methods were nearly identical, and STARsolo-only solution was fast and simple. I could not understand how adapter (anchor position) matching is performed by STARsolo (e.g. how to deal with insertions/deletions/mismatches), but I think @curtisd0886 solution is better than Cutadapt solution based on the resulted count data.
Best,
Shichino
Hi,all I also tested curtisd0886's settings, but I got some error informations:
_EXITING because of fatal PARAMETERS error: --soloCBmatchWLtype 1MM_multi does not work with --soloType CB_UMIComplex SOLUTION: use allowed option: use --soloCBmatchWLtype Exact (exact matches only) OR 1MM (one match with 1 mismatched base)
my STAR version is 2.7.10b, so I'm confused. Did someone could tell me what's the problem with my command?
STAR --soloType CB_UMI_Complex --soloCBwhitelist xx1.txt xx2.txt xx3.txt --soloBarcodeReadLength 0 --soloCBposition 2_0_2_8 2_13_2_21 3_1_3_9 --soloUMIposition 3_10_3_17 --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACA --soloAdapterMismatchesNmax 1 --soloCBmatchWLtype 1MM_multi --soloStrand Forward --soloMultiMappers Unique --soloUMIdedup 1MM_Directional_UMItools --soloUMIfiltering MultiGeneUMI --soloCellFilter None --soloCellReadStats Standard --outFilterMultimapNmax 1 --outMultimapperOrder Random --outSAMtype BAM Unsorted --runThreadN 16 --genomeDir xxx --outFileNamePrefix xxx --readFilesIn xxx1.fastq xxx2.fastq
Hi @lin-zhongbao
As the error message suggested, you should change the soloCBmatchWLtype
to one of the following three: 1MM
, Exact
or EditDist_2
.
1MM_multi
is not supported in complex barcodes.
Hi @dbrg77 Thanks for that! It's works. I miss the help infomation.
By the way, do you know how to set in --soloCellFilter to get similar result with the BD Rhapsody WTA pipeline's output? Which using a second derivative analysis algorithm to determine pupative cells.
Any tools or scripts could do that will also OK.
Hi @dbrg77 Thanks for that! It's works. I miss the help infomation.
By the way, do you know how to set in --soloCellFilter to get similar result with the BD Rhapsody WTA pipeline's output? Which using a second derivative analysis algorithm to determine pupative cells.
Any tools or scripts could do that will also OK.
Unfortunately, I don't have much experience on that, since we have never really tried the kit and machine by ourselves. Anyway, using their PBMC demo data (04WTA-EB-20kPBMCs
), I used the following command:
STAR --runThreadN 40 \
--genomeDir ./hg38/star_index \
--readFilesCommand zcat \
--outFileNamePrefix ./star_outs/ \
--readFilesIn 04WTA_S1_L432_R2_001.fastq.gz 04WTA_S1_L432_R1_001.fastq.gz \
--soloType CB_UMI_Complex \
--soloAdapterSequence GTGANNNNNNNNNGACA \
--soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
--soloUMIposition 3_10_3_17 \
--soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
--soloCBmatchWLtype 1MM \
--soloCellFilter EmptyDrops_CR \
--outSAMattributes CB UB \
--outSAMtype BAM SortedByCoordinate
which yields 17,655 cells, and the BD pipeline gives 19,185 cells, which is at the similar level. The cell names from their pipeline are numbers, and I don't know how to compare them to the cell barcodes in STAR, so I don't know about the overlap.
Maybe you could play with the parameters bit to see what happens. Maybe other people can help.
@s-shichino1989 suggests that --soloCBmatchWLtype EditDist_2
gives nearly identical results.
@dbrg77 @lin-zhongbao
I attatch the calculation scheme of BD Rhapsody cell barcode number and associated barcode sequences.
BDRhapsody_CellLabelSequences_Sept2017.xlsx
I don't know how BD Rhapsody's Sevenbridges Genomics Platform performs cell filtering, but we compared genes x cells matrix file created by BD Rhapsody's python script and DropletUtils package (above inflection point) and by STARsolo/DropletUtils package (above inflection point), and found that survived cells are nearly identical.
Thanks to all of you! your suggestions really helped.
However, I was wondering, is it necessary to specify the anchor for both the start and end position? Is there a scenario in which the anchor is not going to be the same for the start and the end? @alexdobin @s-shichino1989
Sorry if I'm missing something, I'm still a little bit confused with these kinds of complex barcoding technologies.
Thanks in advance!
Hello, I have been mapping Rhapsody WTA libraries using STARsolo with great success using the suggestions here however, BD just recently changed the structure of the bead barcodes which complicates things a little bit. They have added staggered nucleotides (see below) at the start of the first barcode. This shifts the reads by 0-4 nucleotides to increase diversity and reduce the amount of PhiX needed, but it also complicates finding the barcodes based solely on base counting. Any ideas on how we could use STARsolo to continue mapping these new types of CBs?