alexdobin / STAR

RNA-seq aligner
MIT License
1.77k stars 495 forks source link

Mapping with STARsolo and BD Rhapsody Enhanced Beads #1607

Open curtisd0886 opened 1 year ago

curtisd0886 commented 1 year ago

Hello, I have been mapping Rhapsody WTA libraries using STARsolo with great success using the suggestions here however, BD just recently changed the structure of the bead barcodes which complicates things a little bit. They have added staggered nucleotides (see below) at the start of the first barcode. This shifts the reads by 0-4 nucleotides to increase diversity and reduce the amount of PhiX needed, but it also complicates finding the barcodes based solely on base counting. Any ideas on how we could use STARsolo to continue mapping these new types of CBs?

Screen Shot 2022-07-22 at 9 25 20 AM
s-shichino1989 commented 1 year ago

Dear curtisd0886,

We deal with the diversity insert of BD Rhapsody Enhanced beads by demultiplexing reads by read1 linker by Cutadapt 4.0. Sample script is attached.

Rhapsody_cutadapt_EnhancedBeads.txt (updated at 08.08.2022, demultiplexing and polyT checking are simultaneously performed)

Workflow:: de-multiplexing raw reads by four parts by linker position (with polyT existence check) - cut fixed (1-3) bases from the front of read1 - concatenate trimmed reads.

Then, by using STARsolo, --soloCBmatchWLtype EditDist_2 --soloCBposition 0_0_0_8 0_13_0_21 0_26_0_34 and specify CLS1-CLS3 sequence by --soloCBwhitelist works fine (Cell barcode calling rate is similar to BD Rhapsody python script (pulled from their docker image)).

Best,

Shichino

alexdobin commented 1 year ago

Hi Shichino,

this is a good approach!

wajm commented 1 year ago

Hi, @s-shichino1989 Thanks a lot. Can I use your code directly? Do I need to modify anything?

s-shichino1989 commented 1 year ago

Hi, @wajm

To use above code, you change ".txt" extension to ".sh", Install cutadapt v4.0 by using pip or conda. https://cutadapt.readthedocs.io/en/stable/installation.html

You copy your raw fastq.gz data at same directory of the code saved, move on to the directory by command-line (I tested the code only on Ubuntu 20.04 LTS) and then type sh Rhapsody_cutadapt_EnhancedBeads.sh Read1.fastq.gz Read2.fastq.gz available_thread_number_of_your_PC

When analysis finished, diversity insert-removed fastqs (Read1_trim.fastq.gz and Read2_trim.fastq.gz) might be produced in the same directory (might take some time, depending fastq data size) These fastqs could use as STARsolo input.

Best,

wajm commented 1 year ago

Hi, @s-shichino1989, Thank you so much. best,

curtisd0886 commented 1 year ago

Thank you guys for the feedback. I have tried the cutadapt trimming option along with using STAR with the below settings and they both seem to work equally well. I have decided to stick with using STAR along since it tends to be much faster than Cutadapt. Thanks again for the help!

    --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACA \
    --soloCBmatchWLtype 1MM multi \
    --soloCBposition 2_0_2_8 2_13_2_21 3_1_3_9 \
    --soloUMIposition 3_10_3_17 \
    --soloCBwhitelist /Data/Genomes/BD_CLS1.txt /Data/Genomes/BD_CLS2.txt /Data/Genomes/BD_CLS3.txt \
wajm commented 1 year ago

bead | universal oligo(20bp)+none/A/GT/TCA | CLS1(9bp) | Linker1(4bp) | CLS2(9bp) | Linker2(4bp) | CLS3 (9bp) | UMI (8bp) | 25dT

Hi, @curtisd0886 and @alexdobin Thank you for another option. Is there any issue about frame shift with your setting?

best, JungMo

dbrg77 commented 1 year ago

Hi all,

I think I vote for @curtisd0886 's solution. I did not know that --soloAdapterSequence supports N. This is really great, as you don't need to rely on external programs. In future, some sort of regular expression might be helpful.

In terms of @alexdobin 's concern that the anchor sequence is short and the 8 bp combination may happen randomly. I don't think this is a big issue. Because when looking for the adaptor sequence, you only look for the sequence in the cell barcode reads, which is not random.

I'm not familiar with the enhanced beads, but if I assume the read starts with none/A/GT/TCA + CLS1(9bp) + Linker1(4bp) + CLS2(9bp) + Linker2(4bp) + CLS3 (9bp), and assume the the CLS sequences are the same as the 2019 version, we can generate all possible cell barcode reads with the following script:

for i in '' A GT TCA; do
    for cls1 in $(cat BD_CLS1.txt); do
        for cls2 in $(cat BD_CLS2.txt); do
            for cls3 in $(cat BD_CLS3.txt); do
                echo "${i}${cls1}GTGA${cls2}GACA${cls3}"
            done
        done
    done
done

This generates every all possible (4 97 97 * 97) bases before UMI. The pattern GTGA[ACGT]{9}GACA seem to appear only at the expected position, no other occurrences are observer. Of course, if adding 8bp UMI at the end, there might a chance it will appear at the unintended position, but I guess the chance is really small.

I hope I made my point clear :-)

What do you guys think ?

dbrg77 commented 1 year ago

Thanks @wajm for sharing.

I'm not familiar with the BCR/TCR assay, but looking at the dT oligos, the universal sequence is TruSeq Read 1. Therefore, Read 1 should be like that mentioned in your previous post: none/A/GT/TCA + CLS1(9bp) + Linker1(4bp) + CLS2(9bp) + Linker2(4bp) + CLS3 (9bp) + UMI (8bp)

The variable starts of the reads do not affect the options @curtisd0886 use, because those options use the adaptor as the reference point, not the read. You can check the star manual to have a detailed look of the CB_UMI_Complex. I made this picture to help myself to understand it. I hope the drawing is correct:

Screenshot 2022-08-10 at 11 05 51
s-shichino1989 commented 1 year ago

Thanks all for sharing informative information and solutions.

I also tested both Cuadapt approach and @curtisd0886 solution in our own data. When changed --soloCBmatchWLtype 1MM multi to --soloCBmatchWLtype Editdist_2, final usable read amounts of both methods were nearly identical, and STARsolo-only solution was fast and simple. I could not understand how adapter (anchor position) matching is performed by STARsolo (e.g. how to deal with insertions/deletions/mismatches), but I think @curtisd0886 solution is better than Cutadapt solution based on the resulted count data.

Best,

Shichino

lin-zhongbao commented 1 year ago

Hi,all I also tested curtisd0886's settings, but I got some error informations:

_EXITING because of fatal PARAMETERS error: --soloCBmatchWLtype 1MM_multi does not work with --soloType CB_UMIComplex SOLUTION: use allowed option: use --soloCBmatchWLtype Exact (exact matches only) OR 1MM (one match with 1 mismatched base)

my STAR version is 2.7.10b, so I'm confused. Did someone could tell me what's the problem with my command?

STAR --soloType CB_UMI_Complex --soloCBwhitelist xx1.txt xx2.txt xx3.txt --soloBarcodeReadLength 0 --soloCBposition 2_0_2_8 2_13_2_21 3_1_3_9 --soloUMIposition 3_10_3_17 --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACA --soloAdapterMismatchesNmax 1 --soloCBmatchWLtype 1MM_multi --soloStrand Forward --soloMultiMappers Unique --soloUMIdedup 1MM_Directional_UMItools --soloUMIfiltering MultiGeneUMI --soloCellFilter None --soloCellReadStats Standard --outFilterMultimapNmax 1 --outMultimapperOrder Random --outSAMtype BAM Unsorted --runThreadN 16 --genomeDir xxx --outFileNamePrefix xxx --readFilesIn xxx1.fastq xxx2.fastq

dbrg77 commented 1 year ago

Hi @lin-zhongbao

As the error message suggested, you should change the soloCBmatchWLtype to one of the following three: 1MM, Exact or EditDist_2.

1MM_multi is not supported in complex barcodes.

lin-zhongbao commented 1 year ago

Hi @dbrg77 Thanks for that! It's works. I miss the help infomation.

By the way, do you know how to set in --soloCellFilter to get similar result with the BD Rhapsody WTA pipeline's output? Which using a second derivative analysis algorithm to determine pupative cells.

Any tools or scripts could do that will also OK.

dbrg77 commented 1 year ago

Hi @dbrg77 Thanks for that! It's works. I miss the help infomation.

By the way, do you know how to set in --soloCellFilter to get similar result with the BD Rhapsody WTA pipeline's output? Which using a second derivative analysis algorithm to determine pupative cells.

Any tools or scripts could do that will also OK.

Unfortunately, I don't have much experience on that, since we have never really tried the kit and machine by ourselves. Anyway, using their PBMC demo data (04WTA-EB-20kPBMCs), I used the following command:

STAR --runThreadN 40 \
     --genomeDir ./hg38/star_index \
     --readFilesCommand zcat \
     --outFileNamePrefix ./star_outs/ \
     --readFilesIn 04WTA_S1_L432_R2_001.fastq.gz 04WTA_S1_L432_R1_001.fastq.gz \
     --soloType CB_UMI_Complex \
     --soloAdapterSequence GTGANNNNNNNNNGACA \
     --soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
     --soloUMIposition 3_10_3_17 \
     --soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
     --soloCBmatchWLtype 1MM \
     --soloCellFilter EmptyDrops_CR \
     --outSAMattributes CB UB \
     --outSAMtype BAM SortedByCoordinate

which yields 17,655 cells, and the BD pipeline gives 19,185 cells, which is at the similar level. The cell names from their pipeline are numbers, and I don't know how to compare them to the cell barcodes in STAR, so I don't know about the overlap.

Maybe you could play with the parameters bit to see what happens. Maybe other people can help.

@s-shichino1989 suggests that --soloCBmatchWLtype EditDist_2 gives nearly identical results.

s-shichino1989 commented 1 year ago

@dbrg77 @lin-zhongbao

I attatch the calculation scheme of BD Rhapsody cell barcode number and associated barcode sequences.

BDRhapsody_CellLabelSequences_Sept2017.xlsx

I don't know how BD Rhapsody's Sevenbridges Genomics Platform performs cell filtering, but we compared genes x cells matrix file created by BD Rhapsody's python script and DropletUtils package (above inflection point) and by STARsolo/DropletUtils package (above inflection point), and found that survived cells are nearly identical.

MartaBenegas commented 1 year ago

Thanks to all of you! your suggestions really helped.

However, I was wondering, is it necessary to specify the anchor for both the start and end position? Is there a scenario in which the anchor is not going to be the same for the start and the end? @alexdobin @s-shichino1989

Sorry if I'm missing something, I'm still a little bit confused with these kinds of complex barcoding technologies.

Thanks in advance!