alexdobin / STAR

RNA-seq aligner
MIT License

STARsolo: support for multiple (3) barcode locations #838

Open ghuls opened 4 years ago

ghuls commented 4 years ago

Hi Alex,

We have a custom in-house design that requires 3 separate barcode locations (10 bp each), separated by 2 adapters, with a different whitelist for each barcode.

Read 2:

[BC1]-CAGCTACTGC-[BC2]-CGAGTACCCT-[BC3]-[UMI]

with:

Any chance to get support for something like this in STARsolo?

alexdobin commented 4 years ago

Hi Gert,

Sorry for the belated reply. This is already supported with --soloType CB_UMI_Complex. For your geometry, the CB/UMI geometry parameters should be, I think, --soloCBposition 0_0_0_10 0_21_0_30 0_41_0_50 and --soloUMIposition 0_51_0_58. You would also need to provide 3 whitelist files: --soloCBwhitelist wl1 wl2 wl3

Please let me know if you have any issues with these parameters. If they do not work, I could look at a few thousand reads to tweak them.

Cheers Alex

fderop commented 4 years ago

Hi Alex,

We have tested this and it does indeed work. We used the settings --soloCBposition 0_0_0_9 0_20_0_29 0_40_0_49 however.

Florian
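
For reference, a minimal sketch of how these 0-based, inclusive positions follow from the read geometry in this thread (three 10 bp barcodes separated by two 10 bp adapters). The position format is startAnchor_startPosition_endAnchor_endPosition with anchor 0 meaning the read start. `solo_position` is a hypothetical helper, not part of STAR, and the 8-base UMI length is only inferred from the UMI positions quoted in this thread.

```shell
#!/bin/sh
# solo_position START LENGTH
# Prints a STARsolo position string anchored at the read start (anchor 0),
# using 0-based, inclusive start and end positions.
# Hypothetical helper for illustration; not part of STAR.
solo_position() {
    start=$1    # 0-based offset of the segment's first base
    length=$2   # segment length in bases
    echo "0_${start}_0_$(( start + length - 1 ))"
}

# Geometry from this thread: BC1(10) adapter(10) BC2(10) adapter(10) BC3(10) UMI
bc_len=10
adapter_len=10
offset=0
cb_positions=""
for i in 1 2 3; do
    cb_positions="${cb_positions}${cb_positions:+ }$(solo_position "$offset" "$bc_len")"
    offset=$(( offset + bc_len + adapter_len ))
done
echo "--soloCBposition ${cb_positions}"

# The UMI starts right after BC3, at offset 50 (assumed 8 bases long).
echo "--soloUMIposition $(solo_position 50 8)"
```

The end position is `start + length - 1` because both ends are inclusive, which is why a 10 bp barcode starting at 0 ends at 9 rather than 10.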

ghuls commented 4 years ago

@alexdobin

Would it be possible to write the corrected cell barcode to the SAM attributes too?

We use the following settings:

    STAR \
        --runThreadN 8 \
        --runMode alignReads \
        --outSAMtype BAM SortedByCoordinate \
        --sysShell /bin/bash \
        --genomeDir "${star_reference_dir}" \
        --readFilesIn "${fastq_R1_filename}" "${fastq_R2_filename}" \
        --readFilesCommand 'gzip -c -d' \
        --soloCBwhitelist "${whitelist_part1_filename}" "${whitelist_part2_filename}" "${whitelist_part3_filename}" \
        --soloType CB_UMI_Complex \
        --soloCBposition 0_0_0_9 0_20_0_29 0_40_0_49 \
        --soloUMIlen 2 \
        --soloUMIposition 0_50_0_51 \
        --sjdbGTFfile "${gft_filename}" \
        --soloCellFilter None \
        --soloCBmatchWLtype 1MM \
        --outSAMattributes CB UB CR CY UR UY \
        --outFileNamePrefix "${bam_filename%bam}"

alexdobin commented 4 years ago

Hi Florian,

You are right, the positions are 0-based; I had to check my code to make sure. :( And you used --soloUMIposition 0_50_0_57, right? I will update the manual to clarify this.

Thanks! Alex

fderop commented 4 years ago

Hello Alex,

Our library does not have a UMI, so we had to input a dummy UMI setting to make STARsolo work. I believe we used --soloUMIposition 0_0_0_1, where these two bases are not really random. I was getting core dumps if I did not enter a UMI position.

Florian

alexdobin commented 4 years ago

Hi Gert,

The CB tags do not presently work for the "complex" CB_UMI barcodes... Would you want to output the concatenated 30b sequence? I can implement it, though it will not be quick. In the meantime, it might be easiest to preprocess the CB/UMI read into a 30b CB + dummy UMI sequence and use it as a simple CB_UMI barcode (with the whitelist being the Cartesian product of the 3 whitelists), which would allow the output of CB in the BAM file.

Cheers Alex
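
A rough sketch of that preprocessing, assuming the geometry discussed in this thread (barcodes at 0-based positions 0-9, 20-29 and 40-49 of the barcode read). Both function names are hypothetical, and instead of appending a dummy UMI this version keeps whatever bases follow BC3, so a real UMI survives if the library has one.

```shell
#!/bin/sh
# combine_whitelists WL1 WL2 WL3
# Prints every BC1+BC2+BC3 concatenation (Cartesian product of the three lists).
# Hypothetical helper; expects one barcode per line in each file.
combine_whitelists() {
    awk '
        FNR == 1 { f++ }                    # count which input file we are in
        NF == 0  { next }                   # skip blank lines
        f == 1   { a[++na] = $1; next }     # remember whitelist 1
        f == 2   { b[++nb] = $1; next }     # remember whitelist 2
        {                                   # stream whitelist 3
            for (i = 1; i <= na; i++)
                for (j = 1; j <= nb; j++)
                    print a[i] b[j] $1
        }
    ' "$1" "$2" "$3"
}

# collapse_barcode_read < barcode.fastq > collapsed.fastq
# Drops the two adapters so each read becomes BC1+BC2+BC3 followed by the
# bases after BC3. Sequence and quality lines are cut identically.
collapse_barcode_read() {
    awk '
        function pick(s) {
            # awk substr() is 1-based: 0-based 0-9, 20-29, 40-49, then 50 onward
            return substr(s, 1, 10) substr(s, 21, 10) substr(s, 41, 10) substr(s, 51)
        }
        NR % 4 == 2 || NR % 4 == 0 { print pick($0); next }
        { print }
    '
}

# Example usage (file names are placeholders):
#   combine_whitelists wl1.txt wl2.txt wl3.txt > whitelist_30b.txt
#   gzip -c -d barcode_R2.fastq.gz | collapse_barcode_read | gzip > barcode_R2_30b.fastq.gz
```

The collapsed read can then be described with a single cell barcode starting at the first base and spanning 30 bases.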

alexdobin commented 4 years ago

Hi Florian

On second thought, I do not think replacing the UMI with a constant 2b sequence is going to work, as all of the reads will "collapse" into 1 read, so you will have no more than 1 read per cell. I will need to implement an option to count all reads without collapsing UMIs.

Cheers Alex
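
To illustrate the collapsing, here is a toy model of UMI deduplication: it counts unique (cell barcode, gene, UMI) triples, with no error correction. This is a simplified stand-in, not STARsolo's actual counting code, and the `count_umis` name and the three-column input format are assumptions made for the example.

```shell
#!/bin/sh
# count_umis: reads "cell_barcode gene umi" lines on stdin (one per read) and
# prints cell, gene and the number of distinct UMIs seen for that pair.
# Toy model only; real UMI counting also handles sequencing errors.
count_umis() {
    awk '
        !seen[$1, $2, $3]++ { n[$1 "\t" $2]++ }    # first time this triple appears
        END { for (k in n) print k "\t" n[k] }
    ' | sort
}

# Three reads from one cell and one gene, all carrying the same dummy UMI:
printf 'CELL1 GeneA AA\nCELL1 GeneA AA\nCELL1 GeneA AA\n' | count_umis

# The same three reads with distinct UMIs:
printf 'CELL1 GeneA AA\nCELL1 GeneA AC\nCELL1 GeneA AG\n' | count_umis
```

In this model the constant UMI yields a count of 1 for the cell and gene regardless of how many reads were sequenced, while distinct UMIs are counted separately.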

fderop commented 4 years ago

Hello Alex,

We have also previously considered pre-processing the barcode read and using simple CB/UMI. Our current design uses 96x96x96 barcode possibilities (some 880k unique barcodes), and a simple CB/UMI approach would work well. However, we might consider scaling up to 384x384x384 in the future. If I understand correctly, working with a barcode whitelist that spans 56m possibilities could be computationally challenging, which is why CB/UMI complex is so attractive.

We are currently not concerned about the collapsing of reads in the expression matrix since we are mostly interested in the .bam file, but the option to run STARsolo without UMI might be helpful to demultiplex single cell sequencing libraries not stemming from scRNA-seq experiments in the future.

Florian
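
The whitelist sizes quoted above come from cubing the number of barcodes per position. A quick check, which also estimates the size of a plain-text combined whitelist under the assumption of one 30-base barcode plus a newline per line (`whitelist_size` is a hypothetical helper):

```shell
#!/bin/sh
# whitelist_size N: number of combined barcodes for three positions
# with N possible barcodes each.
whitelist_size() {
    echo $(( $1 * $1 * $1 ))
}

small=$(whitelist_size 96)     # current design
large=$(whitelist_size 384)    # possible future design
echo "96^3  = ${small} combined barcodes"
echo "384^3 = ${large} combined barcodes"

# Plain-text file size: 30 bases + 1 newline = 31 bytes per entry (assumption).
echo "384^3 whitelist file: $(( large * 31 )) bytes"
```

This gives 884,736 and 56,623,104 combined barcodes, so the larger whitelist would be a text file of roughly 1.8 GB.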

alexdobin commented 4 years ago

Hi Florian,

actually, the 56m barcode list should not create serious problems, so I would try creating the 30b cell barcode, as it's generally easier to handle. Then, if you are interested in the CB tag only, you can use --soloType CB_samTagOut option (together with --soloCBmatchWLtype 1MM), which will skip the UMI counting.

Cheers Alex
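
A sketch of what such a run might look like once the barcode read has been collapsed to a 30-base cell barcode. The function only prints the command so it can be reviewed first. All paths are placeholders, and `--soloCBstart`, `--soloCBlen` and the choice of `--outSAMattributes` are not taken from this thread, so please check them against the STAR manual for your version.

```shell
#!/bin/sh
# star_cb_tag_cmd GENOME_DIR CDNA_FASTQ BARCODE_FASTQ WHITELIST
# Prints (does not run) a STAR command that tags reads with the cell barcode
# and skips UMI counting. Hypothetical helper; arguments are placeholders.
star_cb_tag_cmd() {
    cat <<EOF
STAR \\
    --runMode alignReads \\
    --genomeDir "$1" \\
    --readFilesIn "$2" "$3" \\
    --readFilesCommand 'gzip -c -d' \\
    --soloType CB_samTagOut \\
    --soloCBwhitelist "$4" \\
    --soloCBmatchWLtype 1MM \\
    --soloCBstart 1 \\
    --soloCBlen 30 \\
    --outSAMtype BAM SortedByCoordinate \\
    --outSAMattributes CR CB
EOF
}

star_cb_tag_cmd "/path/to/genome" "cdna_R1.fastq.gz" "barcode_R2_30b.fastq.gz" "whitelist_30b.txt"
```

If the barcode read is longer than 30 bases, STAR's barcode read length check may also need adjusting; the manual describes the relevant option.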

ghuls commented 4 years ago

@alexdobin In the past, a 30 bp barcode definitely would not work with STAR, as the longest CB supported by STAR is/was 16 bp (due to the use of a 32-bit integer, at 2 bits per base). So barcodes could collapse (in the past, all nucleotides except the last 16 would be A when written out after being decoded from a 32-bit integer).

Does the code use a 64 bit integer already for the CB?

See an old pull request: https://github.com/alexdobin/STAR/pull/588
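
The 16 bp limit follows from packing each base into 2 bits: 16 bases fill 32 bits, while 30 bases need 60 bits and therefore a 64-bit integer. The sketch below is an illustrative encoding (A=0, C=1, G=2, T=3), not STAR's actual code, and it assumes a shell with 64-bit arithmetic. It shows how truncating to 32 bits makes the leading bases decode as A.

```shell
#!/bin/sh
# encode_barcode SEQ WIDTH
# Packs a DNA string into an integer, 2 bits per base (A=0, C=1, G=2, T=3).
# WIDTH 32 simulates a 32-bit integer by masking after every shift.
encode_barcode() {
    seq=$1; width=$2; code=0
    while [ -n "$seq" ]; do
        base=${seq%"${seq#?}"}    # first character of $seq
        seq=${seq#?}              # drop that character
        case $base in
            A) v=0 ;; C) v=1 ;; G) v=2 ;; T) v=3 ;;
            *) echo "unexpected base: $base" >&2; return 1 ;;
        esac
        code=$(( (code << 2) | v ))
        if [ "$width" -eq 32 ]; then
            code=$(( code & 4294967295 ))   # keep only the low 32 bits
        fi
    done
    echo "$code"
}

# decode_barcode CODE LENGTH
# Unpacks an integer back into a DNA string of the requested length.
decode_barcode() {
    code=$1; len=$2; out=""; i=0
    while [ "$i" -lt "$len" ]; do
        case $(( code & 3 )) in
            0) c=A ;; 1) c=C ;; 2) c=G ;; 3) c=T ;;
        esac
        out="${c}${out}"
        code=$(( code >> 2 ))
        i=$(( i + 1 ))
    done
    echo "$out"
}

barcode=ACGTACGTACGTACGTACGTACGTACGTAC   # 30 bases, made up for illustration
decode_barcode "$(encode_barcode "$barcode" 64)" 30   # round-trips intact
decode_barcode "$(encode_barcode "$barcode" 32)" 30   # leading 14 bases decode as A
```

With the 32-bit mask, any two barcodes that share their last 16 bases map to the same integer, which is the collapse described above.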

fderop commented 4 years ago

I can confirm that a 30 bp cell barcode works with CB_UMI_Simple.

alexdobin commented 4 years ago

Hi Gert,

I pulled in your request in 2.7.1a, so it should work now. Thanks for the confirmation, Florian!

Cheers Alex