alexdobin / STAR

RNA-seq aligner
MIT License
1.83k stars 503 forks source link

STARSolo InDrop V3? #825

Open dakota-hawkins opened 4 years ago

dakota-hawkins commented 4 years ago

Hi Alex,

Thanks so much for all of your work toward STAR. It's my preferred aligner, and really I have nothing but great things to say about it.

I had a quick question regarding whether STARSolo can work with InDrop V3 reads? Per #605 it does appear STARSolo can work with version 1 and version 2 InDrop data, where there are 2 read files, but I was unsure if this would also work with V3 where there are 4 distinct read files per library (https://github.com/indrops/indrops)?

Thanks again,

Dakota

alexdobin commented 4 years ago

Hi Dakota,

it will not work with 4 files out of the box. You would need to combine all the barcodes into one read. Then the --soloType CB_UMI_Complex may be able to deal with it, as it has flexible positions of the barcodes. If you can explain how the barcodes look like, I could help with finding the right parameters.

Cheers Alex

dakota-hawkins commented 4 years ago

Hi Alex,

Thanks so much for your response! The format is definitely a bit complex, but I'll try to explain to the best of my understanding how the barcodes are formatted.

To begin, per the indrops repo (https://github.com/indrops/indrops) the four read files are formatted as follow:

R1 is the biological read

Example Read:

@NS500422:374:HFLC7BGXY:1:11105:23196:17726 1:N:0:0
AACCCGTATACATAAAATCTAGACAAAAAAGGAAGGAATCGAACCCCCAAAGCAGG
+
AAAAAEEEEEEEEEEEEAEEEE/EEEEEEEEEEEEEAAE<E/<EEAEE/E//E//E

R2 carries the first half of the gel barcode

Example Read:

@NS500422:374:HFLC7BGXY:1:11105:23196:17726 2:N:0:0
CTGTGACC
+
A/AAAEEE

R3 carries the library index

Example Read

@NS500422:374:HFLC7BGXY:1:11105:23196:17726 3:N:0:0
AGGCTT
+
AAAAAE

R4 caries the second half the barcode, the UMI, and a fraction of the polyA tail

Example Read

@NS500422:374:HFLC7BGXY:1:11105:23196:17726 4:N:0:0
GCATGGGTTCTCATTTTTATT
+
AAAA/EEEEEEEEE///////

Parsing through the indrops github (mainly line 1620-1622 in https://github.com/indrops/indrops/blob/master/indrops.py), it seems the first 8 characters in R4 are the barcode, the next 6 characters are the UMI, and the remaining nucleotides would be the polyA tail.

Appreciate any insight you could could give in terms of parameters to pass to CB_UMI_Complex after combining reads!

Thanks again,

Dakota

alexdobin commented 4 years ago

Hi Dakota,

thanks for the explanations! I think it should be relatively easy to make it work.

First you would need to concatenate the reads sequences from R2 R3 R4 together, e.g.

paste <(zcat R2.gz) <(zcat R3.gz) <(zcat R4.gz) | awk '{if (NR%4==1) {print $1 " " $2} else if (NR%4==3) {print "+"} else {print $1 $2}}' > CB_UMI.fastq

Since the barcodes lengths are constant, the normal --soloType CB_UMI_Simple (a.k.a. Droplet) should work. If there is no whitelist, you would need --soloCBwhitelist None. --soloStrand Unstranded [Reverse/Forward] depends on the strand of R1 with respect to original RNA. The total CB length from R2+R3+R4=8 + 6 + 8=22 (if R3 is constant, it does not need to be included, of course).

--readFilesIn R1_cDNA.fastq CB_UMI.fastq 
--soloType CB_UMI_Simple 
--soloCBwhitelist None 
--soloFeatures Gene SJ GeneFull 
--soloStrand Unstranded [Reverse/Forward]
--soloCBstart 1 --soloCBlen 22 --soloUMIstart 23 --soloUMIlen 6

Please let me know if this works.

Cheers Alex

dakota-hawkins commented 4 years ago

Hi Alex,

Thanks for for the follow up! Following your advice, concatenating the the read files, and issuing the command you provided raised the error:

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 16 not equal to expected 28
Read ID=@NB500996:315:HHNJ5BGXC:1:11101:10820:1063 1 N 0   Sequence=CTGTCGCAATTAGACN
SOLUTION: make sure that the barcode read is the last file in --readFilesIn , and check that it has the correct formatting
          If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength

Feb 12 19:57:45 ...... FATAL ERROR, exiting

The interleafing step produced reads like:

@NB500996:315:HHNJ5BGXC:1:11101:10820:1063 2:N:0:0
CTGTCGCAATTAGACN
+
AAAAAEEAA/AA/EA#

I slightly modified your above one-liner to

paste <(cat R2_test.fastq) <(cat R3_test.fastq) <(cat R4_test.fastq) |  awk '{if (NR%4==1 || NR%4==3) {print $1 " combined"} else {print $1 $2 $3}}' > $fastq

Which seemed to concatenate the reads completely:

@NB500996:315:HHNJ5BGXC:1:11101:10820:1063 combined
CTGTCGCAATTAGACNGCCGGATTCCTNNN
+ combined
AAAAAEEAA/AA/EA#//AA/EEEAEE###

Regarding R3, it seems these are not constant -- at least in the current dataset that I'm working with -- and are library dependent. This is likely a bcl2fastq/sample sheet issue that I should be able to work out.

For a barcode whitelist, after looking through the indrops repo it appears -- for indrops V3 -- barcodes are a Cartesian product of the barcodes listed in this file and their reverse complements. Luckily it does seem the length is constant, and that I should be able to produce a V3 whitelist relatively easily.

Right now I'll plan to:

  1. Segregate reads according to the library indices in R3.
  2. Interleaf reads from R2 and R4 for a barcode + UMI read.
  3. Run starSolo following the guidelines you outlined above with the described whitelist file.

Thanks again for all your help!

I'll update the issue when/if I make some progress in case anyone else runs into a similar issue.

mariusmessemaker commented 4 years ago

Hi Dakota,

I wrote a python function to generate inDrops V3 whitelist from the gel_barcode2_list.txt file (mostly adapted from the indrops.py code): https://github.com/indrops/indrops/blob/master/ref/barcode_lists/gel_barcode2_list.txt. Function arguments are:

indexes: list of strings that contain the library adapter sequences that were used to generate the libraries (e.g. ['AGAGGATA', 'TACTCCTT']).

name: string that contains the save .txt name for your whitelist. indexlength: int that specifies the number of bases in your library index (to make sure you don't make manual copy errors) numberOflibraries: int that specifies the number of libraries that were generated (i.e. the number of different library indexes that were used; also to make sure you don't make manual copy errors).

The function outputs the Cartesian product of all the R2 BC1s, R3 library indexes, R3 BC2s in concatenated strings in the order:

R1: 'AAACAAAC'
R3: 'ATAGAGAG'
R4: 'GTTTGTTT'

Concatenated string: 'AAACAAACATAGAGAGGTTTGTTT'

The function:

def generateIndropV3Whitelist(indexes, name, indexlength, numberOflibraries):
    # Code mostly adapted from Adrian Veres author of indrops.py: https://github.com/indrops/indrops
    # Assuming strings are concatenate in order: R2, R3, R4 

    from itertools import product, combinations
    import string

    ___tbl = {'A':'T', 'T':'A', 'C':'G', 'G':'C', 'N':'N'}
    def rev_comp(seq):
        return ''.join(___tbl[s] for s in seq[::-1])

    def checkIfDuplicates(listOfElems):
        setOfElems = set()
        for elem in listOfElems:
            if elem in setOfElems:
                return True
            else:
                setOfElems.add(elem)
        return False

    # Check that you copied the indexes completely 
    for index in indexes:
        if len(index) != indexlength:
            return 

    # Check that you didn't copy the indexes with duplicates 
    if checkIfDuplicates(indexes):
        return 

    # Check if you supplied the correct number of non-duplicate indexes 
    if len(indexes) != numberOflibraries: 
        return 

    with open('gel_barcode2_list.txt') as f:
        bc2s = [line.rstrip() for line in f]
        rev_bc2s = [rev_comp(bc2) for bc2 in bc2s]

    barcode_iter = product(bc2s, indexes, rev_bc2s)
    v3_names = []

    with open(name, 'w') as f:
        for barcode in barcode_iter: 
            print(''.join(barcode)) 
            f.write(''.join(barcode) + '\n')
    return

For example, you can use the function as follows:

#  Sample 1,  library_index: "ATAGAGAG"
#  Sample 2, library_index: "AGAGGATA"
#  Sample 3, library_index: "TACTCCTT"
#  Sample 4, library_index: "AGGCTTAG"
#  Sample 5, library_index: "CTAGTCGA"
#  Sample 6, library_index: "AGCTAGAA"
#  Sample 7, library_index: "CTTAATAG" 
#  Sample 8, library_index: "ATAGCCTT"

indexes = ['ATAGAGAG', 'AGAGGATA', 'TACTCCTT', 'AGGCTTAG', 'CTAGTCGA', 'AGCTAGAA', 'CTTAATAG', 'ATAGCCTT']
generateIndropV3Whitelist(indexes, './whitelists/pool1_whitelist.txt', 8, 8)

This function call outputs 384 BC1 x 8 library indexes x 384 BC2 = 1179648 concatenated BC strings. You can also supply an empty library index string, which will yield the Cartesian product of the R2 BC1s and R4 BC2s as concatenated strings:

indexes = [' ']
generateIndropV3Whitelist(indexes, './whitelists/bc1andbc2.txt', 0, 1)
dakota-hawkins commented 4 years ago

Thanks, Marius!

I wrote a small script to do essentially the same thing, though it just generated a static whitelist file of all of the gel2barcodes + their reverse compliments. Perhaps I was naive, however, as I did not include any library indices. Is there a reason you chose to do so?

I also wrote a set of functions to segregate reads base on library indices (matching provided library indices with observed sequences in R3 within some hamming distance range). It could probably be sped up a bit more, maybe making better use of generators or lower-level fastq parsers, but for now it gets the job done. For readability, I'll just link the functions here.

The general idea was to get R1 reads and R2R4 combined reads, and then for each library, feed them to STARsolo with the previously explained static whitelist. However, when I ran the commands I received segmentation faults + memory allocation issues.

I've been trying to figure out the cause, but as of right now, I'm not sure if this is A. an issue with the specific genome we're working with (we've had issues with the annotations in the past) B. an issue with how I concatenated the reads, or C. a bug in STARsolo itself.

Here are some of the error logs I received:

*** Error in `STAR': malloc(): memory corruption: 0x000000000791bb60 ***
======= Backtrace: =========
[0x5dee16]
[0x5e08f9]
[0x5b86b9]
[0x42f0c0]
[0x4179c5]
[0x43325e]
[0x42c925]
[0x406275]
[0x5af694]
[0x5af911]
[0x414356]
======= Memory map: ========
00400000-00726000 r-xp 00000000 00:2d 19521690300                        /projectnb/bradham/dyh0110/.conda/envs/indrops-star/bin/STAR
00925000-00938000 rw-p 00325000 00:2d 19521690300                        /projectnb/bradham/dyh0110/.conda/envs/indrops-star/bin/STAR
00938000-00955000 rw-p 00000000 00:00 0 
01b61000-08023000 rw-p 00000000 00:00 0                                  [heap]
2aaed5fcc000-2aaed60f0000 rw-p 00000000 00:00 0 
2aaed618f000-2aaede44f000 rw-p 00000000 00:00 0 
2aaee0000000-2aaee002c000 rw-p 00000000 00:00 0 
2aaee002c000-2aaee4000000 ---p 00000000 00:00 0 
2aaef43df000-2aaefd07f000 rw-p 00000000 00:00 0 
2ab160805000-2ab161529000 rw-p 00000000 00:00 0 
2ab161e4a000-2ab16be97000 rw-p 00000000 00:00 0 
7fff8b364000-7fff8b3ac000 rw-p 00000000 00:00 0                          [stack]
7fff8b3b3000-7fff8b3b5000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
/usr/bin/bash: line 1: 30512 Aborted                 STAR --genomeDir /projectnb/bradham/data/ReferenceSequences/GenomeAnnotations/indrops_star/ --readFilesIn /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/fastq/combined/ASW-18hpf_cdna.fastq /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/fastq/combined/ASW-18hpf_bc_umi.fastq --soloType CB_UMI_Simple --soloCBwhitelist ref/gel_barcode3_list.txt --soloFeatures Gene SJ GeneFull --soloStrand Unstranded Forward --outFileNamePrefix /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/STAR/ASW-18hpf --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 6
[Tue Feb 18 21:50:46 2020]
Error in rule run_star_solo:
    jobid: 0
    output: /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/STAR/ASW-18hpf/Solo.out/Gene/raw/matrix.mtx, /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/STAR/ASW-18hpf/Solo.out/Gene/raw/barcodes.tsv, /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/STAR/ASW-18hpf/Solo.out/Gene/raw/features.tsv
    shell:
        STAR --genomeDir /projectnb/bradham/data/ReferenceSequences/GenomeAnnotations/indrops_star/ --readFilesIn /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/fastq/combined/ASW-18hpf_cdna.fastq /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/fastq/combined/ASW-18hpf_bc_umi.fastq --soloType CB_UMI_Simple --soloCBwhitelist ref/gel_barcode3_list.txt --soloFeatures Gene SJ GeneFull --soloStrand Unstranded Forward --outFileNamePrefix /projectnb/bradham/data/scRNASeq/raw/scRNASeq2019-10-24/processed/STAR/ASW-18hpf --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 6
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

And the tail-end of the Log file

--sjdbOverhang = 56 taken from the generated genome
Started loading the genome: Tue Feb 18 18:39:56 2020

Genome: size given as a parameter = 1749854337
SA: size given as a parameter = 7594129599
SAindex: size given as a parameter = 1
Read from SAindex: pGe.gSAindexNbases=14  nSAi=357913940
nGenome=1749854337;  nSAbyte=7594129599
GstrandBit=32   SA number of indices=1841001114
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 1749854337 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 1749854337 bytes
SA file size: 7594129599 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 7594129599 bytes
Loading SAindex ... done: 1565873619 bytes
Finished loading the genome: Tue Feb 18 18:40:09 2020

Processing splice junctions database sjdbN=168209,   pGe.sjdbOverhang=56 
alignIntronMax=alignMatesGapMax=0, the max intron size will be approximately determined by (2^winBinNbits)*winAnchorDistNbins=589824
winBinNbits=16 > pGe.gChrBinNbits=13   redefining:
winBinNbits=13
Loaded transcript database, nTr=30238
Loaded exon database, nEx=198447
Thread #0 end of input stream, nextChar=-1
Completed: thread #0
Feb 18 18:40:11 ..... started Solo counting
Feb 18 18:40:11 ... Starting Solo post-map for Gene
Feb 18 18:40:11 ... Finished allocating arrays for Solo 0 GB
Feb 18 18:40:11 ... Finished reading reads from Solo files nCB=0, nReadPerCBmax=0, nMatch=0
Feb 18 18:40:11 ... Finished collapsing UMIs

Looking at expected output files, it seem a single line was written to Solo.out/Gene/raw/matrix.mtx file before the crash.

alexdobin commented 4 years ago

Hi Dakota, Marius,

this is a great discussion that will be useful to other inDrop users. Once the pipelines work, it would be great if you could create GitHub repositories for the pipelines.

The error is caused by STAR not detecting any cell barcodes, so it seems something went wrong with the formatting of either barcode reads or whitelists. Could you check manually a few reads that you generated to see if you can find their barcodes in the whitelist?

Cheers Alex

mariusmessemaker commented 4 years ago

Hi Dakota, The reason I included library indexes is because the library indexes that we use are constantly 8 bp; therefore, you can just concatenate the gel-barcode 1 in R2, the library index in R3, and the gel-barcode 2 in R4 because together these barcodes specify an unique cell. I concatenated the reads with the following script and I also checked that all the reads in the R2.fastq, R3.fastq, and R4.fastq are in the correct order (I tested the script manually on a 4 entries fastq subset and checked multiple reads manually in the full concatenated fastq file):

srun paste R2.fastq R3.fastq R4.fastq | awk '{if ((NR%4==1) && ($1 == $3) && ($3 == $5)) {print $1 " 2:3:4:paste"} else if ((NR%4==1) && ($1 != $3)) {print "singleton"} else if ((NR%4==1) && ($3 != $5)) {print "singleton"} else if (NR%4==3) {print $1} else {print $1 $2 $3}}' > R234_combined.fastq

if cat R234_combined.fastq | grep -q "singleton"
  then
    echo singletons were found therefore one or multiple fastqs are broken
fi

# (all my fastq files were in the correct order)

However, if your library demultiplexing script works properly, then you should be able to use your or my whitelist generator without library index barcode to generate a whitelist that matches the barcodes in the concatenated R2, R4 reads for each library? Did you check whether the concatenated R2, R4 reads match you barcodes in the whitelist as Alex suggested?

Hi Alex, Thank you for creating STARSolo and STAR! Yes, I can find the barcodes in the whitelist when I manually check many reads (also when I use zUMIs I can find the expected number of barcodes using the whitelist). However, when I try to run STAR I get the following error caused by not all the concatenated R2,R3,R4 reads having the correct length (i.e. 30 bp):

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 29 not equal to expected 30
Read ID=@NB551608:27:HHJCTBGX9:1:11101:26865:1425 9708 N 0   Sequence=NNNTTCTCTAGTCGANNAAACTGACATGG
SOLUTION: make sure that the barcode read is the last file in --readFilesIn , and check that it has the correct formatting
          If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength

This error is expected when STAR does not remove reads that don't have the correct length because if you check the concatenated R2,R3,R4 barcode read length distribution with the following script:

zcat R234_combined.fastq.gz | head -200000000 | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > pool1_length.txt

You get the following distribution:

1 24
101 25
157 26
306 27
2459 28
8252 29
49988724 30

As expected most reads have the correct length of 30 bp and some reads don't have the correct length of 30 bp because of sequencing errors, which should be removed together with their corresponding R1 because these reads can not be assigned a cell and UMI anymore.

I can solve the problem of reads not having the correct length myself by removing reads in the concatenated R2, R3, R4 reads that are too short and their corresponding R1 read with a custom script or cutadapt (I am planning to remove reads with cutadapt in the near future and test whether I can run STARSolo). However, should STARSolo not be able to check whether barcode and UMI reads have a correct length and if needed remove these reads when they have an incorrect length? I am sure CellRanger does this as well? Or maybe I am ignorant and missing an option of STARSolo that I should have specified when running STARSolo? I ran STAR as follows:

ulimit -n 10000

srun STAR --runThreadN 20 \
--genomeDir /n/data1/mgh/csb/pittet/references/mouse_mm10_99_STAR_updateFeb2020/STAR_genome \
--sjdbGTFfile /n/data1/mgh/csb/pittet/references/mouse_mm10_99_STAR_updateFeb2020/Mus_musculus.GRCm38.99.filtered.gtf \
--readFilesIn /n/data1/mgh/csb/pittet/seq_archive/18_12_28_RZ/181227_NB551608_0027_AHHJCTBGX9/software.rc.fas.harvard.edu/ngsdata/181227_NB551608_0027_AHHJCTBGX9/Tumor_vs_Liver/SUB07818pool1_S1_R1_001.fastq.gz /n/scratch2/mm884-pittet/tmp_fastq_files/SUB07818pool1_S1_R234_combined.fastq.gz \
--readFilesCommand zcat \
--soloType CB_UMI_Simple \
--soloCBwhitelist /n/data1/mgh/csb/pittet/marius//whitelists_pools/pool1_whitelist.txt \
--soloCBstart 1 --soloCBlen 24 --soloUMIstart 25 --soloUMIlen 6 \
--soloCBmatchWLtype 1MM_multi_pseudocounts \
--soloStrand Unstranded [Reverse/Forward] \
--soloFeatures Gene GeneFull \
--soloUMIdedup 1MM_All \
--soloUMIfiltering MultiGeneUMI \
--soloCellFilter None \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \
--outFileNamePrefix /n/scratch2/mm884-pittet/indrops_tmp_files/pool1/

If we have a inDrops STARSolo pipeline, I will be happy to make a github repository for other inDrops users! In addition, I was wondering how STARSolo checks the phred quality scores of the bases in the barcode reads and whether you are planning to make it possible for users to specify a threshold (for example, for inDrops I would have the threshold: average phred quality of 4 bases windows should be at least > 20). In addition, I was wondering what is the status of the development of the Velocyto feature?

dakota-hawkins commented 4 years ago

Thanks Alex, Marius!

Thanks for the insight, Alex! I'll double check the whitelist + combined reads to ensure the combined barcodes match expected barcodes in the generated whitelist file. Perhaps using Marius's approach will fix the issue.

dakota-hawkins commented 4 years ago

Hi Alex,

I double checked using a small subset of our data (400 reads) and it does appear the barcodes in the combine R2R4 reads -- the first 16 characters -- are present in the whitelist file I generated.

Trying to run STARsolo using the following command:

        STAR --genomeDir {params.index}
        --readFilesIn {input.cdna} {input.bc_umi} --soloType CB_UMI_Simple 
        --soloCBwhitelist {input.whitelist} --soloFeatures Gene SJ GeneFull 
        --soloStrand Unstranded Forward --outFileNamePrefix {params.out} 
       --soloCBstart 1 --soloCBlen 16 --soloUMIstart 18 --soloUMIlen 6

Still results in a segmentation fault. To double check whitelist issues, I replace --soloCBwhitelist {input.whitelist} with --soloCBwhitelist None, and I still receive a segmentation fault. This is also true when I run on the complete dataset.

When I run standard STAR alignment on the cDNA reads, we receive generally good alignment (in terms of percent reads aligned). However, when comparing genomic alignments to transcriptomic alignments, we seem to drop considerably (say 90% to 2%). We generally believe this is an issue with our annotations as they do not seem to include UTR information, and genes are annotated as starting and stopping at the same locations as exons -- not allowing for UTR alignments even without annotations.

Is it possible these are related and a lack of transcriptomic alignments could result in a segmentation fault?

mariusmessemaker commented 4 years ago

Hi Dakota,

Shouldn't --soloUMIstart be set at 17 instead of 18? As:

BC1 (R2) = XXXXXXXX BC2 (R4) = YYYYYYYY UMI (R4) = ZZZZZZ

1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22               
X  X  X  X  X  X  X  X  Y  Y  Y  Y  Y  Y  Y  Y  Z  Z  Z  Z  Z  Z

Also, since you have a fraction of the polyA tail in R4 your concatenated R2R4 reads might be 22 bp + 7 bp = 29 bp? In this case, I think you need to specify--soloBarcodeReadLength 29 because your concatenated R2R4 barcode read length does not equal gel-barcode 1 + gel-barcode 2 + UMI. However, you might still get the same STAR error as me that reads are smaller than the expected read length. Because your reads with sequencing errors should still be longer than gel-barcode 1 + gel-barcode 2 + UMI (because of poly-A fraction) you might specify --soloBarcodeReadLength 0 to omit the check of concatenated R2R4 read lengths altogether; however, there might be a small probability that reads with an incorrect length match barcodes in your provided whitelist within 1 hamming distance while these reads should be thrown away. Alex if I specify --soloBarcodeReadLength 0 will STAR still give the same error that some of my reads are too short? Or will STAR then later throw away these reads because they cannot be assigned to both a cell and a UMI (this will solve my problem)?

We generally believe this is an issue with our annotations as they do not seem to include UTR information, and genes are annotated as starting and stopping at the same locations as exons -- not allowing for UTR alignments even without annotations. In GTF format the feature "exon" includes the feature "5'UTR" (first exon) and the feature "3'UTR" (last exon). In addition, the feature exon includes the feature "CDS" which represents the translated coding sequence of a transcript. On average, inDrops generates 400-600 bp DNA fragments (this is of course a length distribution that varies per run). If genes have a poly-adenylation site in their 3'UTR at a distance > 600 bp from the last CDS, then most reads should align to the 3'UTR of these genes (the cDNA read starts at the 5'end of the DNA fragments). If you have many of such genes, a substantial amount of reads are not counted when the 3'UTR is not included in the exon feature of your GTF because --soloFeatures Gene only counts alignments that are entirely contained in exons (or follow it's splice junctions). In addition, although --soloFeatures GeneFull also counts reads that align to "introns" this will not increase the read count for genes of which the reads mostly align to their 3'UTR.

dakota-hawkins commented 4 years ago

Hi Marius, you're right that it should be the 17th index. I was playing around with the parameter to see if I accidentally bungled the original 17 and just copied the wrong line. Either way, setting --soloUMIstart 17 still appears to result in a segmentation fault. I'll look into --soloBarcodeReadLength 0 to see if this resolves the issue. Likewise I'll probably test on the reads disseminated in the indrops repo to help determine if it's an implementation issue or a data issue.

Thanks for the insights re: UTRs and how STAR might interact with them.

Re:

GTF format the feature "exon" includes the feature "5'UTR" (first exon) and the feature "3'UTR" (last exon)

Are you saying UTRs are implicitly include in the first/last exon, or that they are simply possible features that should be listed? For example a typical annotation in our gff might look like this:

scaffold10000_len74836_cov183   EVM gene    83072   86662   .   +   .   ID=evm.TU.scaffold10000_len74836_cov183.2;Name=EVM%20prediction%20scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM mRNA    83072   86662   .   +   .   ID=evm.model.scaffold10000_len74836_cov183.2;Parent=evm.TU.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM exon    83072   83182   .   +   .   ID=evm.model.scaffold10000_len74836_cov183.2.exon1;Parent=evm.model.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM CDS 83072   83182   .   +   0   ID=cds.evm.model.scaffold10000_len74836_cov183.2;Parent=evm.model.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM exon    84379   84462   .   +   .   ID=evm.model.scaffold10000_len74836_cov183.2.exon2;Parent=evm.model.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM CDS 84379   84462   .   +   0   ID=cds.evm.model.scaffold10000_len74836_cov183.2;Parent=evm.model.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM exon    86558   86662   .   +   .   ID=evm.model.scaffold10000_len74836_cov183.2.exon3;Parent=evm.model.scaffold10000_len74836_cov183.2
scaffold10000_len74836_cov183   EVM CDS 86558   86662   .   +   0   ID=cds.evm.model.scaffold10000_len74836_cov183.2;Parent=evm.model.scaffold10000_len74836_cov183.2

Where gene and final exon + cds regions have the same stopping point -- presumably leaving UTRs unaccounted for as they should not be considered coding.

mariusmessemaker commented 4 years ago

An exon is any part of a gene that encodes part of the mature mRNA transcript of that gene, which means that exon boundaries are the transcription start site, splice donor, splice acceptor and poly-adenylation site. Therefore, the first exon should implicitly include the 5'UTR, start codon, and the first CDS until the first splice donor site & the last exon should implicitly include the last CDS (the exon starts from the last splice acceptor site), stop codon, and the 3'UTR (the exon ends at the poly-adenylation site in the 3'UTR). If your model organism has 3'UTRs in 400-600 bp range that are not included in the exon feature of your annotation file (which they should be because 3'UTRs are part of the mature transcripts), you will lose read counts because inDrops generates 400-600 bp DNA fragments on average. The features "5'UTR" and "3'UTR" are optional features to include in your annotation file, and will not change the read count.

For more information about the GTF format: http://mblab.wustl.edu/GTF22.html

alexdobin commented 4 years ago

Hi Dakota, Marius,

there are several issues here and I am getting a bit confused. Let's try to separate them into several examples.

Dakota, could you please post Solo.out/Barcodes.stats and Solo.out/Gene/Features.stats

Marius, --soloBarcodeReadLength 0 will not check the length of the read, however, if the read length is < than expected 30, it will result in undefined behavior for these reads. I will implement a check to discard such reads, but at the moment it's best to get rid of them pre-mapping. If you do that, do the results look OK? If not, please also post the Solo.out/Barcodes.stats and Solo.out/Gene/Features.stats

Cheers Alex

mariusmessemaker commented 4 years ago

Hi Alex,

I trimmed the reads with cutadapt before running STARsolo and now STARsolo works perfectly! It also has really quick running times for >30,000 cells! Thank you for your great tool. I will make a Github repository that I will post here for inDrops V3 users that I will post here.

Cheers, Marius

alexdobin commented 4 years ago

Hi Marius,

great - thanks for letting me know and for sharing it with the other users.

Cheers Alex

droplet-lab commented 4 years ago

Hi all, Just checking if there was any advancements on developing a repo for indrops v3 @mariusmessemaker ? I checked the repo but it was empty. I would be awesome if you could share the code you've used to run this ?

Thanks a lot in advance for your help! Cheer, Jo

mariusmessemaker commented 4 years ago

Hi Jo,

Yes, I wanted to benchmark the inDrops results first before making the Repo; however, I had significant delays because I had to move to another country due to Covid-19.

Best, Marius

droplet-lab commented 4 years ago

Thank you Marius, yes please do let me know if you manage to benchmark. It would be so useful!

Thanks a lot, Jo

dakota-hawkins commented 4 years ago

Sorry for the long radio silence: COVID + other projects seemed to have kept me busy. One of those projects was getting hold of a different genome + annotations, which seemed to have fixed the segfault error I was getting -- which is definitely nice! Thanks for all your help with this, @alexdobin

@Jachimdejonghe Jo, I have a working bare-bones pipeline using snakemake here. It hasn't been tested extensively, but perhaps will serve as a decent starting point for you. Feel free to clone/fork as you desire, and if you have suggestions/improvements I would be more than happy to discuss/include them.

Best,

Dakota

Edit I haven't tested a fresh install of the pipeline, so if you do choose to use the pipeline, definitely let me know if you run into any issues. Likewise, I've only run it on linux machines, so other issues may arise if you try to run it on something else.

mariusmessemaker commented 4 years ago

Hi Jo,

Before I post my bench marking results here with corresponding repo (so far they look good), you can also in the meantime use the indrops-STARSolo pipeline below if you want. The difference between mine and Dakota's pipeline is that Dakota's pipeline demultiplexes the reads by library index first while my pipeline does not (I use a two line mapper/barcode parser in python that adds a library name to the meta (.obs) of Scanpy). Therefore, the output cell barcodes in the barcode.tsv will specify unique cells as follows:

# With R3 library index: 
R1: 'AAACAAAC'
R3: 'ATAGAGAG' 
R4: 'GTTTGTTT'

Cell barcode string (in barcodes.tsv): 'AAACAAACATAGAGAGGTTTGTTT'

# Without R3 library index (if you did not use library indexes when running inDrops): 
R1: 'AAACAAAC'
R4: 'GTTTGTTT'

Cell barcode string (in barcodes.tsv): 'AAACAAACGTTTGTTT'

The pipeline:

  1. Generate your specific inDrops V3 barcode whitelists with or without library index with the whitelist generator function: https://github.com/alexdobin/STAR/issues/825#issuecomment-590026290

  2. Run the following shell script to combine R2, R3 (if you've used library indexes), and the first 8 bp of R4 into one R234 read:

    
    paste R2.fastq R3.fastq R4.fastq | awk '{if ((NR%4==1) && ($1 == $3) && ($3 == $5)) {print $1 " 2:3:4:paste"} else if ((NR%4==1) && ($1 != $3)) {print "singleton"} else if ((NR%4==1) && ($3 != $5)) {print "singleton"} else if (NR%4==3) {print $1} else {print $1 $2 $3}}' > R234.fastq

Check for singleton reads (although .fastq should never be unordered):

if cat R234.fastq | grep -q "singleton" then echo singletons were found therefore one or multiple fastqs are broken fi

3. Trim the R1 and  R234 .fastq files with cutadapt (STAR expects all cell barcode input reads to have the correct length): 

cutadapt --cores 20 \ --minimum-length 1:30 \ --output R1_trimmed.fastq.gz \ --paired-output R234_trimmed.fastq.gz \ R1.fastq.gz \ R234.fastq.gz

4. Run STARSolo with `R1_trimmed.fastq.gz`, `R234_trimmed.fastq.gz`, and your generated V3 barcode whitelist as input: 

ulimit -n 10000

srun STAR --runThreadN 20 \ --genomeDir STAR_genome \ --sjdbGTFfile Mus_musculus.GRCm38.99.filtered.gtf \ --readFilesIn R1_trimmed.fastq.gz R234_trimmed.fastq.gz \ --readFilesCommand zcat \ --soloType CB_UMI_Simple \ --soloCBwhitelist indrops_v3_whitelist.txt \ --soloCBstart 1 --soloCBlen 24 --soloUMIstart 25 --soloUMIlen 6 \ --soloCBmatchWLtype 1MM_multi_pseudocounts \ --soloStrand Unstranded [Reverse/Forward] \ --soloFeatures Gene GeneFull \ --soloUMIdedup 1MM_All \ --soloUMIfiltering MultiGeneUMI \ --soloCellFilter CellRanger2.2 7250 0.99 10 \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \ --outFileNamePrefix sample/

droplet-lab commented 4 years ago

Thank you both so much and I hope you are all well. I will try both and let you know how it goes!

droplet-lab commented 4 years ago

Hi @mariusmessemaker and @dakota-hawkins , so I tried Marius's approach for now. It looks like it works quite ok. The resulting cell types and clusters identified seem reasonable and it compares well to other pipelines.

I slightly changed step 2 to take fastq.gz in, although it makes everything much slower :

paste <(zcat SAMPLE_R2.fastq.gz) <(zcat SAMPLE_R3.fastq.gz) <(zcat SAMPLE_R4.fastq.gz) | awk '{if ((NR%4==1) && ($1 == $3) && ($3 == $5)) {print $1 " 2:3:4:paste"} else if ((NR%4==1) && ($1 != $3)) {print "singleton"} else if ((NR%4==1) && ($3 != $5)) {print "singleton"} else if (NR%4==3) {print $1} else {print $1 $2 $3}}' | gzip > SAMPLE_R234.fastq.gz

if gunzip -cd SAMPLE_R234.fastq.gz | grep -q "singleton" then echo singletons were found therefore one or multiple fastqs are broken fi

I still have lower gene and UMI counts per cell, roughly 2 x less than with zUMIs pipeline, which also uses STAR for mapping, then feature counts for counting. I am not too sure why that is, as the mapping rates are quite similar (roughly 70 % uniquely mapped, 15% multi mapping and 8% too short). Let me know if you notice something similar? I would guess the 2-pass mapping in zUMIs might have bumped the number up slightly. Also the zUMIs values are with introns and axons, which I believe is also the case with STARSolo's GeneFull.

Al the best, Jo

mariusmessemaker commented 4 years ago

Hi Jo,

Great that the pipeline works for you! My cell clusters and cell annotation also compare well to data that was processed with the indrops.py pipeline (I am working on a benchmarking that I will post here but unfortunately I have other projects that have priority right now).

I still have lower gene and UMI counts per cell, roughly 2 x less than with zUMIs pipeline, which also uses STAR for mapping, then feature counts for counting. I am not too sure why that is, as the mapping rates are quite similar (roughly 70 % uniquely mapped, 15% multi mapping and 8% too short).

Yes, I also noticed that STARSolo counts less reads than zUMIs (and indrops.py) with similar mapping rates. Therefore, the lower gene count can be explained by the difference in how STARSolo (and Celllranger) assign reads compared to featureCounts (and indrops.py). STARSolo (and Cellranger) are more conservative and only assign aligned reads that are entirely contained in the gene exons (or follow it's splice junctions) while featureCounts counts all reads that overlap exons (as Alex Dobin explains here). In addition, you should only compare --soloFeatures Gene to zUMIs that counted reads that aligned to exons, and --soloFeatures GeneFull to zUMIs that counted reads that aligned to both introns and exons.

Although your clustering and annotation shows that we probably still have the same biological signal, it will be interesting to make (essentially the most important benchmarking plots I want to make to compare STARSolo and indrops.py):

  1. Scatterplot where you compare UMI counts for each gene between --soloFeatures Gene and zUMIs that counted reads that aligned to exons (such as here).

  2. Scatterplot where you compare the number of expressed genes per cell between --soloFeatures Gene and zUMIs that counted reads that aligned to exons (such as here).

  3. Scatterplot where you compare total UMI counts for each barcode between --soloFeatures Gene and zUMIs that counted reads that aligned to exons (such as here).

If you share these plots here, we can assess the read assignment differences between the zUMIs and STARSolo pipeline.

PS: If you have multiple cores, you might want to use pigz to decompress to make your pipeline faster: https://zlib.net/pigz/.

Qotov commented 4 years ago

I think one of the reasons for the low UMI/genes per cell is that InDrops pipeline allows 2 mismatches in the barcode, and STARsolo only one. Less valid barcodes, less reads, fewer UMIs.

Please have a look at InDrops code https://github.com/indrops/indrops/blob/master/indrops.py#L97

@alexdobin could you please add two mismatches for validation of barcodes?

dakota-hawkins commented 4 years ago

@Qotov As a work around, it would likely be relatively easy to just write your own script that corrects fastq barcodes to whitelist barcodes if the hamming distance <= 2.

alexdobin commented 4 years ago

Hi Alex, Dakota,

do you know the magnitude of the effect - how many more reads are recovered with two mismatches? One of the dangers of overcorrecting is to create artificial doublets - is it worth it to recover just a few more % fo the reads?

Cheers Alex

Qotov commented 4 years ago

I compared STARsolo with dropEst (I used STAR as aligner in the dropEst pipeline and got an almost similar percentage of aligned reads). dropEst uses a hamming distance <=2 for barcode correction as default.

image

alexdobin commented 4 years ago

Hi Alex,

not sure what's plotted here - is it the total number of genic counts per cell, with two methods? The difference seems to be too big to be explained by the barcode Hamming distance. Do you know how does dropEst assign reads to genes? I know that many pipelines use simple overlap of reads with exons. STARsolo requires that reads are contained entirely within exons - a much more stringent requirement that will exclude the reads that cross exon/intron boundaries.

Cheers Alex

Qotov commented 4 years ago

Hi Alex,

it is the total number of counts per cell for two methods.

dropEst as default counts UMIs with exonic reads only + which have both exonic and not annotated reads + which have both exonic and intronic reads + which have exonic, intronic and not annotated reads.

However, for STARsolo I used --soloFeatures GeneFull

alexdobin commented 4 years ago

Hi Alex,

--soloFeatures GeneFull should have captured all kinds of reads... If we want to dig into it, we would need to find some examples of the discrepancies. Does dropEst output a SAM file with corrected barcodes?

Cheers Alex