alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 503 forks source link

Transcriptome vs Genome mapping when using artificial reference #1238

Open tatyana-perlova opened 3 years ago

tatyana-perlova commented 3 years ago

We use STAR Solo to map results of CRISPR screens, specifically at one of the steps we map guide RNA reads that we have introduced and assign them to cells . For that we generate artificial reference, where each chromosome is individual guide and its full length is marked as exon. Example fasta and gtf files that are used to generate reference for STAR are below. ref.fa

>guide_1
GAAAGGACGAAACACCGGCTCGATCCGGTACGCTTAGTTTCAGAGCTATGC
>guide_2
GAAAGGACGAAACACCGAGATGCGCTACGACGCGTAGTTTCAGAGCTATGC
>guide_3
GAAAGGACGAAACACCGAAAGCTTAACGCGCGCGATGTTTCAGAGCTATGC

ref.gtf

guide_1 guide_panel exon    1   51  .   +   .   gene_id "guide_1"; transcript_id "guide_1"; gene_name "guide_1"; gene_type "gRNA"
guide_2 guide_panel exon    1   51  .   +   .   gene_id "guide_2"; transcript_id "guide_2"; gene_name "guide_2"; gene_type "gRNA"
guide_3 guide_panel exon    1   51  .   +   .   gene_id "guide_3"; transcript_id "guide_3"; gene_name "guide_3"; gene_type "gRNA"

Because genome in this case is the same as transcriptome we expect the same fraction of reads to map to genome and transcriptome. However that is not the case. E.g. here is the STAR Solo Summary output for Gene:

Reads Mapped to Genome: Unique+Multiple,0.542601
Reads Mapped to Genome: Unique,0.542601
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.4559
Reads Mapped to Transcriptome: Unique Genes,0.4559

and GeneFull:

Reads Mapped to Genome: Unique+Multiple,0.542601
Reads Mapped to Genome: Unique,0.542601
Reads Mapped to Transcriptome: Unique+Multipe GeneFulls,0.4559
Reads Mapped to Transcriptome: Unique GeneFulls,0.4559

To troubleshoot this I ran STAR Solo with --quantMode TranscriptomeSAM to save reads that align to transcriptome in separate file. I then compared Aligned.out.bam and Aligned.toTranscriptome.out.bam. But all the comparison methods (bam diff, picard CompareSAMs and comm on sorted read names) give me empty result. samtools view -F 4 -c also gives me the same number of records on both files. So it seems that there are no reads that map on to genome, but not to transcriptome. But then why is there a discrepancy in the Summary.csv and why Aligned.out.bam is slightly bigger than Aligned.toTranscriptome.out.bam - e.g. 6.1 Gb vs 5.8 Gb? Thank you!

alexdobin commented 3 years ago

Hi Tatyana,

this is an interesting application of STARsolo. :)

The Reads Mapped to Transcriptome: Unique Genes represent the reads that map to unique genes AND have valid barcodes, i.e. those reads that contribute to UMI counts in the gene/count matrix. That's why they are lower than the % of reads mapped to the genome, which is calculated for all reads.

You can check the more detailed statistics in Solo.out/Barcodes.stats for reads that do not have barcodes:

nNoAdapter
nNoUMI
nNoCB
nNinCB
nNinUMI
nUMIhomopolymer
nTooMany
nNoMatch

Solo.out/Gene/Features.stats counts reads with barcodes mapped (or not) to genes: nMatchUnique: this is the count that's divided by the total number of reads to get Reads Mapped to Transcriptome: Unique Genes.

nUnmapped: unmapped reads nNoFeature: mapped, but not assigned to genes - should be 0 in your case nAmbigFeature: mapped to multiple genes

Cheers Alex

umbers reported in Summary.csv

tatyana-perlova commented 3 years ago

Hi Alex, Thank you for your reply! And thank you for your continuing support and development of this great tool.

I think that's our workaround to use STARsolo for multi-modal data. We first map all reads to full transcriptome reference and then take only unmapped reads and map them to artificial references, e.g. gRNAs, citeseq labels etc.

Ok, that makes perfect sense. I don't have nMatchUnique in Features.stats, but I have nMatch, which when divided by the total number of reads gives me Reads Mapped to Transcriptome: Unique Genes. nNoFeature is very small - on the order of 10-100 reads, but nonzero. nUnmapped doesn't match the total number of unmapped reads, but the number of unmapped reads with valid barcode, is that right?

Thanks again! Best, Tatyana

alexdobin commented 3 years ago

Hi Tatyana,

I think that's our workaround to use STARsolo for multi-modal data. We first map all reads to full transcriptome reference and then take only unmapped reads and map them to artificial references, e.g. gRNAs, citeseq labels etc.

This is a very neat hack for multi-modal data! It's high on my TODO list to implement it so it can be done in one pass. My hope is to code it in a way that would allow for many different modes with a simple change of parameters.

nUnmapped doesn't match the total number of unmapped reads, but the number of unmapped reads with valid barcode, is that right?

Correct!

nNoFeature is very small - on the order of 10-100 reads, but nonzero.

Are you prohibiting splicing when you map to sgRNA "reference"? If not, I imagine a few reads manage to get spliced, and so they are mapped to the "genome" but not concordant with annotations.

Cheers Alex

andrewjkwok commented 3 years ago

Hi @tatyana-perlova,

Just wanted to ask - when you take the unmapped reads and map them to artificial references like citeseq labels - what were the settings that you used, both for generating the citeseq reference, and the subsequent mapping? For example, I'm wondering that the cell filtering step would be different to when mapping to the transcriptome, and I'm guessing multi-mapping would be set to none?

tatyana-perlova commented 3 years ago

Hi @andrewjkwok , When creating the reference I scale –genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1) as per STAR manual. I disable cell filtering completely, as I can just use cells filtered using transcriptome reads when assigning labels. Additionally I disable splicing and multimapping with:

--alignIntronMax 1
--alignMatesGapMax 1
--winAnchorMultimapNmax 1
--chimMainSegmentMultNmax 1
--outFilterMultimapNmax 1
--seedMultimapNmax 1

You also have to be careful that the flanking sequences included in the reference are not too long compared to the labels themself otherwise depending on the mismatch rate you may get erroneously mapped labels as they map to the constant regions and all mismatches are concentrated in the variable region.

andrewjkwok commented 3 years ago

Thanks, that's very helpful. I'm mapping BD Rhapsody Abseq reads specifically which might make it a bit different to the CITE-seq / CRISPR guide RNA reads that you have, but I was wondering whether you had any issues with a very high "Unmapped short" percentage of reads (I'm getting between 50-60%)?

When I manually set--outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0 the "Unmapped short" rate goes to 0%, but then the "Mapped Length" of the reads is shorter than expected, so it seems like 0 for those two options is too lax. Adjusting the --outFilterMatchNminOverLread option seems to be the answer, but then I'm not sure how to assess what threshold is too stringent vs. what is too lax?

tatyana-perlova commented 3 years ago

Sure! I do get high percentage of reads that are "Unmapped too short", but it's not really a problem in my case because I only expect a tiny fraction of original reads to map to guides/CITE-seq labels anyway.

I do set both --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0. However instead I set --outFilterMatchNmin to ~0.75 reference_length = 0.75 (flanking_length*2 + guide_length).

alexdobin commented 3 years ago

The deafult values of --outFilterScoreMinOverLread and --outFilterMatchNminOverLread are a usually a reasonable compromise between sensitivity and precision. The best way to deal with low mapping % is to figure out its cause. The first thing I would suggest is to trim adapters - if your reads are longer than the library inserts, it can reduce mappability.

ktpolanski commented 2 years ago

Found this issue by searching for "cite" within the issues.

I would just like to provide another voice of support for adding CITE-seq support for starsolo. Moving away from cellranger has been an incredible experience for GEX libraries, but the lab has started a push for CITE-seq along with scaling up data production. As such, I've got more data to process than ever, and I'm stuck with cellranger for it.

alexdobin commented 2 years ago

It's getting very close to the top of my TODO list, but still a few months away...

theheking commented 2 years ago

Hi @andrewjkwok did you ever resolve the error of the high percentage of "Unmapped too short" with the Rhapsody pipeline. After setting both --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to zero I used the --outFilterMatchNmin flag suggestion from tatyana above, however, this recreated the same error of high % of unmapped too short. Cheers

alexdobin commented 2 years ago

Hi @theheking

low mappability of the reads is a separate problem, it probably is not related to the specifics of the barcoding technology. Could you please post the Log.final.out file?

theheking commented 2 years ago

Yes of course! Sorry I should be clearer- I am having issues with high % of unmapped to genes, not genome. As well as low UMI count. Here is the Log.final.out:

` Started job on | May 01 09:06:46 Started mapping on | May 01 09:08:31 Finished on | May 01 10:26:05 Mapping speed, Million of reads per hour | 349.30

                      Number of input reads |   451565428
                  Average input read length |   72
                                UNIQUE READS:
               Uniquely mapped reads number |   352651645
                    Uniquely mapped reads % |   78.10%
                      Average mapped length |   71.98
                   Number of splices: Total |   23318132
        Number of splices: Annotated (sjdb) |   8437
                   Number of splices: GT/AG |   22117253
                   Number of splices: GC/AG |   148419
                   Number of splices: AT/AC |   14054
           Number of splices: Non-canonical |   1038406
                  Mismatch rate per base, % |   0.97%
                     Deletion rate per base |   0.02%
                    Deletion average length |   1.58
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.44
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   82098633
         % of reads mapped to multiple loci |   18.18%
    Number of reads mapped to too many loci |   3283121
         % of reads mapped to too many loci |   0.73%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 8851563 % of reads unmapped: too short | 1.96% Number of reads unmapped: other | 4680466 % of reads unmapped: other | 1.04% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%`

However, the Barcodes.stats file shows really high NoUMI count

which when I look into the Summary.csv file shows really low mappability to Gene, not Genome and low UMI count.

and the Feature.stats:

                                     nUnmapped       12676922
                                    nNoFeature      372846470
                                 nAmbigFeature           2788
                         nAmbigFeatureMultimap           2788
                                      nTooMany              0
                                 nNoExactMatch             13
                                   nExactMatch          50898
                                        nMatch          58026
                                  nMatchUnique          55238
                                 nCellBarcodes           5191
                                         nUMIs          12721 

The STAR solo run I ran: STAR --runMode alignReads --genomeDir /mm10_STAR_index/STAR/ --readFilesCommand zcat --readFilesIn INDEXLIBNOVASEQ_R2_001.trimmed.fastq.gz INDEXLIBNOVASEQ_R1_001.trimmed.fastq.gz --runThreadN 8 --outFileNamePrefix BASIC --outSAMtype BAM SortedByCoordinate --soloType CB_UMI_Complex --soloUMIposition 0_52_0_59 --soloCBposition 0_0_0_8 0_21_0_29 0_43_0_51 --soloUMIlen 8 --soloCBwhitelist whitelist_CSL.txt whitelist_CSL2.txt whitelist_CSL3.txt --soloCBmatchWLtype 1MM

Examples R1: @A00488:217:H2GNCDRXY:1:1101:1054:1000 1:N:0:GTAGAGGA GNGGTGCTAACTGGCCTGCGATTGGAGGTAGGTAGCGGTGACACAGCCTGGCAGGATTCTTTTTTTTTTTTTTTT + F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

R2: @A00488:217:H2GNCDRXY:1:1101:1054:1000 2:N:0:GTAGAGGA TNCAAGGGTGAAAAAGCGCCTGATACAAAGGAGAAGAAACCTGCGGCCAAGAAGGCTGGCAGTGATGCTGCTGCC + F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Any clues as to what I should change to increase the gene mapping rate and increase the low UMI count? Apologies in advance for the formatting and the excess information!

alexdobin commented 2 years ago

Hi @theheking

the overall mappability is decent (~78%) and the recovery of CellBarcodes is good (~85%), so the main issue that reads do not align to genes. The first thing to check is the strand, please try --soloStrand Reverse