alexdobin / STAR

RNA-seq aligner
MIT License
1.77k stars 495 forks source link

Interaction between STARsolo and STAR diploid #2112

Open mparker2 opened 2 months ago

mparker2 commented 2 months ago

Dear @alexdobin,

I've noticed that reads that are assigned to haplotype 2 by STAR diploid are not processed in the same way as haplotype 1 reads by STARsolo. For instance, I am fairly confident that only reads that map to haplotype 1 or both haplotypes equally well are used for producing STARsolo gene x cell count matrices. Also the UB tags on reads that align to haplotype 2 do not seem to be deduplicated properly and still match the UR tags (tested using the "1MM_Directional_UMItools). This could lead to biases in downstream analyses using either the count matrices or UMIs for quantification, if users are not aware of the limitations of using STARsolo and STAR diploid at the same time.

Best wishes Matt

alexdobin commented 2 months ago

Hi Matt, This could be a bug. Can you extract a few reads that map to hap2 and map them separately to confirm. I will look into it at some point, but not sure when I will have time.

mparker2 commented 2 months ago

Hi Alex,

Thanks for the reply! Heres a bit of evidence for the lack of UMI deduplication on hap2 reads. Across a BAM file of scRNA-seq reads mapped with STAR diploid + STARsolo, using the UMI dedup method 1MM_Directional_UMItools, there is a large difference in the UMI count distributions for ha:2 reads compared to ha:1 and ha:0, with ha:2 reads having many more UMIs that are supported by only a single read. This is probably an indicator of lack of deduplication:

image

Another thing pointing to this is that if I compare the uncorrected UMI tag UR to the UB tag for each read in the BAM file, then for ha:0 and ha:1 tags I find they differ approximately 1.27% of the time (to be precise, 1.278% for ha:0 and 1.269% for ha:1), indicating the UB has been corrected for a small proportion of the reads. For ha:2 reads only 0.004% of UB and UR tags differ, indicating they are not getting corrected at anywhere near the same rate.

Demonstrating this for the count matrices is slightly more tricky as I can't trace the counts back to which reads they originated from, but I can try to generate some examples as you suggest.

mparker2 commented 2 months ago

Hi Alex, I've now done some simulations to test whether hap2 reads from STAR diploid are counted in STARsolo count matrices, & find that they are not currently being counted properly.

Sorry for deleting and reposting... I have removed all the unnecessary code from the post I initially uploaded, to keep it simple. If you need I can share all the code and simulated data for you to retest.

I simulated a 10 kb reference sequence with three 1kb intronless genes, and added ~100 annotated SNPs to each gene as haplotype 2:

fold -w 80 ref.fa | head

    >Chr1
    GGGGCATCCCTCAGTGGGTGTTATCCGGCTTCGAATCCCATCCATCCTACTGGCAAATGGCAAAAATTGTTGTGAAGGGA
    GCTTATAGGGGCTCCGTGTAGGAGGATTGGTCAACATCAGACCGCTCGGCGTCTATGAGGCTGATATATTGTTCGTTATC
    AGATTATTGAATTGGAACAGCAATTGATTTGCCGCCATCATGTACTTAACTCAAAGTGCGGTGAAGTTGTATCCAAGTTA
    GAGCGTGGTGAGTTGCCCCGATTCGAAGGCTTAGGATGCATTAGTAATCAGAGACTTTCTAGGCACGATAAGTAAGGCGA
    GTGTATCGCATGGTGTAGGTTGGTTGCTGGGAATCTCGGAGAGTGGGTCCGCGAGTGCTTGTTGTCCCTGGGTCCATAAG
    ACCTCCCCAGACTTCTGTAGACCCCATGATTTGCTACCCTTTCACACTTCAACGGCCGCCTGGTACTACTGGGAAGAGGG
    GAGAGATCCCCACAGCCTAGCTGTCGTAAAATAAGGCGGACCGTTAAGCTACTCTCTCAATTGGTGGCCCCAGTTTCGGC
    CAGCTTGAACGGGTTTGTGGGTAGAATGCACCTATCCGGCAATTCGATGGCTGTGCCTCTTTCACCCATCTCGCTATGGC
    GATTTGTAGGCGCGTGAAGCGTCAGGTGAACGACGAACTCCCGTATACGCTATGAAGCCCCTGACTTAGACGAAAATCCT

head annot.gtf

    Chr1    sim transcript  1001    2000    .   +   .   gene_id "gene1"; transcript_id "gene1.1";
    Chr1    sim exon    1001    2000    .   +   .   gene_id "gene1"; transcript_id "gene1.1";
    Chr1    sim transcript  3001    4000    .   +   .   gene_id "gene2"; transcript_id "gene2.1";
    Chr1    sim exon    3001    4000    .   +   .   gene_id "gene2"; transcript_id "gene2.1";
    Chr1    sim transcript  5001    7000    .   +   .   gene_id "gene3"; transcript_id "gene3.1";
    Chr1    sim exon    5001    7000    .   +   .   gene_id "gene3"; transcript_id "gene3.1";

head var.vcf

    ##fileformat=VCFv4.3
    ##source=sim
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HAP2
    Chr1    1003    snp1    A   T   .   PASS    .   GT  0|1
    Chr1    1010    snp2    C   A   .   PASS    .   GT  0|1
    Chr1    1013    snp3    A   G   .   PASS    .   GT  0|1
    Chr1    1014    snp4    G   C   .   PASS    .   GT  0|1
    Chr1    1022    snp5    T   G   .   PASS    .   GT  0|1
    Chr1    1025    snp6    C   A   .   PASS    .   GT  0|1
    Chr1    1046    snp7    C   T   .   PASS    .   GT  0|1

These were used to generate a diploid index using the following command:

STAR \
  --runThreadN 2 \
  --runMode genomeGenerate \
  --genomeDir STAR_index \
  --genomeFastaFiles ref.fa \
  --sjdbGTFfile annot.gtf \
  --genomeTransformVCF var.vcf \
  --genomeTransformType Diploid \
  --genomeSAindexNbases 5 \
  --sjdbOverhang 150

        STAR --runThreadN 2 --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles ref.fa --sjdbGTFfile annot.gtf --genomeTransformVCF var.vcf --genomeTransformType Diploid --genomeSAindexNbases 5 --sjdbOverhang 150
        STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
    Apr 25 13:32:47 ..... started STAR run
    Apr 25 13:32:47 ... starting to generate Genome files
    Apr 25 13:32:47 ..... processing annotations GTF
    Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
    Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
    Apr 25 13:32:47 ... loading chunks from disk, packing SA...
    Apr 25 13:32:47 ... finished generating suffix array
    Apr 25 13:32:47 ... generating Suffix Array index
    Apr 25 13:32:47 ... completed Suffix Array index
    Apr 25 13:32:47 ... writing Genome to disk ...
    Apr 25 13:32:47 ... writing Suffix Array to disk ...
    Apr 25 13:32:47 ... writing SAindex to disk
    Apr 25 13:32:47 ..... finished successfully
    Apr 25 13:32:47 ... starting to generate Genome files
    Apr 25 13:32:47 ..... processing annotations GTF
    Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
    Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
    Apr 25 13:32:47 ... loading chunks from disk, packing SA...
    Apr 25 13:32:47 ... finished generating suffix array
    Apr 25 13:32:47 ... generating Suffix Array index
    Apr 25 13:32:47 ... completed Suffix Array index
    Apr 25 13:32:47 ... writing Genome to disk ...
    Apr 25 13:32:47 ... writing Suffix Array to disk ...
    Apr 25 13:32:47 ... writing SAindex to disk
    Apr 25 13:32:47 ..... finished successfully

I then simulated 10 cells, 5 from each haplotype, and for each I simulate 100 reads (with unique UMIs) per gene, to give a total of 1500 hap1 reads, and 1500 hap2 reads.


STAR \
  --runThreadN 2 \
  --genomeDir "STAR_index" \
  --readFilesIn read.2.fastq read.1.fastq \
  --soloType "CB_UMI_Simple" \
  --soloCBwhitelist cb_whitelist.txt \
  --soloUMIlen 12 \
  --soloUMIdedup "Exact" \
  --outSAMtype SAM \
  --outSAMattributes NH HI AS nM ha \
  --genomeTransformOutput SAM

        STAR --runThreadN 2 --genomeDir STAR_index --readFilesIn read.2.fastq read.1.fastq --soloType CB_UMI_Simple --soloCBwhitelist cb_whitelist.txt --soloUMIlen 12 --soloUMIdedup Exact --outSAMtype SAM --outSAMattributes NH HI AS nM ha --genomeTransformOutput SAM
        STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
    Apr 25 13:32:47 ..... started STAR run
    Apr 25 13:32:47 ..... loading genome
    Apr 25 13:32:47 ..... started mapping
    Apr 25 13:32:48 ..... finished mapping
    Apr 25 13:32:48 ..... started Solo counting
    Apr 25 13:32:48 ..... finished Solo counting
    Apr 25 13:32:48 ..... finished successfully

100% of reads align to the genome, nearly 100% of them are able to be assigned to a specific haplotype:

cat Log.final.out

                                     Started job on |   Apr 25 13:32:47
                                 Started mapping on |   Apr 25 13:32:47
                                        Finished on |   Apr 25 13:32:48
           Mapping speed, Million of reads per hour |   10.80

                              Number of input reads |   3000
                          Average input read length |   150
                                        UNIQUE READS:
                       Uniquely mapped reads number |   3000
                            Uniquely mapped reads % |   100.00%
                              Average mapped length |   150.00
                           Number of splices: Total |   0
                Number of splices: Annotated (sjdb) |   0
                           Number of splices: GT/AG |   0
                           Number of splices: GC/AG |   0
                           Number of splices: AT/AC |   0
                   Number of splices: Non-canonical |   0
                          Mismatch rate per base, % |   0.00%
                             Deletion rate per base |   0.00%
                            Deletion average length |   0.00
                            Insertion rate per base |   0.00%
                           Insertion average length |   0.00
                                 MULTI-MAPPING READS:
            Number of reads mapped to multiple loci |   0
                 % of reads mapped to multiple loci |   0.00%
            Number of reads mapped to too many loci |   0
                 % of reads mapped to too many loci |   0.00%
                                      UNMAPPED READS:
      Number of reads unmapped: too many mismatches |   0
           % of reads unmapped: too many mismatches |   0.00%
                Number of reads unmapped: too short |   0
                     % of reads unmapped: too short |   0.00%
                    Number of reads unmapped: other |   0
                         % of reads unmapped: other |   0.00%
                                      CHIMERIC READS:
                           Number of chimeric reads |   0
                                % of chimeric reads |   0.00%

grep -o -E "ha:i:\d" Aligned.out.sam | sort | uniq -c

      18 ha:i:0
    1489 ha:i:1
    1493 ha:i:2

If we look at the count matrix for the cells with reads simulated from haplotype 1, they are quantified perfectly:

TCCTCGTTGCAGTAGG TGCAGAGGTAGTCACT ACACATAAGATCGCAC TAAGCTTCATCAGTTT GCCTTGTCATTGGAGT
gene1 100.0 100.0 100.0 100.0 100.0
gene2 100.0 100.0 100.0 100.0 100.0
gene3 100.0 100.0 100.0 100.0 100.0

But for the haplotype 2 cells, the reads that were assigned to hap2 are not counted at all. The low counts that we do see probably come from reads that are ambiguous (could be hap1 or hap2)

TCGCTCTGCCCCTAGA CCGTGAAATCCCAGGG AAAGCGTGTAATGGCC GGGTTGGCCAGATTGC ACACACGCACAGCTTA
gene1 0.0 0.0 0.0 0.0 0.0
gene2 0.0 0.0 0.0 0.0 0.0
gene3 4.0 0.0 0.0 3.0 0.0