alexdobin / STAR

RNA-seq aligner
MIT License
1.77k stars 497 forks source link

STAR diploid sometimes labels reads as haplotype specific in regions with no variants #2085

Open mparker2 opened 3 months ago

mparker2 commented 3 months ago

Hi @alexdobin ,

I came across an interesting corner case in my analysis using STAR diploid, which was run with the following command:

        STAR \
          --runThreadN {threads} \
          --genomeDir "$TOPDIR/{input.index}" \
          --readFilesIn "$TOPDIR/{input.read}" "$TOPDIR/{input.mate}" "$TOPDIR/{input.barcode}" \
          --readFilesCommand "zcat" \
          --soloType "CB_samTagOut" \
          --soloCBmatchWLtype "1MM" \
          --soloCBwhitelist "$TOPDIR/{input.barcode_whitelist}" \
          --soloBarcodeReadLength 0 \
          --outFilterMultimapNmax 1 \
          --outFilterMismatchNmax 0 \
          --alignMatesGapMax 500 \
          --alignIntronMax 1 \
          --outSAMtype BAM Unsorted \
          --outSAMattributes NH HI AS nM CB sS sQ ha \
          --genomeTransformOutput SAM

Because of the use of --outFilterMultimapNmax 1, the resulting BAM files only contain reads that can distinguish one haplotype from the other. These should all theoretically overlap the position of at least one of the variants in the VCF file used during indexing, however I found that a small number of haplotype-labelled alignments fall in regions that do not contain any variants at all:

Screenshot_2024-03-06_18-25-23

The top track shows the VCF that was used to build the STAR diploid index. The lower track shows the read alignments, grouped by ha tag. Variants from the region on the left (up to ~11,737 kb) have been masked, so no reads should align in this region. However there are some that still do.

Anecdotally this seems to mostly occur at repetitive elements or in the centromere. I was wondering if there is some stochasticity or heuristic in the mapping procedure, that could cause the reads to only map in one haplotype and not the other, even when the sequence is present in both? If this is the case, perhaps a check during the genomic coordinate transformation procedure could be used to prevent incorrect labelling of a unique haplotype?

Best wishes Matt

alexdobin commented 3 months ago

Hi Matt,

This should not happen, in principle. If you can extract a few reads like that from SAM, I can take a closer look.

mparker2 commented 3 months ago

Sure, attached. do you also need the indices used for mapping? sample.zip