alexdobin / STAR

RNA-seq aligner
MIT License
1.77k stars 497 forks source link

STARsolo CB_samTagOut mode error in combo with STAR diploid #2058

Open mparker2 opened 4 months ago

mparker2 commented 4 months ago

Hi @alexdobin ,

Thanks again for maintaining STAR. I am trying to use STAR to map single cell ATAC reads against a genome generated using the genome parameter --genomeTransformType Diploid using the mapping parameter --soloType CB_samTagOut. However this results in the error:

EXITING because of FATAL GENOME INDEX FILE error: transcriptInfo.tab is corrupt, or is incompatible with the current STAR version
SOLUTION: re-generate genome index
Feb 09 11:43:59 ...... FATAL ERROR, exiting

I used the following parameters to generate the genome:

        STAR \
          --runThreadN 16 \
          --runMode "genomeGenerate" \
          --genomeDir "{output}" \
          --genomeFastaFiles "{input.fasta_fn}" \
          --sjdbGTFfile "{input.gtf_fn}" \
          --genomeTransformVCF "{input.vcf_fn}" \
          --genomeTransformType "Diploid" \
          --genomeSAindexNbases 12 \
          --sjdbOverhang 149

Which runs successfully:

    STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Feb 09 12:05:45 ..... started STAR run
Feb 09 12:05:45 ... starting to generate Genome files
Feb 09 12:05:46 ..... processing annotations GTF
Feb 09 12:05:53 ... starting to sort Suffix Array. This may take a long time...
Feb 09 12:05:54 ... sorting Suffix Array chunks and saving them to disk...
Feb 09 12:07:59 ... loading chunks from disk, packing SA...
Feb 09 12:08:03 ... finished generating suffix array
Feb 09 12:08:03 ... generating Suffix Array index
Feb 09 12:08:16 ... completed Suffix Array index
Feb 09 12:08:16 ..... inserting junctions into the genome indices
Feb 09 12:10:11 ... writing Genome to disk ...
Feb 09 12:10:14 ... writing Suffix Array to disk ...
Feb 09 12:10:37 ... writing SAindex to disk
Feb 09 12:10:38 ..... finished successfully
Feb 09 12:10:38 ... starting to generate Genome files
Feb 09 12:10:40 ..... processing annotations GTF
Feb 09 12:10:42 ... starting to sort Suffix Array. This may take a long time...
Feb 09 12:10:42 ... sorting Suffix Array chunks and saving them to disk...
Feb 09 12:11:12 ... loading chunks from disk, packing SA...
Feb 09 12:11:14 ... finished generating suffix array
Feb 09 12:11:14 ... generating Suffix Array index
Feb 09 12:11:23 ... completed Suffix Array index
Feb 09 12:11:23 ..... inserting junctions into the genome indices
Feb 09 12:12:13 ... writing Genome to disk ...
Feb 09 12:12:14 ... writing Suffix Array to disk ...
Feb 09 12:12:34 ... writing SAindex to disk
Feb 09 12:12:36 ..... finished successfully

followed by this command to map the reads:

        STAR \
          --runThreadN 16 \
          --genomeDir "{input.index}" \
          --readFilesIn "{input.read}" "{input.mate}" "{input.barcode}" \
          --readFilesCommand "zcat" \
          --soloType "CB_samTagOut" \
          --soloCBmatchWLtype "1MM" \
          --soloCBwhitelist "{input.barcode_whitelist}" \
          --soloBarcodeReadLength 0 \
          --outFilterMultimapNmax 1 \
          --outFilterMismatchNmax 4 \
          --alignIntronMax 1 \
          --alignMatesGapMax 1000 \
          --outSAMtype BAM Unsorted \
          --outSAMattributes NH HI AS nM CB sS sQ ha \
          --genomeTransformOutput SAM

Which produces the error.

    STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Feb 09 12:12:47 ..... started STAR run
Feb 09 12:12:47 ..... loading genome

EXITING because of FATAL GENOME INDEX FILE error: transcriptInfo.tab is corrupt, or is incompatible with the current STAR version
SOLUTION: re-generate genome index
Feb 09 12:13:16 ...... FATAL ERROR, exiting

I am confident the genome index is not corrupt, because I can use the same index to map single cell RNA reads using the command:

        STAR \
          --runThreadN 16 \
          --genomeDir "{input.index}" \
          --readFilesIn "{input.mate}" "{input.read_barcode}" \
          --readFilesCommand "zcat" \
          --soloType "CB_UMI_Simple" \
          --soloCBwhitelist "{input.barcode_whitelist}" \
          --soloUMIlen 12 \
          --soloUMIdedup "1MM_Directional_UMItools" \
          --outFilterMultimapNmax 2 \
          --outFilterIntronMotifs "RemoveNoncanonical" \
          --alignSJoverhangMin 12 \
          --alignSJDBoverhangMin 4 \
          --outFilterMismatchNmax 2 \
          --alignIntronMin 60 \
          --alignIntronMax 20000 \
          --outSAMtype "BAM" "SortedByCoordinate" \
          --outBAMsortingBinsN 150 \
          --limitBAMsortRAM {params.sort_mem} \
          --outSAMattributes NH HI AS nM CB UB ha \
          --genomeTransformOutput SAM SJ Quant

This command seems to have no problem loading the genome and begins mapping normally (although I killed the process once the genome appeared to have been loaded successfully).

    STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Feb 09 12:54:08 ..... started STAR run
Feb 09 12:54:09 ..... loading genome
Feb 09 12:54:10 ..... started mapping
^C

Many thanks Matt

alexdobin commented 4 months ago

Hi Matt,

I could not reproduce the error on my datasets. Genome generation: STAR --runMode genomeGenerate --runThreadN 20 --genomeDir ./ --genomeFastaFiles /mnt/scratch0/dobin/Genomes/Human/GRCh38_Gencode39/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /mnt/scratch0/dobin/Genomes/Human/GRCh38_Gencode39/gencode.v39.primary_assembly.annotation.gtf --genomeTransformType Diploid --genomeTransformVCF ../../Genome_NA19238_SNP/NA19238.vcf --genomeSuffixLengthMax 300 Mapping: STAR --runThreadN 16 --outFileNamePrefix ./Runs/soloAtacDiploid/ --genomeDir ./Genomes/Diploid_NA19238 --soloCBwhitelist ./Data/737K-cratac-v1.txt --readFilesPrefix ./Data/atac_pbmc_500_nextgem_fastqs/ --readFilesIn atac_pbmc_500_nextgem_S1_L001_R1_001.fastq.gz atac_pbmc_500_nextgem_S1_L001_R3_001.fastq.gz atac_pbmc_500_nextgem_S1_L001_R2_001.fastq.gz --readFilesCommand zcat --soloType CB_samTagOut --soloCBmatchWLtype 1MM --soloBarcodeReadLength 0 --outFilterMultimapNmax 1 --outFilterMismatchNmax 4 --alignIntronMax 1 --alignMatesGapMax 1000 --outSAMtype BAM Unsorted --outSAMattributes NH HI AS nM CB sS sQ ha --genomeTransformOutput SAM

mparker2 commented 4 months ago

Hi @alexdobin ,

Thanks for trying this. Could this issue be caused by large Indels in my VCF? I have some indels that remove whole genes/transcript models, and I noticed that the number which is reported at the top of the transcriptInfo.tab file, which normally seems to match the number of records in the file, does not match for my STAR diploid generated file, but instead is double the number of records in the OriginalGenome/transcriptInfo.tab (which is what would be expected if all gene models from haplotype 1 were also present in haplotype 2):

$ head OriginalGenome/transcriptInfo.tab 
53728
AT1G01010.1 3630    5898    5898    1   6   0   0
AT1G01020.2 6787    8736    5898    2   8   6   1
AT1G01020.6 6787    8736    8736    2   6   14  1
AT1G01020.1 6787    9129    8736    2   9   20  1
AT1G01020.3 6787    9129    9129    2   8   29  1
AT1G01020.4 6787    9129    9129    2   8   37  1
AT1G01020.5 6787    9129    9129    2   9   45  1
AT1G03987.1 11100   11371   9129    1   1   54  2
AT1G01030.1 11648   13713   11371   2   2   55  3

$ wc -l OriginalGenome/transcriptInfo.tab 
53729 OriginalGenome/transcriptInfo.tab

$ head transcriptInfo.tab 
107456
AT1G01010.1_h1  3630    5898    5898    1   6   0   0
AT1G01020.2_h1  6787    8736    5898    2   8   6   1
AT1G01020.6_h1  6787    8736    8736    2   6   14  1
AT1G01020.1_h1  6787    9129    8736    2   9   20  1
AT1G01020.3_h1  6787    9129    9129    2   8   29  1
AT1G01020.4_h1  6787    9129    9129    2   8   37  1
AT1G01020.5_h1  6787    9129    9129    2   9   45  1
AT1G03987.1_h1  11100   11371   9129    1   1   54  2
AT1G01030.1_h1  11648   13713   11371   2   2   55  3

$ wc -l transcriptInfo.tab 
106925 transcriptInfo.tab

Not sure why this would specifically be a problem for the CB_samTagOut mode though?

Happy to share my STAR index, VCF file and/or some reads if needed!

Best Matt

mparker2 commented 4 months ago

I think this discrepancy is the source of the issue: if I manually changed the number at the top of the diploid transcriptInfo.tab file to match the actual number of records in the file, then STAR loads the index without any complaint and performs mapping successfully.

alexdobin commented 4 months ago

Hi Matt:

You have found the bug! Indeed, if the transcripts are completely deleted, the number has to be updated. I will fix it in the next release. I have not tested transformed genome with large deletions, so you may encounter other issues.

mparker2 commented 4 months ago

So far not encountered any other problems, but will let you know! STAR diploid + STARsolo combo is amazing 🤩