mparker2 opened 4 months ago
Hi Matt,
I could not reproduce the error on my datasets.
Genome generation:
```
STAR --runMode genomeGenerate --runThreadN 20 --genomeDir ./ \
    --genomeFastaFiles /mnt/scratch0/dobin/Genomes/Human/GRCh38_Gencode39/GRCh38.primary_assembly.genome.fa \
    --sjdbGTFfile /mnt/scratch0/dobin/Genomes/Human/GRCh38_Gencode39/gencode.v39.primary_assembly.annotation.gtf \
    --genomeTransformType Diploid \
    --genomeTransformVCF ../../Genome_NA19238_SNP/NA19238.vcf \
    --genomeSuffixLengthMax 300
```
Mapping:
```
STAR --runThreadN 16 --outFileNamePrefix ./Runs/soloAtacDiploid/ \
    --genomeDir ./Genomes/Diploid_NA19238 \
    --soloCBwhitelist ./Data/737K-cratac-v1.txt \
    --readFilesPrefix ./Data/atac_pbmc_500_nextgem_fastqs/ \
    --readFilesIn atac_pbmc_500_nextgem_S1_L001_R1_001.fastq.gz atac_pbmc_500_nextgem_S1_L001_R3_001.fastq.gz atac_pbmc_500_nextgem_S1_L001_R2_001.fastq.gz \
    --readFilesCommand zcat \
    --soloType CB_samTagOut --soloCBmatchWLtype 1MM --soloBarcodeReadLength 0 \
    --outFilterMultimapNmax 1 --outFilterMismatchNmax 4 \
    --alignIntronMax 1 --alignMatesGapMax 1000 \
    --outSAMtype BAM Unsorted \
    --outSAMattributes NH HI AS nM CB sS sQ ha \
    --genomeTransformOutput SAM
```
Hi @alexdobin ,
Thanks for trying this. Could the issue be caused by large indels in my VCF? Some of my indels delete whole genes/transcript models, and I noticed that the count reported on the first line of transcriptInfo.tab, which normally matches the number of records in the file, does not match in my STAR diploid index. Instead it is exactly double the record count of OriginalGenome/transcriptInfo.tab (107456 = 2 × 53728), which is what you would expect if every gene model from haplotype 1 were also present in haplotype 2, yet the diploid file actually holds only 106924 records:
```
$ head OriginalGenome/transcriptInfo.tab
53728
AT1G01010.1 3630 5898 5898 1 6 0 0
AT1G01020.2 6787 8736 5898 2 8 6 1
AT1G01020.6 6787 8736 8736 2 6 14 1
AT1G01020.1 6787 9129 8736 2 9 20 1
AT1G01020.3 6787 9129 9129 2 8 29 1
AT1G01020.4 6787 9129 9129 2 8 37 1
AT1G01020.5 6787 9129 9129 2 9 45 1
AT1G03987.1 11100 11371 9129 1 1 54 2
AT1G01030.1 11648 13713 11371 2 2 55 3
$ wc -l OriginalGenome/transcriptInfo.tab
53729 OriginalGenome/transcriptInfo.tab

$ head transcriptInfo.tab
107456
AT1G01010.1_h1 3630 5898 5898 1 6 0 0
AT1G01020.2_h1 6787 8736 5898 2 8 6 1
AT1G01020.6_h1 6787 8736 8736 2 6 14 1
AT1G01020.1_h1 6787 9129 8736 2 9 20 1
AT1G01020.3_h1 6787 9129 9129 2 8 29 1
AT1G01020.4_h1 6787 9129 9129 2 8 37 1
AT1G01020.5_h1 6787 9129 9129 2 9 45 1
AT1G03987.1_h1 11100 11371 9129 1 1 54 2
AT1G01030.1_h1 11648 13713 11371 2 2 55 3
$ wc -l transcriptInfo.tab
106925 transcriptInfo.tab
```
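To make this check easy to repeat, here is a small sketch that compares the count on the first line of a transcriptInfo.tab against the number of records that follow it. It builds a deliberately mismatched toy file for illustration; in practice, point the path at the real diploid genome directory.

```shell
# Build a toy transcriptInfo.tab whose header count (3) disagrees
# with the actual number of records (2), mimicking the bug.
cat > transcriptInfo.tab <<'EOF'
3
AT1G01010.1_h1 3630 5898 5898 1 6 0 0
AT1G01020.2_h1 6787 8736 5898 2 8 6 1
EOF

# First line holds the declared count; records = total lines minus header.
header=$(head -n 1 transcriptInfo.tab)
records=$(( $(wc -l < transcriptInfo.tab) - 1 ))
echo "header=$header records=$records"
if [ "$header" -ne "$records" ]; then
    echo "mismatch: header should be $records"
fi
```

On a healthy index the two numbers agree; on the diploid index above, they do not.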
I am not sure why this would specifically be a problem for the --soloType CB_samTagOut mode, though?
Happy to share my STAR index, VCF file and/or some reads if needed!
Best, Matt
I think this discrepancy is the source of the issue: if I manually change the number on the first line of the diploid transcriptInfo.tab to match the actual number of records in the file, STAR loads the index without complaint and mapping completes successfully.
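The manual workaround described above can be sketched as follows. This is demonstrated on a toy file; back up the real transcriptInfo.tab before editing it in place.

```shell
# Toy file with a stale header count (5) but only 3 records.
cat > transcriptInfo.tab <<'EOF'
5
AT1G01010.1_h1 3630 5898 5898 1 6 0 0
AT1G01020.2_h1 6787 8736 5898 2 8 6 1
AT1G01020.6_h1 6787 8736 8736 2 6 14 1
EOF

# Recompute the true record count (total lines minus the header line).
records=$(( $(wc -l < transcriptInfo.tab) - 1 ))

# Overwrite only line 1 with the corrected count
# (GNU sed; on macOS use `sed -i ''` instead of `sed -i`).
sed -i "1s/.*/$records/" transcriptInfo.tab
head -n 1 transcriptInfo.tab
```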
Hi Matt:
You have found the bug! Indeed, if transcripts are completely deleted by the VCF variants, the count has to be updated. I will fix it in the next release. I have not tested the transformed genome with large deletions, so you may encounter other issues.
So far I have not encountered any other problems, but I will let you know! The STAR diploid + STARsolo combo is amazing 🤩
Hi @alexdobin ,
Thanks again for maintaining STAR. I am trying to map single-cell ATAC reads against a genome generated with the parameter --genomeTransformType Diploid, using the mapping parameter --soloType CB_samTagOut. However, this results in an error. I used the following parameters to generate the genome:
This runs successfully:
followed by this command to map the reads:
This produces the error.
I am confident the genome index is not corrupt, because I can map single-cell RNA reads against the same index using the command:
This command seems to have no problem loading the genome and begins mapping normally (although I killed the process once the genome appeared to have been loaded successfully).
Many thanks, Matt