alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

Beyond the chromosome bug #1548

Open fengchuiguo1994 opened 2 years ago

fengchuiguo1994 commented 2 years ago

Hi, I recently ran into a very rare bug where I added U1,U2,U3,U4,U5,U6 and other snoRNAs to the genome and then defaulted the parameters to build the index. Then map my data on it. Then the program ran through without any error. But when I open the bam file with samtools view, it reports an error. I went to check and found that there is a reads that it matches to U12, but the starting site of the match is 262143. but the total length of U12 is 157

build genome index: STAR --runMode genomeGenerate --genomeDir genomeindex --genomeFastaFiles genome.fa --sjdbGTFfile genome.gtf --runThreadN 10

samtools index genome: cat genome.forRICseq.fa.fai | grep U12 U12 157 30540 157 158

mapping: STAR --runMode alignReads --genomeDir ${star_genome_index} --readFilesIn single.fastq.gz --readFilesCommand zcat --runThreadN ${threads} --outFileNamePrefix single.genome --outFilterMultimapNmax 100 --alignIntronMin 1 --scoreGapNoncan -4 --scoreGapATAC -4 --chimSegmentMin 15 --chimJunctionOverhangMin 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 10 --alignSJstitchMismatchNmax 5 -1 5 5 --chimOutType SeparateSAMold --outSAMattributes All --outSAMtype BAM Unsorted --outSAMunmapped Within --limitOutSJcollapsed 32000000 --limitIObufferSize 1200000000 --outFilterMatchNminOverLread 0.5 --outFilterScoreMinOverLread 0.5

open bamfile (other reads is ok) samtools view single.genome.Aligned.out.bam | grep A00129:1011:HKV3GDSX2:2:2351:2962:36432 [E::sam_format1] Corrupted aux data for read A00129:1011:HKV3GDSX2:2:2351:2962:36432 samtools view: writing to standard output failed: Invalid argument

use python script ( pysam ) to find the reads: aa.query_name
Out[43]: 'A00129:1011:HKV3GDSX2:2:2351:2962:36432' aa.reference_name
Out[44]: 'U12' aa.reference_start
Out[45]: 262143 aa.cigarstring
Out[46]: '62S86M2S'

I have 7 data has the bug (total: 78). If possible, I can send my data and genome to you.

fengchuiguo1994 commented 2 years ago

After my attempts, I found that it was probably because of the genomeSAindexNbases parameter, and the bug recurred with the default value of 14 (my genome length is about 400,000,000. MH63, log2(400000000)/2 - 1=13.28771 ). After I used bwa to determine the correct position for the reads. Then I regenerated the genome index with --genomeSAindexNbases 6 and then map the reads again and got the same result as bwa.

I don't understand how the genomeSAindexNbases parameter affects the mapping results, but at least now it looks like I'm getting the correct mapping results.

alexdobin commented 2 years ago

Hi @fengchuiguo1994

what version of STAR are you using? The newest versions fixed some of the issues with --genomeSAindexNbases . However, reducing its value is the only way to solve the issue in some cases.

fengchuiguo1994 commented 2 years ago

Hi: the version is: 2.6.1a

STAR -h Usage: STAR [options]... --genomeDir REFERENCE --readFilesIn R1.fq R2.fq Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2015

versions

versionSTAR 020201 int>0: STAR release numeric ID. Please do not change this value! versionGenome 020101 020200 int>0: oldest value of the Genome version compatible with this STAR release. Please do not change this value!

alexdobin commented 2 years ago

Hi @fengchuiguo1994 this is a very old version. Please switch to 2.7.9a. The newer versions fixed some of the issues with --genomeSAindexNbases .

fengchuiguo1994 commented 2 years ago

Hi: Is genomeSAindexNbases still important in 2.7.9a like in 2.6. Does the bug go away with default parameters. I'll change the software version for the next project.

alexdobin commented 2 years ago

Hi @fengchuiguo1994

it should go away with default parameters, but it's still recommended to scale it down for efficiency reasons.