alexdobin / STAR

RNA-seq aligner

number of bytes expected from the BAM bin does not agree with the actual size on disk #1988


madeleineaaseremedios commented 11 months ago

Hi, I am mapping several transcriptome samples to the same genome using a batch script on Slurm:

#!/bin/bash
#SBATCH -t 02-00:00
#SBATCH -c 12
#SBATCH --mem 16G
#SBATCH --job-name STAR_align
#SBATCH -o %j.out
#SBATCH -e %j.err

module load bioinformatics
module load star/2.7.11a

indir=$1
indexdir=$2

for FILE1 in "$indir"/*_1.fq.gz
do
    # derive the mate-2 file and the output prefix from the mate-1 filename
    # (note the prefix keeps the trailing .gz, e.g. RED3_EYE_1_L2_result.gz)
    FILE2=${FILE1/_1.fq/_2.fq}
    FILE3=${FILE2/_2.fq/_result}

    STAR --runThreadN 12 \
        --readFilesIn "$FILE1" "$FILE2" \
        --genomeDir "$indexdir" \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$FILE3" \
        --outSAMunmapped Within \
        --readFilesCommand gunzip -c \
        --quantMode GeneCounts
done
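
For context, I submit the script with the input directory and the index directory as arguments, e.g. (the script and index names here are just placeholders):

sbatch star_align.sh RED3 star_index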

The first two pairs of input files, with the outputs produced for each:

RED3_EYE_1_L2_1.fq.gz
RED3_EYE_1_L2_2.fq.gz
RED3_EYE_1_L2_result.gzAligned.sortedByCoord.out.bam
RED3_EYE_1_L2_result.gzLog.out
RED3_EYE_1_L2_result.gzLog.progress.out
RED3_EYE_1_L2_result.gzReadsPerGene.out.tab
RED3_EYE_1_L2_result.gzSJ.out.tab
RED3_EYE_1_L2_result.gz_STARtmp
RED3_EYE_2_L3_1.fq.gz
RED3_EYE_2_L3_2.fq.gz
RED3_EYE_2_L3_result.gzAligned.sortedByCoord.out.bam
RED3_EYE_2_L3_result.gzLog.out
RED3_EYE_2_L3_result.gzLog.progress.out
RED3_EYE_2_L3_result.gz_STARtmp

I got this error message regarding the first pair (RED3_EYE_1):

EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: Expected bin size=245997338 ; size on disk=0 ; bin number=47

This is the tail of the RED3_EYE_1_L2_result.gzLog.out file:

Created thread # 8
Created thread # 9
Created thread # 10
Created thread # 11
Starting to map file # 0
mate 1:   RED3/RED3_EYE_1_EKRN230048993-1A_HMTYJDSX7_L3_1.fq.gz
mate 2:   RED3/RED3_EYE_1_EKRN230048993-1A_HMTYJDSX7_L3_2.fq.gz
BAM sorting: 147564 mapped reads
BAM sorting bins genomic start loci:
1   0   2813034

It also seems to have continued on to the second pair of files even though it encountered a fatal error. At least I think so, because it produced some output files/folders for the second pair, and these are absent for the rest of the fastq pairs in this folder.

From looking at other questions/discussions, it seems that I should increase the memory request (my genome is smaller than a mammalian genome, so I thought the upper limit recommended for mammals would suffice) and/or change the BAM output to unsorted. Each fastq file contains about 20-60 million reads, and the genome is 167964029 bytes for this species.
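
If I switch to unsorted output, I believe the command would look like this (untested sketch; sorting afterwards with samtools is my own addition, not something STAR requires):

STAR --runThreadN 12 \
    --readFilesIn "$FILE1" "$FILE2" \
    --genomeDir "$indexdir" \
    --outSAMtype BAM Unsorted \
    --outFileNamePrefix "$FILE3" \
    --outSAMunmapped Within \
    --readFilesCommand gunzip -c \
    --quantMode GeneCounts

# STAR then writes ${FILE3}Aligned.out.bam; sort it in a separate step:
samtools sort -@ 12 -o "${FILE3}Aligned.sortedByCoord.out.bam" "${FILE3}Aligned.out.bam"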

As a side note, I would like to speed up my job if possible, as my HPC has a limit of 7 days, so any tips would be appreciated!!

Thanks for your help!

alexdobin commented 10 months ago

Hi @madeleineaaseremedios

The likely problem is a lack of disk space. To speed up the computation, I would recommend submitting each mapping job to the cluster independently, rather than looping over all samples within a single job.
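
A minimal sketch of that approach, using a Slurm job array with one task per read pair (the array size, time limit, and paths are assumptions to adapt; the STAR command is taken from the original script):

#!/bin/bash
#SBATCH -t 02-00:00
#SBATCH -c 12
#SBATCH --mem 16G
#SBATCH --job-name STAR_align
#SBATCH -o %A_%a.out
#SBATCH -e %A_%a.err
#SBATCH --array=0-23    # zero-based; set to (number of read pairs - 1)

module load bioinformatics
module load star/2.7.11a

indir=$1
indexdir=$2

# select the mate-1 file for this array task
FILES=("$indir"/*_1.fq.gz)
FILE1=${FILES[$SLURM_ARRAY_TASK_ID]}
FILE2=${FILE1/_1.fq/_2.fq}
FILE3=${FILE2/_2.fq/_result}

STAR --runThreadN 12 \
    --readFilesIn "$FILE1" "$FILE2" \
    --genomeDir "$indexdir" \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "$FILE3" \
    --outSAMunmapped Within \
    --readFilesCommand gunzip -c \
    --quantMode GeneCounts

Each pair then maps in parallel, and a fatal error in one task does not affect the others.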