alexdobin / STAR

RNA-seq aligner
MIT License

STARSolo paired-end data #1934

Open mvanins opened 1 year ago

mvanins commented 1 year ago

Hi,

I am trying to map paired-end single-cell RNA-seq data with STARSolo. The cell barcode (CB) and UMI are located on a separate third read (in reality the UMI and CB are located on different reads, but I moved them in a separate pre-processing step so that they are compatible).

Mapping fails with the following error:

EXITING because of FATAL ERROR: read files are not consistent, reached the end of the one before the other one
SOLUTION: Check you your input files: they may be corrupted

but I have verified that the fastq input files are properly formatted (same number of reads in the same order). Additionally, the same command successfully completes with no error with all of the two-read combinations (e.g., R1 and R3, R1 and R2, R2 and R3).

The full command is:

 STAR \
   --readFilesType Fastx \
   --readFilesCommand zcat \
   --outSAMtype BAM Unsorted \
   --quantTranscriptomeBan Singleend \
   --quantMode TranscriptomeSAM GeneCounts \
   --seedSearchStartLmax 10 \
   --alignIntronMax 1000000 \
   --peOverlapNbasesMin 5 \
   --outFilterType BySJout \
   --alignSJoverhangMin 8 \
   --outFilterScoreMin 0 \
   --chimScoreSeparation 10 \
   --chimScoreMin 20 \
   --chimSegmentMin 15 \
   --chimOutType WithinBAM \
   --outFilterMismatchNmax 5 \
   --outFilterMultimapNmax 1 \
   --runThreadN 6 \
   --genomeDir ${starIndex} \
   --outTmpDir ./${library}_tmp \
   --readFilesIn R1.fastq.gz R2.fastq.gz R3.fastq.gz \
   --outFileNamePrefix ${library}_ \
   --outSAMattributes NH HI AS nM NM MD jM jI MC ch GX GN CR UR \
   --soloType CB_samTagOut \
   --soloCBmatchWLtype Exact \
   --soloCBwhitelist ${whitelist} \
   --soloBarcodeMate 0 \
   --soloCBstart 1 \
   --soloCBlen 10 \
   --soloUMIstart 11 \
   --soloUMIlen 10 \
   --soloBarcodeReadLength 0 \
   --soloStrand Forward \
   --soloFeatures Gene \
   --soloCellFilter None

Tested with both STAR 2.7.10a and 2.7.11a.

Do you have any ideas what I might be able to try?

Thanks, Mike

alexdobin commented 1 year ago

Hi Mike,

I would recommend starting with just one read in each file, and also the most basic parameters.
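
A minimal sketch of how such a small test set could be cut from the gzipped inputs. The helper name `take_records` and the output file names are hypothetical, and it assumes standard four-line FASTQ records:

```shell
# take_records N IN.fastq.gz OUT.fastq
# Writes the first N records (4 lines each) of a gzipped FASTQ file to OUT.
# Assumes every record occupies exactly four lines.
take_records() {
  gzip -dc "$2" | head -n $(( $1 * 4 )) > "$3"
}
```

For example, `take_records 1 R1.fastq.gz R1.one.fastq`, repeated for R2 and R3, gives three one-read files. These can then be passed to `--readFilesIn` (without `--readFilesCommand zcat`, since they are uncompressed) together with only `--genomeDir` and the solo options needed to describe the barcode read.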

Shruti-BioCode commented 1 year ago

Hi

I am having a similar issue with paired-end bulk RNA-seq data. I am getting the following error:

"EXITING because of FATAL ERROR: read files are not consistent, reached the end of the one before the other one
SOLUTION: Check you your input files: they may be corrupted"

I have cross-checked with diff that R1 and R2 are in the same order and contain the same number of reads.

My command is

 STAR \
   --runThreadN 15 \
   --runMode alignReads \
   --quantMode GeneCounts \
   --genomeDir /analysis/reference/hg38/13042022/genome/ \
   --readFilesIn S1_R1_001.fastq S1_R2_001.fastq \
   --outFileNamePrefix ${alignmentDir}/S1 \
   --outSAMtype BAM SortedByCoordinate

alexdobin commented 1 year ago

This likely indicates an issue with the formatting of the files.

Shruti-BioCode commented 1 year ago

Hi Alex, thanks for the reply. The FASTQ files were generated directly with bcl2fastq after sequencing, so I am not sure what the problem could be. Running head and tail on the files shows no problems.
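
Since head and tail only show the two ends of a file, a full pass over every record can catch problems in the middle. The following is a rough sketch, not STAR's own validation: the function name `check_fastq` is hypothetical, and it assumes uncompressed four-line FASTQ records.

```shell
# check_fastq FILE
# Scans an uncompressed FASTQ file and reports the first structural problem:
# a header without "@", a separator without "+", a quality string whose
# length differs from the sequence, or a file that ends mid-record.
# Returns 0 if no problem is found, 1 otherwise.
check_fastq() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      print "line " NR ": expected header starting with @"; bad = 1; exit
    }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      print "line " NR ": expected separator starting with +"; bad = 1; exit
    }
    NR % 4 == 0 && length($0) != seqlen {
      print "line " NR ": quality length differs from sequence length"; bad = 1; exit
    }
    END {
      if (!bad && NR % 4 != 0) {
        print "file ends mid-record after line " NR; bad = 1
      }
      exit bad + 0
    }
  ' "$1"
}
```

Running `check_fastq S1_R1_001.fastq` and `check_fastq S1_R2_001.fastq` prints the first offending line number, if any. A stray blank line, for instance, shifts every later record and is reported as a header problem.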

alexdobin commented 1 year ago

Hi Shruti,

You can try to find the problem by mapping a subset of the reads with --readMapNumber. To increase speed, use --outSAMtype None. In principle, you can use a binary search to pinpoint the exact place in the file that has bad formatting.
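
The binary search described above could be sketched roughly as follows. The function names `find_first_bad_read` and `star_ok` are hypothetical. The sketch assumes that STAR exits with a non-zero status on the fatal error, that mapping all reads is already known to fail, and that failures are monotonic (once the first N reads fail, any larger N fails too).

```shell
# find_first_bad_read TOTAL CMD [ARGS...]
# CMD is called with a read count N appended and must return 0 if the
# first N reads can be processed without error.
# Invariant: the first fb_lo reads are fine, the first fb_hi reads fail.
find_first_bad_read() {
  fb_lo=0
  fb_hi=$1
  shift
  while [ $((fb_hi - fb_lo)) -gt 1 ]; do
    fb_mid=$(( (fb_lo + fb_hi) / 2 ))
    if "$@" "$fb_mid"; then
      fb_lo=$fb_mid
    else
      fb_hi=$fb_mid
    fi
  done
  echo "$fb_hi"   # smallest read count at which the run fails
}

# Hypothetical wrapper: map only the first N reads and discard the alignments.
# File names and the genome path are taken from the command earlier in the thread.
star_ok() {
  STAR --genomeDir /analysis/reference/hg38/13042022/genome/ \
       --readFilesIn S1_R1_001.fastq S1_R2_001.fastq \
       --readMapNumber "$1" \
       --outSAMtype None \
       --outFileNamePrefix "bisect_$1_" > /dev/null 2>&1
}
```

With the total read count computed as `$(( $(wc -l < S1_R1_001.fastq) / 4 ))`, calling `find_first_bad_read "$total" star_ok` needs about log2(total) STAR runs. The reported position may only be approximate if STAR reads input ahead of the requested read count, so it is worth inspecting the records around it in both files.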