alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length #1732

Open LJK1991 opened 1 year ago

LJK1991 commented 1 year ago

Hello,

I am currently trying to replicate data from this paper using STARsolo. In short they perform a SPLiT-seq experiment on salt water worms. I have downloaded their data through SRAtools from their GEO Downloaded all the other required components, such as the genome and gtf files. I then replicated their read trimming with cutadapt, as mentioned in their paper for cDNA/read1: cutadapt -j 4 -m 60 -q 10 -b AGATCGGAAGAG for cellBC/read2 cutadapt -j 4 -m 94 --trim-n -q 10 -b CTGTCTCTTATA

I then concatenated all read1.fastq.gz to eachother and did the same for read2.fastq.gz creating two files. one consisting all read1 files and the second consisting of all read2 files. e.g. zcat file1.fastq.gz file2.fastq.gz file3.fastq.gz > total.fastq.gz

I then run the following STARsolo command:

STAR --runThreadN 24 --genomeDir /media/draco/lucask/AlgorithmTest/ACME/genome/GenomeDir/ --readFilesIn /media/draco/lucask/AlgorithmTest/ACME/data/ACME_read1.fastq.gz /media/draco/lucask/AlgorithmTest/ACME/data/ACME_read2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --sjdbGTFfile /media/draco/lucask/AlgorithmTest/ACME/genome/combined/Djap_Smid.gtf --soloType CB_UMI_Complex --soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93 --soloUMIposition 0_0_0_9 --soloCBwhitelist /media/draco/lucask/SPLTsq_Round1BC.txt /media/draco/lucask/SPLTsq_Round2BC.txt /media/draco/lucask/SPLTsq_Round3BC.txt --soloCBmatchWLtype EditDist_2 --soloBarcodeReadLength 0 --soloFeatures Gene GeneFull --soloCellReadStats Standard --outTmpDir /media/draco/lucask/tmp/ACME/ --runDirPerm All_RWX

STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Dec 28 14:09:08 ..... started STAR run
Dec 28 14:09:08 ..... loading genome
Dec 28 14:14:51 ..... processing annotations GTF
Dec 28 14:15:08 ..... inserting junctions into the genome indices
Dec 28 14:16:40 ..... started mapping

EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
@SRR11768232.393468811
+
CGATGTGGTTGATGAATGCATGAAA
SOLUTION: fix your fastq file

I was confused and to check what is going on i ran

zgrep -B4 -A8 "@SRR11768232.393468811" ACME_read1.fastq.gz

@SRR11768232.393468810 393468810 length=150
GAATAATAAAATAAGACTTCGGTTTTATTTTGTTGGTTTTCGAAACTGAAGTAATGATTAAAAGAGACTGCCGGGGGCATATGTATGCTGGTGCTAGAGGTGAAATTCTTAGATCATCAGCAGACAGTCTACTGCGAAAGCATTTGCCAA
+
FFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF
@SRR11768232.393468811 393468811 length=150
GTCGAATGCTCTAGCCTCTCAAGCACGTGGATTGTATGCGAGTCGTACGCCGATGCGAAACATCGGCCACGCCTGTTCGCCTGGGGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF
@SRR11768232.393468813 393468813 length=150
AAATAGTACATTTGTTTTAAAATAGCAAATGCTGACTTCTTAGAGGAATAAATAGCGTTTAGCTAAACGAAAAAAAAAAAAAAATCACGTTCGAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGTCGTACGCCGATGCGAAAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,F,FFF:FFFFFFFFFFFFFFFFFFFF:F::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:
@SRR11768232.393468814 393468814 length=150

This shows that the fastq file is ok. When i run zcat ACME_read1.fastq.gz | wc -l i get 3812742300 which can be divided the number by 4, showing that i dont have incomplete reads.

I noticed that the output provided gives a read ID i can find, but the sequence shown i cannot (in the ouput of the previous zgrep). I then ran zgrep -B4 -A4 "CGATGTGGTTGATGAATGCATGAAA" ACME_read1.fastq.gz which turned up nothing

I also checked the read2 file. which has significantly less reads (~180 mil) than read 1. Moreover it does not contain the '@SRR11768232.393468811' read ID. However this is most likely because of the cutadapt being different for both reads. But if i understand STARsolo correctly it does not perform mapping with the second input .fastq file, assuming it is the file containing the cell barcodes and treating it differently and therefore should not be the cause of this problem. please correct me if i am wrong.

When doing some googling and reading fora including some issues on this github i found that there are many programs that can mess up a fastq file. A post suggested that zcat, which i used, could leave a "FILE #" stamp for each file. I ran zgrep -B4 -A4 -i "file" ACME_read1.fastq.gz wich returned nothing.

Finally i saw in that in this and this issue to check the file with hexdump -c so i ran

zgrep -B4 -A4 "@SRR11768232.393468811" ACME_read1.fastq | hexdump -c
0000000   @   S   R   R   1   1   7   6   8   2   3   2   .   3   9   3
0000010   4   6   8   8   1   0       3   9   3   4   6   8   8   1   0
0000020       l   e   n   g   t   h   =   1   5   0  \n   G   A   A   T
0000030   A   A   T   A   A   A   A   T   A   A   G   A   C   T   T   C
0000040   G   G   T   T   T   T   A   T   T   T   T   G   T   T   G   G
0000050   T   T   T   T   C   G   A   A   A   C   T   G   A   A   G   T
0000060   A   A   T   G   A   T   T   A   A   A   A   G   A   G   A   C
0000070   T   G   C   C   G   G   G   G   G   C   A   T   A   T   G   T
0000080   A   T   G   C   T   G   G   T   G   C   T   A   G   A   G   G
0000090   T   G   A   A   A   T   T   C   T   T   A   G   A   T   C   A
00000a0   T   C   A   G   C   A   G   A   C   A   G   T   C   T   A   C
00000b0   T   G   C   G   A   A   A   G   C   A   T   T   T   G   C   C
00000c0   A   A  \n   +  \n   F   F   F   F   F   F   F   F   F   F   F
00000d0   F   F   F   F   :   F   :   F   F   F   F   F   F   F   F   F
00000e0   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
00000f0   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   :
0000100   F   F   F   :   F   F   F   F   F   F   F   F   F   F   F   F
0000110   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000120   F   F   F   :   F   F   F   F   F   F   F   F   F   F   F   F
0000130   :   F   :   F   F   F   F   F   F   F   F   F   F   F   F   F
0000140   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000150   F   F   F   F   F   F   F   F   :   F   F  \n   @   S   R   R
0000160   1   1   7   6   8   2   3   2   .   3   9   3   4   6   8   8
0000170   1   1       3   9   3   4   6   8   8   1   1       l   e   n
0000180   g   t   h   =   1   5   0  \n   G   T   C   G   A   A   T   G
0000190   C   T   C   T   A   G   C   C   T   C   T   C   A   A   G   C
00001a0   A   C   G   T   G   G   A   T   T   G   T   A   T   G   C   G
00001b0   A   G   T   C   G   T   A   C   G   C   C   G   A   T   G   C
00001c0   G   A   A   A   C   A   T   C   G   G   C   C   A   C   G   C
00001d0   C   T   G   T   T   C   G   C   C   T   G   G   G   G   A   A
00001e0  \n   +  \n   F   F   F   F   F   F   F   F   F   F   F   F   F
00001f0   F   F   F   F   F   F   F   F   F   F   F   F   F   F   ,   F
0000200   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
*
0000230   F   F   F   F   :   F   F   F   F   F   F  \n   @   S   R   R
0000240   1   1   7   6   8   2   3   2   .   3   9   3   4   6   8   8
0000250   1   3       3   9   3   4   6   8   8   1   3       l   e   n
0000260   g   t   h   =   1   5   0  \n   A   A   A   T   A   G   T   A
0000270   C   A   T   T   T   G   T   T   T   T   A   A   A   A   T   A
0000280   G   C   A   A   A   T   G   C   T   G   A   C   T   T   C   T
0000290   T   A   G   A   G   G   A   A   T   A   A   A   T   A   G   C
00002a0   G   T   T   T   A   G   C   T   A   A   A   C   G   A   A   A
00002b0   A   A   A   A   A   A   A   A   A   A   A   A   T   C   A   C
00002c0   G   T   T   C   G   A   A   T   G   C   T   C   T   G   G   C
00002d0   C   T   C   T   C   A   A   G   C   A   C   G   T   G   G   A
00002e0   T   T   A   T   G   C   C   A   G   A   G   T   C   G   T   A
00002f0   C   G   C   C   G   A   T   G   C   G   A   A   A   C  \n   +
0000300  \n   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000310   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
*
0000330   :   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000340   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000350   F   F   F   F   F   :   ,   F   ,   F   F   F   :   F   F   F
0000360   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000370   F   :   F   :   :   F   F   F   F   F   F   F   F   F   F   F
0000380   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F   F
0000390   F   F   F   F   F   F   :  \n
0000398

However, at this point it goes beyond my understanding, are the asterisks the cause? they are the only thing that seems off to me. I have also attached the Log.out and Log.progress.out if neccesarry.

Also of note, i ran all of this on a WindowsPowerShell that i use to login through ssh to the server i use which is ubuntu

lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.6 LTS
Release:        14.04
Codename:       trusty

I hope someone can help or give advice. Thanks in advance

P.S. changed the extension to .txt so i could upload, still opens fine. Log.out.txt Log.progress.out.txt

alexdobin commented 1 year ago

Hi Lucas,

the Read1 and Read2 files supplied to STAR have to be perfectly consistent in the order of reads. You need to find cutadapt parameters that preserves ordering of Read1/Read2. The barcode read should not be trimmed at all.

Cheers Alex

rbarbieri86 commented 1 year ago

Hello, I apologize if I overtake an older post however I have exactly the same error in a different situation. I am trying to align scRNA-seq data downloaded with wget from ArrayExpress (https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9816?query=E-MTAB-9816). One sample in particular is crashing STARsolo with a similar message:

STAR --runThreadN 8 --genomeDir ~/GencodeM29_star/ --soloType Droplet --soloCBwhitelist 737K-august-2016.txt --soloCellFilter EmptyDrops_CR --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloStrand Forward --outSAMattributes NH HI AS nM GX GN CB UB sS sQ sM NM --outSAMtype BAM SortedByCoordinate  --readFilesIn ERR4898571_2.fastq ERR4898571_1.fastq --outFileNamePrefix starsolo_out/ERR4898571/ERR4898571_
        STAR --runThreadN 8 --genomeDir /q/home/barbieri/GencodeM29_star/ --soloType Droplet --soloCBwhitelist 737K-august-2016.txt --soloCellFilter EmptyDrops_CR --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloStrand Forward --outSAMattributes NH HI AS nM GX GN CB UB sS sQ sM NM --outSAMtype BAM SortedByCoordinate --readFilesIn ERR4898571_2.fastq ERR4898571_1.fastq --outFileNamePrefix starsolo_out/ERR4898571/ERR4898571_
        STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 04 15:50:05 ..... started STAR run
Sep 04 15:50:33 ..... loading genome
Sep 04 15:50:44 ..... started mapping

EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
@ERR4898571.78693729
@ERR4898571.78665272 K03021:12603/BBCAACTAGGTAGCACGAACTATTATAC
@ERR4898571.78665271 K003021:12638/BBAACTCCCTCTGCAGTACTGGAATTTT
SOLUTION: fix your fastq file

However a quick check on the file itself, unzipped, doesn't reveal any difference in length between sequence and quality:

@ERR4898571.78693729 K00296:368:H32WYBBXY:2:1124:28777:35110/2
GGAGGCTTACTAAGTGTTCTGCCGGCCTTGTAGAGTTGGAGAGTGTTTAAATAACGTCTAGGGTCTACAGTAAACGTTTGGTAAGTTTGAGGACGGTGGGATCCTCTCCCCAAAGTGAGAACTACCAAGGCCCCCTAG
+
AA-<AJAFJFF-7FJJJJJJAFFFJJ<JFAFAF7<AJJFAFJ<AFAFFFFJJFJJJJJJJJJF7-FA7FJJFFFJFJFJJF7<FJJJJJJF-JFJFFA<FFFJJJJJJJ<AFAFF<FFFA<AFJFA7FFFA-7<FJJF

I have unzipped the two paired fastq files to exclude any gzip-related issue. Any idea on why this occurs? Should I suspect the files got somehow corrupted when I downloaded them?

Thanks in advance!

alexdobin commented 1 year ago

Hi @rbarbieri86

it is a formatting problem somewhere near the reads named @ERR4898571.78693729 @ERR4898571.78665272 @ERR4898571.78665271 You need to check both Read1 and Read2 files.

rbarbieri86 commented 1 year ago

Hi Alex, yes that is the plan. I have noticed a different error in another files from the same ArrayExpress dataset, it looks like there is something wrong there. Do you perhaps have experience with corrupted Fastq files from the ENA archive?

alexdobin commented 1 year ago

Hi @rbarbieri86

No, I do not remember encountering any specific issues with ENA.

rbarbieri86 commented 1 year ago

Hi Alex, sorry for the late reply.

I have indeed confirmed the problem was scrambled lines in File 2 (the sequenced reads). I have re-dowloaded the files from ENA using non-Linux based download (the standard Windows one) and got the correct files. I am wondering what happened there, though there may be a few reasons I can imagine, including using a VPN to connect.

Thanks for your feedback anyway.