Open LJK1991 opened 1 year ago
Hi Lucas,
the Read1 and Read2 files supplied to STAR have to be perfectly consistent in the order of reads. You need to find cutadapt parameters that preserve the ordering of Read1/Read2. The barcode read should not be trimmed at all.
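A paired-end cutadapt call that keeps the two files synchronized would look roughly like the sketch below; the adapter sequence and file names are placeholders, not parameters for this dataset. With no adapter given for the second input, the barcode read is passed through untrimmed, but whenever a pair is discarded both mates are removed, so the ordering stays consistent:
# Trim only the cDNA read; the barcode read is passed through unmodified but
# filtered in lockstep with its mate, so Read1/Read2 ordering is preserved.
# The adapter sequence and file names are placeholders.
cutadapt -j 4 -m 20 -a AGATCGGAAGAG -o cDNA.trimmed.fastq.gz -p barcode.synced.fastq.gz cDNA.raw.fastq.gz barcode.raw.fastq.gz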
Cheers Alex
Hello, I apologize for reviving an older post, but I have exactly the same error in a different situation. I am trying to align scRNA-seq data downloaded with wget from ArrayExpress (https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9816?query=E-MTAB-9816). One sample in particular is crashing STARsolo with a similar message:
STAR --runThreadN 8 --genomeDir ~/GencodeM29_star/ --soloType Droplet --soloCBwhitelist 737K-august-2016.txt --soloCellFilter EmptyDrops_CR --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloStrand Forward --outSAMattributes NH HI AS nM GX GN CB UB sS sQ sM NM --outSAMtype BAM SortedByCoordinate --readFilesIn ERR4898571_2.fastq ERR4898571_1.fastq --outFileNamePrefix starsolo_out/ERR4898571/ERR4898571_
STAR version: 2.7.10a compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 04 15:50:05 ..... started STAR run
Sep 04 15:50:33 ..... loading genome
Sep 04 15:50:44 ..... started mapping
EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
@ERR4898571.78693729
@ERR4898571.78665272 K03021:12603/BBCAACTAGGTAGCACGAACTATTATAC
@ERR4898571.78665271 K003021:12638/BBAACTCCCTCTGCAGTACTGGAATTTT
SOLUTION: fix your fastq file
However, a quick check on the file itself, unzipped, doesn't reveal any difference in length between the sequence and the quality string:
@ERR4898571.78693729 K00296:368:H32WYBBXY:2:1124:28777:35110/2
GGAGGCTTACTAAGTGTTCTGCCGGCCTTGTAGAGTTGGAGAGTGTTTAAATAACGTCTAGGGTCTACAGTAAACGTTTGGTAAGTTTGAGGACGGTGGGATCCTCTCCCCAAAGTGAGAACTACCAAGGCCCCCTAG
+
AA-<AJAFJFF-7FJJJJJJAFFFJJ<JFAFAF7<AJJFAFJ<AFAFFFFJJFJJJJJJJJJF7-FA7FJJFFFJFJFJJF7<FJJJJJJF-JFJFFA<FFFJJJJJJJ<AFAFF<FFFA<AFJFA7FFFA-7<FJJF
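For what it's worth, rather than eyeballing individual records, the whole file can be scanned for records where the two lengths differ, assuming plain 4-line FASTQ records:
# Print the line number and header of every record whose quality-string length
# differs from its sequence length (assumes unwrapped 4-line records).
awk 'NR%4==1{h=$0} NR%4==2{s=length($0)} NR%4==0 && length($0)!=s {print NR ": " h}' ERR4898571_2.fastq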
I have unzipped the two paired fastq files to exclude any gzip-related issue. Any idea why this occurs? Should I suspect that the files somehow got corrupted when I downloaded them?
Thanks in advance!
Hi @rbarbieri86
it is a formatting problem somewhere near the reads named @ERR4898571.78693729, @ERR4898571.78665272, and @ERR4898571.78665271. You need to check both the Read1 and Read2 files.
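One way to localize where the two files diverge, assuming they should contain the same reads in the same order, is to compare the header lines record by record, for example:
# Compare read names record by record between the two files (bash process
# substitution); the first lines of output show where the ordering breaks.
diff <(awk 'NR%4==1{print $1}' ERR4898571_1.fastq) <(awk 'NR%4==1{print $1}' ERR4898571_2.fastq) | head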
Hi Alex, yes, that is the plan. I have noticed a different error in other files from the same ArrayExpress dataset, so it looks like something is wrong there. Do you perhaps have experience with corrupted fastq files from the ENA archive?
Hi @rbarbieri86
No, I do not remember encountering any specific issues with ENA.
Hi Alex, sorry for the late reply.
I have indeed confirmed that the problem was scrambled lines in file 2 (the sequenced reads). I re-downloaded the files from ENA using a non-Linux download (the standard Windows one) and got correct files. I am still wondering what happened there; I can imagine a few possible reasons, including the VPN I was using to connect.
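For future reference, whether a download got truncated or corrupted can be checked by comparing local checksums against the md5 values that ENA publishes for each run's fastq files, e.g.:
# Compute local checksums and compare them against the values listed in the
# ENA file report for ERR4898571.
md5sum ERR4898571_1.fastq.gz ERR4898571_2.fastq.gz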
Thanks for your feedback anyway.
Hello,
I am currently trying to replicate data from this paper using STARsolo. In short, they perform a SPLiT-seq experiment on saltwater worms. I downloaded their data through SRA tools from their GEO entry and downloaded all the other required components, such as the genome and GTF files. I then replicated their read trimming with cutadapt, as described in their paper. For cDNA/read1:
cutadapt -j 4 -m 60 -q 10 -b AGATCGGAAGAG
and for cellBC/read2:
cutadapt -j 4 -m 94 --trim-n -q 10 -b CTGTCTCTTATA
I then concatenated all read1.fastq.gz files together and did the same for the read2.fastq.gz files, creating two files: one containing all read1 files and the second containing all read2 files, e.g.
zcat file1.fastq.gz file2.fastq.gz file3.fastq.gz > total.fastq.gz
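(Note that zcat writes uncompressed data, so redirecting its output into a file named total.fastq.gz yields a plain, uncompressed file despite the .gz name. If a compressed concatenation is intended, gzip members can simply be concatenated:)
# gzip streams concatenate into a valid multi-member gzip file, so no
# decompression/recompression round trip is needed:
cat file1.fastq.gz file2.fastq.gz file3.fastq.gz > total.fastq.gz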
I then ran the following STARsolo command:
I was confused, and to check what is going on I ran
This shows that the fastq file is OK. When I run
zcat ACME_read1.fastq.gz | wc -l
I get 3812742300, which is divisible by 4, showing that I don't have incomplete reads. I noticed that the error output gives a read ID I can find, but the sequence shown I cannot (in the output of the previous zgrep). I then ran
zgrep -B4 -A4 "CGATGTGGTTGATGAATGCATGAAA" ACME_read1.fastq.gz
which turned up nothing. I also checked the read2 file, which has significantly fewer reads (~180 million) than read1. Moreover, it does not contain the '@SRR11768232.393468811' read ID. However, this is most likely because the cutadapt settings differ between the two reads. If I understand STARsolo correctly, it does not perform mapping with the second input fastq file; it assumes that file contains the cell barcodes and treats it differently, so it should not be the cause of this problem. Please correct me if I am wrong.
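A quick way to compare the record counts of the two files, assuming plain 4-line records, would be:
# Count records (lines/4) in each concatenated file; the totals have to match
# for Read1 and Read2 to still be properly paired.
for f in ACME_read1.fastq.gz ACME_read2.fastq.gz; do
  printf '%s: ' "$f"
  zcat "$f" | awk 'END{print NR/4 " reads"}'
done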
When doing some googling and reading forums, including some issues on this GitHub, I found that there are many programs that can mess up a fastq file. A post suggested that zcat, which I used, could leave a "FILE #" stamp for each file. I ran
zgrep -B4 -A4 -i "file" ACME_read1.fastq.gz
which returned nothing. Finally, I saw in this and this issue a suggestion to check the file with hexdump -c, so I ran
However, at this point it goes beyond my understanding. Are the asterisks the cause? They are the only thing that seems off to me. I have also attached the Log.out and Log.progress.out if necessary.
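A coarser structural check than hexdump, assuming plain 4-line records, would be something like:
# Report the first record whose header does not start with '@' or whose
# separator line does not start with '+' (assumes unwrapped 4-line records).
zcat ACME_read1.fastq.gz | awk 'NR%4==1 && substr($0,1,1)!="@" {print "bad header at line " NR ": " $0; exit} NR%4==3 && substr($0,1,1)!="+" {print "bad separator at line " NR ": " $0; exit}'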
Also of note, I ran all of this in a Windows PowerShell session that I use to log in over SSH to the server, which runs Ubuntu.
I hope someone can help or give advice. Thanks in advance
P.S. I changed the extension to .txt so I could upload the logs; they still open fine. Log.out.txt Log.progress.out.txt