CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

Error: parsing fastq file #502

Closed CWYuan08 closed 2 years ago

CWYuan08 commented 2 years ago

Hi, I tried to run umi_tools extract, but I got the error U.error("parsing error: expected '@' in line %s" % line1). I checked line 1 of my file and it does start with "@". Could you please help me with this? Thank you very much!

IanSudbery commented 2 years ago

Can you post the full command you used and the first few lines of the fastq file(s)?

CWYuan08 commented 2 years ago

Thank you for the prompt reply!

I used:

umi_tools extract --bc-pattern=CCCCCCCCNNNNNNNN \
  --stdin SRR4199344_1.fastq.gz \
  --stdout SRR4199344.R1.extracted.fq.gz \
  --read2-stdout \
  --read2-in SRR4199344_2.fastq.gz \
  --filter-cell-barcode --whitelist=barcodes.fill.short.txt

and the input:

@SRR4199344.1 1 length=150
ATGCCGAAGCCCCCCCATGAAAAAAATACTTTTCTTTTTTTTTCTTTTTCTTCTTTTATGTATTTTTGTTTTTTATTCTTTTTTGTCTTTATTTGTTTATTTATATTTTCATTTTCTTTATTCTCTTATTTCTCATTTATATTTTATTTA
+SRR4199344.1 1 length=150
AAAFF-F---7-7--77--------<----<----<------<-77---7AF--7-7--------7-----7-7-------7-7------------7--77--7----77--7--7----7--7--7-----7-----------------
@SRR4199344.2 2 length=150
ACAGCAGACGGCCGCCTTATTCTTTTACAATTTTTTTTTTTTTATTAATAATCCTTGGGTTCTCCGCACAGAGGGGGATCGGGCAGGGTCAGGAGACAAGAGGGGGGGGAAGGACAGCAAAAAAAAAAGTAAACAAAGCTCTCGGGTTCA
+SRR4199344.2 2 length=150
AAFFFJJA---77-------------------------<<F-<A77-7-7---7---)7--)7-7)-<))7--<7---7))))))<-))77)7-)--<<-AF-7))---))777-<)7<7--7---7-----7--<7-----)))-7<)-

IanSudbery commented 2 years ago

There should be a line in the error message underneath U.error("parsing error: expected '@' in line %s" % line1). Here, line1 isn't referring to the first line of your input file, but to the current line being read from file 1.

You should have an output under this with the line number on it. One way this error might be caused is if you have an empty line at the end of your fastq file.

CWYuan08 commented 2 years ago

Hi, I checked the end of my file and it isn't empty. Should I search through the whole file? Many thanks.

CWYuan08 commented 2 years ago

Hi, I tried to use bbmap to remove empty reads.

It looks like there aren't any empty reads:

Input: 49833473 reads 7475020950 bases
Short Read Discards: 0 reads (0.00%) 0 bases (0.00%)
Output: 49833473 reads (100.00%) 7475020950 bases (100.00%)

Could you please advise on other ways to check this?

Thank you very much!

IanSudbery commented 2 years ago

You could try

zcat SRR4199344_1.fastq.gz | awk 'NR % 4 == 1' | grep -v "^@" | wc -l

This should show the number of reads whose name line doesn't have "@" at the start. Drop the final | wc -l to see the read names themselves.

You might also check that the output of

zcat SRR4199344_1.fastq.gz | wc -l

divides exactly by 4.
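The two checks above can also be sketched in Python, which has the advantage of reporting where a bad header sits rather than only how many there are. This is a minimal sketch, not the parser that umi_tools itself uses: the function name `check_fastq_headers` is hypothetical, and it assumes strict four-line FASTQ records (no wrapped sequence lines).

```python
import gzip


def check_fastq_headers(path):
    """Return (total_lines, bad_header_line_numbers) for a FASTQ file.

    Hypothetical helper: assumes strict four-line records, so lines
    1, 5, 9, ... must be headers starting with "@".
    """
    # Read gzip-compressed files transparently, plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    total_lines = 0
    bad_headers = []
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            total_lines = line_number
            # Same test as: awk 'NR % 4 == 1' | grep -v "^@"
            if line_number % 4 == 1 and not line.startswith("@"):
                bad_headers.append(line_number)
    return total_lines, bad_headers
```

For example, `total, bad = check_fastq_headers("SRR4199344_1.fastq.gz")` followed by checking that `total % 4 == 0` and that `bad` is empty mirrors the two shell commands. A stray empty line at the end of the file shows up as an entry in `bad`.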

CWYuan08 commented 2 years ago

Thank you. Running zcat SRR4199344_1.fastq.gz | awk 'NR % 4 == 1' | grep -v "^@" | wc -l gives 0,

and zcat SRR4199344_1.fastq.gz | wc -l gives 199333892, which is divisible by 4.

I am still not sure why there is an error.

Thank you again

IanSudbery commented 2 years ago

I have downloaded SRR4199344 and am trying to run the analysis myself. I'll let you know what answer I come to.

IanSudbery commented 2 years ago

Seems to run fine for me, but I was doing it without a whitelist. What options were you using for creating the whitelist?

CWYuan08 commented 2 years ago

Thank you very much!

I have attached my whitelist. Do you mind sharing your command with me? barcodes.fill.short.txt

Best, CW

IanSudbery commented 2 years ago

Okay, I just ran with your whitelist, and it worked fine without any error. Is it possible that your input files are corrupted in some way?

The MD5 of the files I'm using are:

$ md5sum SRR4199344*gz
9115da2f11b8e9347b74c3862f48ccf0  SRR4199344_1.fastq.gz
964dbcaf93cb129832aae250627a033a  SRR4199344_2.fastq.gz

I downloaded these from ENA.
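If md5sum is not available, the same comparison can be sketched in Python with the standard hashlib module. The helper names `md5_of_file` and `find_mismatches` are hypothetical; the expected checksums are the two values quoted above, and the sketch assumes the files sit in the directory passed in.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums quoted in this thread for the files downloaded from ENA.
EXPECTED_MD5 = {
    "SRR4199344_1.fastq.gz": "9115da2f11b8e9347b74c3862f48ccf0",
    "SRR4199344_2.fastq.gz": "964dbcaf93cb129832aae250627a033a",
}


def find_mismatches(expected, directory="."):
    """Return the file names whose MD5 differs from the expected value."""
    return [
        name
        for name, checksum in expected.items()
        if md5_of_file(os.path.join(directory, name)) != checksum
    ]
```

Calling `find_mismatches(EXPECTED_MD5)` in the download directory returns an empty list when both files match, and the names of any files that likely need to be downloaded again otherwise.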

BTW the command I used was:

umi_tools extract --bc-pattern=CCCCCCCCNNNNNNNN --stdin SRR4199344_1.fastq.gz --stdout SRR4199344.R1.extracted.fq.gz --read2-stdout --read2-in SRR4199344_2.fastq.gz --filter-cell-barcode --whitelist=barcodes.fill.short.txt

CWYuan08 commented 2 years ago

Dear Ian,

Many thanks. I re-downloaded my input files, and this time the command ran with no errors. The log reported INFO Input Reads: 49833473 and INFO Filtered cell barcode: 49833473, but my SRR4199344.R1.extracted.fq.gz looks empty. Do you know why this is? Should I drop the whitelist?

Thank you again and happy new year!

Best, CW

IanSudbery commented 2 years ago

Sorry, I never looked inside your barcode whitelist. Where did you find it? It's not formatted correctly. You can see the format here: https://umi-tools.readthedocs.io/en/latest/reference/whitelist.html. Everything after the first column is optional. Basically, the whitelisted barcodes need to be in the first column.

CWYuan08 commented 2 years ago

Dear Ian,

Thank you very much! Do I have to run the whitelist command to generate the file? I know the barcodes, so could I just have one column (the barcodes) and then run the previous command again?

Many thanks CW

IanSudbery commented 2 years ago

Yes, you can just have a 1 column file with the barcodes and run the extract command again.
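Before rerunning extract, a quick sanity check of the whitelist file can catch this kind of formatting problem early. The sketch below is an assumption-laden helper, not part of umi_tools: the name `check_whitelist` is hypothetical, it assumes columns are tab-separated as described in the linked documentation, it assumes barcodes use only the letters A, C, G, T and N, and it takes the expected length of 8 from the eight C characters in --bc-pattern=CCCCCCCCNNNNNNNN.

```python
import re

# Assumption: cell barcodes contain only these nucleotide letters.
BARCODE_RE = re.compile(r"[ACGTN]+")


def check_whitelist(path, barcode_length=8):
    """Return (line_number, first_column) pairs that do not look like barcodes.

    Hypothetical helper: only the first tab-separated column is inspected,
    since later columns are optional.
    """
    problems = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            first_column = line.split("\t")[0]
            wrong_length = len(first_column) != barcode_length
            wrong_letters = BARCODE_RE.fullmatch(first_column) is None
            if wrong_length or wrong_letters:
                problems.append((line_number, first_column))
    return problems
```

An empty list from `check_whitelist("barcodes.fill.short.txt")` suggests the first column holds plausible 8-base barcodes. Any entries returned point at lines where something other than a barcode, such as an index or a space-separated field, sits in the first column.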