benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Couldn't automatically detect the sequence identifier field in the fastq id string. #701

Closed: Sebastian-Mynott closed this issue 5 years ago

Sebastian-Mynott commented 5 years ago

Hi,

I'm looking at sequence data downloaded from the NCBI SRA database. When running filterAndTrim I get the following error:

Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  : 
  Couldn't automatically detect the sequence identifier field in the fastq id string.

After looking at the source code I tried inserting a dummy identifier, so instead of the identifier reading @SRR9876543.1 1/1, it would read @M012345:SRR9876543.1 1/1, but this didn't work.

Could you suggest how I can get around this?

Many thanks.

benjjneb commented 5 years ago

What is the output of head -n4 mysrr_file.fastq (in the shell)?

What command did you use to convert from SRA format to fastq, i.e. which fastq-dump arguments?

Sebastian-Mynott commented 5 years ago

Aha! I downloaded the files using the SRAdb package, with getSRAfile(SRAccessions, sra_con, fileType = 'fastq'), which gave me a list of .fastq.gz files, so I didn't think I'd need fastq-dump.

The output of head -n4 mysrr_file.fastq gives me this:

@SRR7758019.1 1/1
GCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCTGAGGACGACCGGTCCGCCCTCTNNNNNNNNNTNNNNCTCGGCNTTGGCATCTTCTTGGGGAACGTNANTGCACTTGACTGTGTGGTGCGGTATCCAGGACTTTTACTTTGAGGNNNNNNNNGTGNNNCAANCNGGCTTACGCCTTGAATACATTAGCATGGAATAATAAGATAGGACCTTGGTTCTATTTNNTTGGNNNNNNNNGCTGAGGTNATGATTACTAGGGATAG
+
CCCCCGGGGGGGGGEGGFGGGGGGGGGGGGGGFGFGGGGFGGGGGGGGGGGGFGDFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#########:####::DFGG#:BFGGGGGGGGGGGGGGGFGGG#:#:BFFGGGGGGFGGGGFGGGGGGGGGGGGGG7FGGGGGGGGGGFGG########56>###66=#6#6*;CFCGFGGGGGGFFGGGGGGGGDFG0776CAF7FF?7+??FGG6CC?C5D?GGGG##228*########0--1<CG4#--(4;A>4-5=FF**9*

Do I need to download the files again as SRA then convert to fastq?

benjjneb commented 5 years ago

> Do I need to download the files again as SRA then convert to fastq?

I would at least try that on one file to see if that fixes this issue.

kelseysumner commented 5 years ago

Hi, I wanted to re-open this because I am having a similar issue. I'm using paired-end sequence data sequenced on Illumina MiSeq and also downloaded from the NCBI SRA database. I downloaded the files originally as SRA files and then converted them to zipped fastq files (fastq.gz) using fastq-dump with a flag to make sure each sample had separate files for the forward and reverse reads.

I'm getting the same error when I run DADA2 on these files:

Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, :
  Couldn't automatically detect the sequence identifier field in the fastq id string.
Calls: filterAndTrim ... mclapply -> lapply -> FUN -> .mapply ->
Execution halted

The head of one of the fastq files I'm reading into DADA2 looks like this:

@SRR1191781.12854 12854 length=250
TTATTAATCCTATTGAACTATTTACGACATTAAACACACTGGAACATTTTTCCATTTTACAAATTTTTTTTTCAATATCATTTGCATAATCTAATTGGTCTTTAGGTTTATTAGCAGAGCCAGGTTTTATTCTAACTTGAATACCATTTCCACAAGTTACACTACATGGGGACCATTCAGTTGAAAGAGAATTTTGTATTGTCTTTAAATATTTTTCTATGTGCT
+
HHHHHHHHHHHFHHHHHHHHHHHHHGGFEFFHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH5FGHHHGG>EGHHHHHHHHHHHGHBHHFHHHGDGHHHHHGGHHHHGHHFHHHGHFBFEGHFHH2BFGHGGHHHHHHHGGHHHHHHHHHHHGHHHHG1GHFHHGHHHHHEGGGGHHHHHGGHFHHBGGBCGHHHFHGGHGHFFHHHHHHGHHHHFGGGGGGFFGF

Do you know what might be going on and how I could fix this issue?

benjjneb commented 5 years ago

This error is because the original fastq id lines have been replaced by these SRA id lines, which filterAndTrim(..., matchIDs=TRUE) doesn't recognize.

Do you need to use the matchIDs=TRUE flag? If you don't, just remove it and everything should work fine.
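For anyone who wants to check which kind of ID line a file carries before deciding on matchIDs, here is a rough shell sketch (not part of dada2). It prints each whitespace-separated field of the first header line together with the number of colons in that field. Illumina-style IDs are colon-delimited (instrument:run:flowcell:lane:tile:x:y in CASAVA 1.8+), whereas the SRA-style lines shown above contain no colons at all. The function name id_fields is made up for illustration.

```shell
# id_fields FILE
# Print "<field number> <colon count> <field text>" for every
# whitespace-separated field of the first fastq header line.
# Reads stdin when FILE is omitted or "-", e.g.
#   gzip -dc reads.fastq.gz | id_fields -
id_fields() {
  awk 'NR == 1 {
         sub(/^@/, "")               # drop the leading "@" of the header
         for (i = 1; i <= NF; i++) {
           n = gsub(/:/, ":", $i)    # gsub returns how many colons it matched
           print i, n, $i
         }
         exit
       }' "${1:--}"
}
```

On the header @SRR1191781.12854 12854 length=250 this prints three lines, "1 0 SRR1191781.12854", "2 0 12854" and "3 0 length=250", i.e. no field looks like an Illumina ID.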

kelseysumner commented 5 years ago

Thank you for the quick reply. It looks like that solved the issue!

d-callan commented 5 years ago

I'm having a similar problem with the SRA id lines, except I do require the matchIDs = TRUE flag. What then?

benjjneb commented 5 years ago

@d-callan Unfortunately I'm not sure there is a solution in that case. The original IDs are required to match the paired reads together if they are now in different orders.

d-callan commented 5 years ago

Thanks anyhow. I'm not convinced they are truly ordered differently, but I'm finding that the forward and reverse files definitely have different read counts. Perhaps I can quickly put together a script to remove the reads that don't have a partner before passing them to dada2 and see where that gets me. I was mostly just hoping I might not have to.
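A minimal sketch of the kind of script described here, under several assumptions: the fastq files are uncompressed with exactly four lines per record, a read's ID is the first whitespace-separated token of its header and is identical in the forward and reverse files, and the surviving reads are already in the same relative order in both files. The function name keep_paired and the file names in the usage line are hypothetical. It only drops reads without a partner; it does not re-sort anything.

```shell
# keep_paired FWD REV OUT_FWD OUT_REV
# Write to OUT_FWD / OUT_REV only those records whose read ID
# (first token of the header line) occurs in both FWD and REV.
keep_paired() {
  # Pass 1 (first file): remember every header ID.
  # Pass 2 (second file): print a 4-line record only if its ID was seen.
  _kp_prog='NR == FNR { if (FNR % 4 == 1) ids[$1] = 1; next }
            FNR % 4 == 1 { keep = ($1 in ids) }
            keep'
  awk "$_kp_prog" "$2" "$1" > "$3"   # forward reads that have a reverse partner
  awk "$_kp_prog" "$1" "$2" > "$4"   # reverse reads that have a forward partner
}

# Example call (file names are placeholders):
#   keep_paired sample_1.fastq sample_2.fastq sample_1.paired.fastq sample_2.paired.fastq
```

After this the two output files contain the same number of reads, so it is worth comparing the header lines of both outputs side by side before running filterAndTrim without matchIDs.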

dbro970 commented 1 year ago

> thanks anyhow. I'm not convinced they are truly ordered differently. but I'm finding there are definitely differing number of read counts for forward and reverse. perhaps I can put together a script quickly to remove those reads which don't have a partner before passing to dada2 and see where that gets me. was mostly just hoping I might not have to..

Hi, apologies for resurrecting an old thread. I was just wondering whether you managed to find a solution to this, as I've found myself in the same situation.

wygbio commented 10 months ago

I am also running into a similar issue with the header lines of my fastq files. They were obtained from an Illumina MiSeq, not downloaded from the NCBI SRA database. Is there something wrong with the header that prevents it from being detected? @HWI-D00433:728:HHHKHBCX2:2:1101:8032:2352.1:N:0--D13a_C25.