dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Multiple Step 2 errors #423

Closed aveeva closed 3 years ago

aveeva commented 3 years ago

Hello,

after executing Step 2, ipyrad returned errors for 31 samples. The problems are different. I have no idea how to fix any of them, as I am fairly new to this. I would appreciate your help very very much!

I am working with 3 plates of samples. I demultiplexed plates 1 and 3 in ipyrad, while plate 2 was demultiplexed in Stacks (due to a corruption in the file which made Step 1 fail in ipyrad) and then imported into ipyrad as pre-demultiplexed sequences. I merged samples of all three plates together and ran Step 2.

The error log reports 31 failed samples with different error messages. I'll name them to avoid confusion.

Error 1: quality sequence length and read length do not match (1 sample, plate 2)

Error 2: Line 3 in FASTQ file is expected to start with '+', but found [...] (1 sample, plate 1)

Error 3: Line 1 in FASTQ file is expected to start with '@', but found [...] (29 samples, plate 1)

I noticed that samples from different plates behaved very differently during step 2. All samples from plate 3 passed Step 2 with a small percentage of reads filtered out. All samples from plate 2 passed Step 2 with ALL reads passing the filter, except for the one sample that got Error 1. Finally, for plate 1, some samples passed Step 2 with minor filtering, but all the rest failed due to Error 2 and Error 3.

I checked the first few lines of the problematic samples and they all have the '@' and '+' in the appropriate lines. Considering that all but one of the "faulty" samples came from the same plate, I can't help but think this is somehow connected to demultiplexing. Could this be the case? Additionally, it may help to know I currently have the [trim_reads] parameter set to 0,0,0,0. Is this even correct in my case?

Thank you in advance!

isaacovercast commented 3 years ago

Hello there. These all look like data formatting issues.

Error 1: This is due to a malformed fastq file. It was probably truncated, and this is probably related to the corruption that made step 1 fail. If the raw data for plate 2 is corrupted and Stacks happily demultiplexes it anyway, that is bad, as you will only see this error cascade forward. I would try to download a clean, uncorrupted copy of the plate 2 data and demux it in ipyrad.

Error 2: Is there more information about what was found besides "[...]"? This is also some kind of file corruption. Any more detailed information would be useful.

Error 3: These could also be corrupted. Looking at the first few lines is a good idea, but often isn't enough because the corruption can happen anywhere, but typically is at the end. My suspicion is that you are out of disk space and that these files got truncated. Can you verify that you have plenty of disk space?

If you want to WeTransfer me one of the corrupt samples from Error 2 and one from Error 3, I will verify this.
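Since looking at the first few lines is not enough, one way to check a whole file is to walk it four lines at a time and report the first record that breaks the FASTQ layout. The following is a minimal sketch in Python, not part of ipyrad or cutadapt. The function names `open_fastq` and `first_fastq_error` are made up for illustration, and the sketch assumes standard four-line FASTQ records (no sequences wrapped across multiple lines).

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def first_fastq_error(path):
    """Return (record_number, message) for the first bad record, else None.

    Hypothetical helper: assumes each record is exactly four lines.
    """
    record = 0
    try:
        with open_fastq(path) as handle:
            while True:
                header = handle.readline()
                if not header:
                    return None  # reached end of file without problems
                if not header.strip():
                    continue  # tolerate blank lines between records
                record += 1
                seq = handle.readline().rstrip("\r\n")
                plus = handle.readline()
                qual = handle.readline().rstrip("\r\n")
                if not header.startswith("@"):
                    return record, f"line 1 should start with '@', found {header[:10]!r}"
                if not plus.startswith("+"):
                    return record, f"line 3 should start with '+', found {plus[:10]!r}"
                if len(seq) != len(qual):
                    return record, "sequence and quality lengths differ"
    except EOFError:
        # gzip raises EOFError when the compressed stream stops before its end marker
        return record, "compressed stream ended early (file is probably truncated)"
```

Calling `first_fastq_error("sample.fastq.gz")` on each failed sample (the file name here is a placeholder) returns `None` for a well-formed file, or the number of the first broken record, which shows whether the damage sits at the end of the file or somewhere in the middle.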

aveeva commented 3 years ago

Thank you very much for the quick reply, and apologies for mine being late. The reason is that I am waiting for the lab that originally had these plates sequenced (and has the raw data) to check what the problem might be. I warned them about the sample that you suggest is truncated and suggested repeating the demultiplexing. As for the disk space, I am running the analyses on a university server, so I imagine there must be enough space, but I am not entirely sure and will double check.

Regarding the information I replaced with [...], here are a few examples of the full error messages for Errors 2 and 3:

cutadapt: error: Line 1 in FASTQ file is expected to start with '@', but found 'FFF4:AAG\n'

cutadapt: error: Line 1 in FASTQ file is expected to start with '@', but found 'F2CTFFF8F8'

cutadapt: error: Line 3 in FASTQ file is expected to start with '+', but found 'CFFFFF:FAT'

Hope this helps. I'll be sure to send you the samples if the lab doesn't manage to resolve it.

isaacovercast commented 3 years ago

The server may have space, but you may have disk quotas. I still believe these are truncated or malformed fastq files.
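As a rough first check of the space question, the free space on the filesystem holding the project directory can be read with Python's standard library. This is only a sketch: `free_space_gib` is a hypothetical helper name, and the number it reports is for the whole filesystem, so it does not reflect per-user quotas. Those have to be checked with whatever quota tool the server's administrators provide.

```python
import shutil


def free_space_gib(path="."):
    """Return free space, in GiB, on the filesystem that contains `path`.

    Note: this is filesystem-wide free space, not a per-user quota.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024**3


print(f"Free space here: {free_space_gib('.'):.1f} GiB")
```

If this number is large but files still come out truncated, a quota on the account is the more likely limit.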

isaacovercast commented 3 years ago

I assume this is resolved since it's been a while.