BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License

[error] "Quality trimming and siamaera filtering raw output" step #90

Closed migrau closed 7 years ago

migrau commented 7 years ago

Hi thackl,

I am trying to correct PacBio data (~20x, ~130,000 reads). I split the file with SeqChunker, obtaining 55 files of 2,500 reads each. For each file, I am using an Illumina fastq file with ~100x coverage and ~130 million reads.

All runs finished, but some of them (about half) end with a warning/error in the "Quality trimming and siamaera filtering raw output" step:

    BLAST engine error: Warning: Sequence contains no data
    [siamaera] Blast exited with error: 768

In these cases, the output *.trimmed.fq contains fewer reads than the uncorrected fastq (half or less).

I don't understand why it only happens sometimes, since the other runs seem to finish correctly. Is this normal?

I am using proovread-2.13.13 under Perl 5.10.1, with ncbi-blast-2.2.29 and samtools 1.3.1:

    proovread -l file1.fq -s illumina.fastq --pre Ar2/file1 -t 2 --coverage=100
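For context, the overall chunk-and-correct workflow was roughly along these lines. This is only a sketch: the SeqChunker size flag and output pattern are taken from the proovread README (I actually split by read count), and the file names are placeholders.

    # split the PacBio reads into chunks (size-based split as in the proovread README;
    # check `SeqChunker --help` for the exact options)
    SeqChunker -s 20M -o pb-%03d.fq pacbio-subreads.fq

    # one proovread run per chunk, each against the same Illumina read set
    for chunk in pb-*.fq; do
        proovread -l "$chunk" -s illumina.fastq --pre "Ar2/${chunk%.fq}" -t 2 --coverage=100
    done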

thackl commented 7 years ago

Seems like this is a recent problem, since Celine also ran into it yesterday (#91). So let's hope we can get to the bottom of this quickly. It would be ideal if you could share one of the failed chunks (the .untrimmed.fq and the .chim.tsv file) so I can try to reproduce the error locally. Is that possible?

CelineReisser commented 7 years ago

Hey Thomas, I created a repository with all the required info and files you asked, and sent you an invite link.

Let me know if you need anything else!

Cheers

Céline

migrau commented 7 years ago

Hi thackl, Céline,

In my case, the problem was solved by re-running the tasks. I was wondering whether there is a maximum number of jobs that can run at the same time. My original run sent ~200-400 proovread jobs to the cluster, with 4 threads each (I have multiple PacBio data sets to correct). After getting the error with half of them, I re-ran them in groups of 5 jobs with 24 threads each. In that case I didn't get any error; all jobs finished correctly.
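For what it's worth, a minimal sketch of the batched re-run (this uses xargs -P purely as an illustration of capping concurrency at 5; on the cluster I used the scheduler's own batching, and the file names are placeholders):

    # run at most 5 proovread jobs at a time, 24 threads each
    ls pb-*.fq | xargs -P 5 -I {} \
        proovread -l {} -s illumina.fastq --pre "corr/{}" -t 24 --coverage=100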

Regards,

Miquel

CelineReisser commented 7 years ago

On my side the problem is still there, and I ran only one job, so it cannot be attributed to a maximum number of jobs... Have a good day. C.

thackl commented 7 years ago

@migrau that is an interesting observation - though I have no idea what could cause this behaviour. Otherwise I'm happy to hear that it now works for you.

@cmor2207 thanks a lot for the data, and I'm really sorry for the delay in replying. I am currently traveling, with limited access to wifi etc. I will be back home on Thursday and will have a look at your data first thing.

Cheers T.

CelineReisser commented 7 years ago

Hey Thomas,

Just checking in to see whether you had a chance to look at the files. I reran the pipeline on smaller PacBio files (40 MB instead of 1 GB) and it seems to run to the end without any problem... This info may help you.

Cheers

Céline

thackl commented 7 years ago

Hi Céline,

yes, I did run your data, but unfortunately I couldn't reproduce the error. I tried a few things (changed machines/environments), but that did not help. So right now I'm a bit puzzled, especially since smaller chunks seem to work for you. Quite frankly, I'm not really sure how best to proceed. One option would be for you to try an intermediate chunk size, e.g. 300 MB, and see whether that works and would also be practical for your entire data set.
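E.g. something like the following, assuming the size-based splitting from the proovread README (the input file name is a placeholder):

    SeqChunker -s 300M -o pb-%02d.fq pacbio-subreads.fq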

Cheers Thomas

CelineReisser commented 7 years ago

Hey Thomas, yes, that is what I am currently doing. I reran the pipeline with 100 MB chunks and it went through smoothly. I started a 400 MB run and should have the results by tomorrow. I will resubmit the 1 GB run today. I will let you know how it goes, but this is odd indeed.

CelineReisser commented 7 years ago

Hey Thomas,

So I came across the error again, but it looks like it is random and only happening in some files. I now have the untrimmed file at a correct size (833 MB), while the "trimmed" file comes up short at 267 MB, even though the log file tells me the percentage of masked reads in the sr-finish step was 84%.

The problem seems to happen with SeqFilter, and the error (same as above) occurs at 291 MB into the file. So my question is: since I have the ".chim" file and the ".untrimmed" file, could I generate the correct ".trimmed" file by calling SeqFilter myself? And what would be the procedure?

Have a good day. Cheers

Céline

CelineReisser commented 7 years ago

Never mind ;) I found the SeqFilter command line in the log file, ran it using the output from the proovread run (untrimmed and chim), and it works perfectly!
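For anyone hitting the same thing: the call printed in the log is essentially of the form below. This is only a sketch; the file names are placeholders and the --out flag is an assumption on my part, so copy the exact command (and any extra options) from your own proovread log.

    /path/to/proovread/bin/SeqFilter \
        --in chunk.untrimmed.fq \
        --substr-file chunk.chim.tsv \
        --out chunk.trimmed.fq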

Cheers.

C

thackl commented 7 years ago

Hi Céline,

sorry, your first comment somehow slipped through the cracks. Glad to hear that you found a solution!

Hulanyue commented 4 years ago

Hi @cmor2207, I also have this problem:

    [23:54:45] /home/workstation/biosoft/proovread/bin/SeqFilter-1.06
    [23:54:45] --in: pb-2/pb-2.untrimmed.fq
    [23:54:45] Detected FASTQ format, phred-offset 33
    [23:54:45] --substr-file: pb-2/pb-2.chim.tsv
    0 [ ]
    BLAST engine error: Warning: Sequence contains no data
    [19-11-17 23:54:47] [siamaera] Blast exited with error: 768

Can you tell me the details about your solution? Thanks!