Hi Faraz,
Hi thackl,
Many thanks for answering my query. So any merge tool will work — for example, just "cat" all the chunk files together? Or is there a specific tool I need to use?
Yes. They are just plain fastq/fasta files; you can simply concatenate them.
Much appreciated. You are a star!
Hi Thomas, one thing is driving me crazy. Please help.
So I have 4 subreads files, one per SMRT cell.
Is that it? Any help would be appreciated.
Faraz.
SeqChunker does not change the content of the file; it just cuts it into smaller pieces. If you create a single chunk the size of the entire subread file, you are just making a copy of the file. So if your SMRT cell files are >1GB in size and you are OK with chunks of that size, you don't need to chunk them at all. Just run proovread on each of the SMRT cell files, one by one.
If you want smaller files, you could run something like
SeqChunker -s 100M -o subreads_chunk-%03d.fq subreads.fq
# creates files: subreads_chunk-001.fq, subreads_chunk-002.fq, ...
and that would create several pieces of roughly 100 MB each from the input file. If you cat them together
cat subreads_chunk-*.fq > subreads_chunks-all.fq
you'd again have your original subreads.fq file. Does that make sense?
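The chunk/merge round trip above can be demonstrated with plain coreutils. This is a sketch only: it uses GNU `split` in place of SeqChunker (here `split -l 4` also keeps fastq records intact, since every record is exactly 4 lines), and the toy file and chunk names are illustrative, not proovread conventions.

```shell
# Toy two-record fastq standing in for the real subreads file
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nJJJJ\n' > subreads.fq

# Split into one-record (4-line) chunks with numeric suffixes (GNU split)
split -l 4 -d --additional-suffix=.fq subreads.fq subreads_chunk-

# Merging the chunks restores a byte-identical file
cat subreads_chunk-*.fq > subreads_rebuilt.fq
cmp subreads.fq subreads_rebuilt.fq && echo "round trip OK"
```

The point is the same as with SeqChunker: chunking is a pure cut, so concatenating the chunks in order gives back exactly the original file.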
Thanks for always replying so swiftly. My files are >5GB: 12X coverage with a 775 Mbp genome size. As per the recommendation, I wanted to create chunks, and I will follow your steps above if that will improve the results.
And that will give one single error-corrected reads file, which I will use for gap filling later.
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 25199
[main_samview] truncated file
Any idea how to solve this?
Cheers.
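The coverage figures above can be sanity-checked with a few lines of awk: total read bases divided by genome size. This is a generic sketch, not a proovread feature; the toy fastq and toy genome size below are placeholders (swap in your real subreads file and 775000000 for the 775 Mbp genome).

```shell
# Toy two-record fastq standing in for the real subreads file
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' > toy.fq

# Sum sequence-line lengths (line 2 of every 4-line fastq record)
total_bases=$(awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }' toy.fq)

genome_size=12   # toy genome size; use 775000000 for a 775 Mbp genome
awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "coverage: %.1fx\n", b / g }'
# prints "coverage: 1.0x" for the toy numbers (12 bases / 12 bp genome)
```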
cat subreads_chunks-*_smrtcell*.fq > All_smrt_cell_subreads.fq
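A quick sanity check after a merge like the one above: the line count of the merged file should equal the sum over the chunks, and every fastq record is 4 lines. The toy chunk files below are illustrative only (the naming follows the pattern used in this thread).

```shell
# Toy per-SMRT-cell chunk files (names follow the thread's pattern)
printf '@a\nAC\n+\nII\n' > subreads_chunks-001_smrtcell1.fq
printf '@b\nGT\n+\nII\n' > subreads_chunks-001_smrtcell2.fq

cat subreads_chunks-*_smrtcell*.fq > All_smrt_cell_subreads.fq

# Merged line count must equal the sum of the chunk line counts
chunk_lines=$(cat subreads_chunks-*_smrtcell*.fq | wc -l)
merged_lines=$(wc -l < All_smrt_cell_subreads.fq)
[ "$chunk_lines" -eq "$merged_lines" ] && echo "counts match: $((merged_lines / 4)) records"
```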
SeqFilter --careful < chunk.fq
That should validate the file format and/or indicate where the problem occurs.

Many thanks Thomas, I'll check!
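For reference, a minimal awk sketch (not part of proovread, and no substitute for SeqFilter) that flags fastq records whose sequence and quality strings differ in length — the usual cause of the samtools "SEQ and QUAL are of different length" error. It assumes plain 4-line fastq records; the toy file is for illustration.

```shell
# Toy fastq: first record is fine, second has a 3-char quality string
printf '@ok\nACGT\n+\nIIII\n@bad\nACGT\n+\nIII\n' > chunk_toy.fq

# Compare the length of line 2 (sequence) and line 4 (quality) per record
awk 'NR % 4 == 2 { seq = $0 }
     NR % 4 == 0 && length($0) != length(seq) {
       print "record ending at line " NR ": seq=" length(seq) " qual=" length($0)
     }' chunk_toy.fq
# flags the record ending at line 8 (seq=4, qual=3)
```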
Hi Thomas,
I am almost at the point of understanding this tool properly, and some last input from you would be highly appreciated. Sorry for all the questions I have been asking.
Faraz.
proovread --coverage 125
proovread will then internally subsample to the proper minimal coverage for each iteration. There shouldn't be a real difference in speed compared to runs with, for example, an 88X library.

proovread does high-quality correction, but that comes at some performance cost. And at genome sizes close to ~1Gbp, I know that it really can become an issue... If runtime remains a problem, you might want to have a look at LoRDEC or Jabba instead.
Many thanks Thomas. I really appreciate the kind of effort you give to help others.
Hi Thomas,
I just got the results. The nohup file is attached. Please check. If you notice, the shortest read changed from 492bp (after stubby reads removal) to 98bp (output result). Is it normal?
Also, regarding the increase in input sequences in the proovread results: is the larger number of sequences in the final output due to the fact that proovread cuts sequences into pieces (near chimeric/siameric regions)?
So I have two questions: