BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License

Chunking and last output #113

Closed: faraz89 closed this issue 6 years ago

faraz89 commented 6 years ago

So I have two questions:

  1. If my genome is >500 Mbp and I have plenty of computational power, can I just run proovread without chunking the PacBio long reads using SeqChunker?
  2. In cases where I am chunking the PacBio reads, do I need to merge all the chunk files after they are trimmed to create one single "error-corrected PacBio reads" file?
thackl commented 6 years ago

Hi Faraz,

  1. Computationally, that should be no problem. But make sure not to have more than ~1X genome coverage per chunk, i.e. if you have 5 Gbp of PacBio data but a 500 Mbp genome, you should create chunks (https://github.com/BioInf-Wuerzburg/proovread#getting-chunk-sizes-right-48); a rough sketch follows after this list.
  2. yes, exactly
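
To make the ~1X-per-chunk rule concrete, here is a rough back-of-the-envelope sketch in bash; the 2-bytes-per-bp fastq approximation and the numbers are illustrative assumptions, not proovread defaults:

# ~1X coverage per chunk means: chunk size (bp) <= genome size (bp).
# Plain fastq is very roughly 2 bytes per bp (sequence + quality + headers).
GENOME_BP=500000000                  # 500 Mbp genome
CHUNK_BYTES=$(( GENOME_BP * 2 ))     # ~1 GB of fastq per chunk, at most
echo "aim for chunks no larger than ${CHUNK_BYTES} bytes (e.g. pass -s 1G to SeqChunker)"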
faraz89 commented 6 years ago

Hi thackl,

Many thanks for answering my query. So will any merge tool work, for example just "cat"-ing all the chunk files together? Or is there some specific tool I need to use?

thackl commented 6 years ago

Yes. They're just plain fastq/fasta files; you can simply concatenate them.
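
For example, assuming the trimmed per-chunk outputs end up named something like pb-chunk-XXX.trimmed.fq (the names here are only illustrative, adjust them to your actual output files):

cat pb-chunk-*.trimmed.fq > pb-corrected.trimmed.fq   # plain concatenation is all that is needed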


faraz89 commented 6 years ago

Much appreciated. You are a star!

faraz89 commented 6 years ago

Hi Thomas, one thing is driving me crazy. Please help.

So I have 4 subread files, 1 per SMRT cell.

  1. Do I need to create one single chunk per subread file, e.g. SeqChunker -s 1G -l 1 subreads.fastq > subreads_chunk.fq? This step would be done for all 4 subread files, giving me 4 chunk files, one per SMRT cell.
  2. And then run proovread on all four chunks, one by one?
  3. Once I have the 4 error-corrected read files, I can just cat them together, right?

Is that it? Any help would be appreciated.

Faraz.

thackl commented 6 years ago

SeqChunker does not change the content of the file, it just cuts it into smaller pieces. If you create a single chunk the size of the entire subread file, you are just making a copy of the file. So if your SMRT cell files are >1 GB in size and you are OK with chunks of that size, you don't need to chunk them at all. Just run proovread on each of the SMRT cell files, one by one.

If you want smaller files, you would run something like

SeqChunker -s 100M -o subreads_chunk-%03.fq subreads.fq
# creates files: subreads_chunk-001.fq, subreads_chunk-002.fq, ...

and that would create several 100 MB pieces from the input file. If you cat them together

cat subreads_chunk-*.fq > subreads_chunks-all.fq

you'd again have your original subreads.fq file. Does that make sense?
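
Putting the pieces together, a per-file loop could look roughly like the sketch below. The proovread options shown (-l for long reads, -s for short reads, --pre for the output prefix) follow the README, but verify them against proovread --help; the file names are only placeholders:

# Correct each SMRT cell file separately, then merge the trimmed outputs.
for f in smrtcell1.fq smrtcell2.fq smrtcell3.fq smrtcell4.fq; do
    proovread -l "$f" -s illumina_reads.fq --pre "${f%.fq}_corrected"
done
# Adjust the glob to wherever proovread writes its *.trimmed.fq output on your system:
cat *_corrected*.trimmed.fq > all_smrtcells.trimmed.fq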

faraz89 commented 6 years ago

Thanks for always replying so swiftly. My files are >5 GB: 12X coverage with a 775 Mbp genome size. As per the recommendation, I wanted to create chunks, and I would follow your steps above if that will improve the results.

  1. So once I cat all the chunks, I can also merge the subreads from the different SMRT cells? For example, cat subreads_all_chunks_smrtcell1.fq subreads_all_chunks_smrtcell2.fq subreads_all_chunks_smrtcell3.fq subreads_all_chunks_smrtcell4.fq > All_smrt_cell_subreads.fq.

And that would give one single error-corrected reads file, which I will use for gap filling later.

  2. I am getting this error when I run proovread on a small chunk:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 25199
[main_samview] truncated file.

Any idea how to solve this?

Cheers.

thackl commented 6 years ago
  1. Yes, or also all at once: cat subreads_chunks-*_smrtcell*.fq > All_smrt_cell_subreads.fq
  2. This would indicate that your .fq file is corrupted in some way. You can try running SeqFilter --careful < chunk.fq; that should validate the file format and/or indicate where the problem occurs. A more generic check is sketched below.
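
If you want an additional, tool-agnostic sanity check, a plain awk one-liner can spot the SEQ/QUAL length mismatch directly (a sketch; it assumes standard 4-line fastq records without line wrapping):

awk 'NR%4==2 {s=length($0)} NR%4==0 && length($0)!=s {print "SEQ/QUAL length mismatch around line " NR}' chunk.fq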
faraz89 commented 6 years ago

Many thanks Thomas. I'll check!!!

faraz89 commented 6 years ago

Hi Thomas,

  1. I ran the test run as well as the 1st SMRT cell run. It was successful, but it's taking way too long. Is it OK if I just run proovread without chunking on a 6 GB SMRT cell file? PacBio coverage is 12X and the genome size is 775 Mbp. I read in one of the issues that you said running without chunking is fine as long as the PacBio data doesn't exceed the genome size. How do I calculate that, and is it the case for my data?
  2. My short-read coverage is 125X. I also read that you said large coverage can increase the number of iterations. Should I just reduce it to 88X in order to have less runtime?
  3. As I have 125X per library (I have two libraries in total), do I really need to use both libraries, or can I just use one and save runtime?

I am close to understanding this tool properly, and your last inputs will be highly appreciated. Sorry for all the questions I have been asking.

Faraz.

thackl commented 6 years ago
  1. To compute coverage, divide the number of base pairs (bp) in the PacBio reads by the number of bp in the genome. If your PacBio data is fastq, the bp count is either roughly the file size in bytes divided by 2 (and a bit), or you can use SeqFilter to count it accurately. You don't want more than 1X PacBio coverage per chunk, so in your case not more than ~800 Mbp, or roughly a 2 GB fastq file (a rough worked example follows below).
  2. If you run proovread, specify your coverage, e.g. proovread --coverage 125; proovread will then internally subsample to the proper minimal coverage for each iteration. There shouldn't be a real difference in speed compared to a run with, for example, an 88X library.
  3. A single library with >50X is sufficient, no need to use both.
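
As a rough worked example with the numbers from this thread (a 6 GB fastq file per SMRT cell, 775 Mbp genome; the 2-bytes-per-bp approximation is only a rule of thumb):

# bytes / 2 is roughly bp; bp / genome_bp = coverage
awk 'BEGIN { bp = 6e9 / 2; printf "~%.1f Gbp, i.e. ~%.1fX of a 775 Mbp genome\n", bp/1e9, bp/775e6 }'
# -> about 3 Gbp, i.e. ~3.9X, so such a file should itself be split into roughly 4 chunks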

proovread does high-quality correction, but that comes at some performance cost. And at genome sizes close to ~1 Gbp, I know that it really can become an issue... If runtime remains a problem, you might want to have a look at Lordec or Jabba instead.

faraz89 commented 6 years ago

Many thanks, Thomas. I really appreciate the effort you put into helping others.

faraz89 commented 6 years ago

Hi Thomas,

I just got the results. The nohup file is attached, please check. If you notice, the shortest read changed from 492 bp (after stubby read removal) to 98 bp (in the output). Is that normal?

Also, regarding the increase in the number of sequences in the proovread results: is the reason the final output contains more sequences than the input that proovread cuts sequences into pieces (near chimeric/siameric regions), hence the higher sequence count?

nohup.txt