Different results when running proovread on subreads independently and all combined

BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus

MIT License

Different results when running proovread on subreads independently and all combined #110

Closed Huanle closed 7 years ago

Huanle commented 7 years ago

Hi Thomas,

Thanks for developing this program. It is very efficient and helpful. I ran proovread on the following PacBio subread files:

12041_GBR_UNSW_m160521_203936_42272_c100938272550000001823211006101606_s1_p0.1.subreads.fasta.gz
12041_GBR_UNSW_m160521_203936_42272_c100938272550000001823211006101606_s1_p0.2.subreads.fasta.gz
12041_GBR_UNSW_m160521_203936_42272_c100938272550000001823211006101606_s1_p0.3.subreads.fasta.gz

and was able to get trimmed and untrimmed fa/fq files for each of them.

But when I combined all three subread files and then ran proovread, I got only untrimmed.fa/fq. What would explain this?

Also, should I use the trimmed.fa/fq or the untrimmed.fa/fq for downstream analyses such as genome assembly, or simply for scaffolding contigs assembled from short Illumina reads?

Thanks in advance for your help.

Kind Regards, Huanle

thackl commented 7 years ago

Hey Huanle,

this doesn't sound right. You should get trimmed.fa/fq in both cases. Seems like something went wrong with the merged read set. Can you post the log files of the two runs?

Cheers Thomas

Huanle commented 7 years ago

Hi Thomas,

Thank you very much for your prompt response. I deleted the log file generated by the independent runs; I will regenerate it. The content below is from the log file produced by analyzing the combined files. I hope it contains useful information that will help your diagnosis. Again, I would love to have your suggestion on the usage of the untrimmed results.

The command I used:

proovread -l 12041_GBR_UNSW_m160603_211819_42272_c100938742550000001823211006101685_s1_p0.1.subreads.fasta -u ../final_contigs.fasta -s ../230.bbnorm.1.fastq.gz -s ../230.bbnorm.2.fastq.gz -s ../500.bbnorm.1.fastq.gz -s ../500.bbnorm.2.fastq.gz -s ../all.mp.fastq -p c100938742550000001823211006101685_s1p0.1 -t 16 --overwrite

Thanks heaps again.

Cheers - Huanle combined.log

thackl commented 7 years ago

Hi, thanks for the log, that helped a lot.

1) You are using a rather old version of proovread. Try to get the latest one from GitHub.
2) There is an error in the log; unfortunately it is well hidden (this is fixed in the newer version). proovread cannot handle gzipped short-read files, because its short-read sampling process needs random access to the short-read data. The results you got are from correcting the PacBio reads with the contigs only; the short reads were ignored.
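As a minimal sketch of the workaround for point 2, the gzipped short-read files can be decompressed to plain FASTQ before they are passed to `-s`. The helper name `decompress_reads` is made up for illustration and is not part of proovread; it only uses the Python standard library.

```python
import gzip
import shutil
from pathlib import Path


def decompress_reads(gz_path, out_dir=None):
    """Write an uncompressed copy of a .gz file and return its path.

    Hypothetical helper, not part of proovread. Paths that do not end
    in ".gz" are returned unchanged.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        return gz_path  # already uncompressed, nothing to do
    out_dir = Path(out_dir) if out_dir else gz_path.parent
    out_path = out_dir / gz_path.stem  # "reads.fastq.gz" -> "reads.fastq"
    # Stream the data so large read files are never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `decompress_reads("../230.bbnorm.1.fastq.gz")` would write `../230.bbnorm.1.fastq` next to the original, and that uncompressed path would then be the one given to `-s`.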

Cheers Thomas

Huanle commented 7 years ago

Hi Thomas, thanks very much again for your help in sorting this out. Do you recommend discarding the untrimmed results if I also get trimmed results from proovread?

Cheers - Huanle

thackl commented 7 years ago

For most PacBio reads, it is not possible to correct the entire read, e.g. due to spots with extremely high error rates or gaps in Illumina coverage. Untrimmed reads comprise the original read in full length, with both corrected and uncorrected regions. Trimmed reads are only the high-quality corrected parts of the read, spliced out into individual shorter pieces. For most applications, e.g. assembly, you probably want trimmed reads. However, untrimmed reads carry additional long-range information in cases where trimming results in multiple shorter trimmed pieces. This might be useful in scaffolding applications, or when looking at structural changes.

Huanle commented 7 years ago

Many thanks, Thomas. That makes sense and helps. I think I will replace the uncorrected regions with Ns and use them for scaffolding.
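A rough sketch of that masking step, under assumptions that are not confirmed in this thread: that uncorrected regions show up as low-quality bases in untrimmed.fq, that qualities are Phred+33 encoded, and that each record spans exactly four lines. The function names and the cutoff of 20 are placeholders for illustration, not proovread defaults.

```python
def mask_low_quality(seq, qual, min_phred=20, offset=33):
    """Replace every base whose Phred score is below min_phred with 'N'.

    min_phred=20 is an arbitrary placeholder cutoff (assumption).
    """
    return "".join(
        base if ord(q) - offset >= min_phred else "N"
        for base, q in zip(seq, qual)
    )


def mask_fastq(in_path, out_path, min_phred=20):
    """Copy a 4-line-per-record FASTQ file, masking low-quality bases."""
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            seq = src.readline().rstrip("\n")
            plus = src.readline()
            qual = src.readline().rstrip("\n")
            masked = mask_low_quality(seq, qual, min_phred)
            dst.write(header + masked + "\n" + plus + qual + "\n")
```

This masks base by base; masking whole low-quality windows instead might be closer to how the trimmed pieces are delimited, so the cutoff and granularity would need checking against the trimmed output.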