BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License

Using proovread to remove errors from assembly #83

Closed ramadatta closed 8 years ago

ramadatta commented 8 years ago

Hi Thomas. I have generated a PacBio assembly, which gave me 15K contigs and ~400K singletons. Can proovread be used to correct bases in the assembled contigs, and in the singletons as well, before I merge them with an alternative assembly?

thackl commented 8 years ago

I did not fully understand what you want to do with the singletons. Do you want to use them to correct the assembled contigs, or do you want to use Illumina reads to correct the singletons, the assembly, or both?

In general, you can run proovread on an assembly, i.e. use Illumina reads to correct minor errors in contigs. However, I did not implement any optimizations to split large contigs internally into manageable chunks, so if you have contigs of Mbp size, you might run into memory trouble. Also, since your contigs are consensus sequences with reduced error rates, you should not expect a typical PacBio error profile (15% insertions/deletions), but rather only a few errors. I therefore suggest tweaking the mapping behaviour: use strict mapping right away and skip the iterations. Put the following in a config file (for example my-proovread.cfg):

'mode-tasks' => {
      'asm' => ['read-long', 'bwa-mr-finish']
}

and run the following command:

proovread -c my-proovread.cfg --mode asm ...
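
Since Mbp-sized contigs are the ones that might cause memory trouble, it can help to check contig lengths before starting a run. The following is only a minimal sketch and not part of proovread: the function names `contig_lengths` and `large_contigs` are hypothetical, and the 1 Mbp default threshold is an assumption taken from the "Mbp size" remark above, not a documented limit.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_id: sequence_length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: add its length to the current contig.
            lengths[current] += len(line)
    return lengths


def large_contigs(lengths: Dict[str, int], min_len: int = 1_000_000) -> List[str]:
    """Return sorted IDs of contigs with length >= min_len (assumed threshold)."""
    return sorted(name for name, n in lengths.items() if n >= min_len)
```

For a real assembly, one could pass an open file handle, e.g. `large_contigs(contig_lengths(open("assembly.fa")))`, where `assembly.fa` is a placeholder file name. Contigs flagged this way are the ones to watch for memory usage.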

ramadatta commented 8 years ago

Sorry for not being clear. I have 80x of Illumina data and want to error correct both the PacBio assembly and the singletons. For the assembly, I was a little unsure whether proovread could perform the error correction, since the contigs range in length from 100 kb to a few Mb.

In that case, I will run the error correction of the assembly and of the singletons separately, with different parameters as you described above. Thank you so much!

thackl commented 8 years ago

Sounds reasonable. Also, you should probably use the .untrimmed output of proovread. Otherwise, you might end up cutting contigs into more pieces, either because reads are distributed unevenly among multiple repeat copies or because of unexpected effects of the chimera/siamera trimming on full contigs.