--no-sampling setting or not

xinwang-bio commented 7 years ago

I used proovread to correct my Iso-Seq data. I found that many researchers ask how to use proovread on Iso-Seq data. You recommended using the --no-sampling setting, but I am not very clear on what this setting means. What is the difference between using it and not using it?

Thank you

thackl commented 7 years ago

By default, proovread assumes a more or less even coverage of the PacBio reads, e.g. 50X, 100X, ..., which you specify via the --coverage parameter. Because the coverage is even and known, proovread subsamples the read sets during the different iterations for better speed, i.e. it runs three iterations with 33X each...

This behaviour does not make sense for Iso-Seq data, because the Illumina coverage of different reads can differ a lot. --no-sampling tells proovread to make no assumptions about the coverage and to not subsample during iterations. That way you also get the best performance for low-coverage transcripts. Hope that helps.
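
To make the effect concrete, here is a minimal Python sketch of the arithmetic only. It is not proovread's actual code: the function names, the transcript labels, and the coverage values are all hypothetical, and the 100X / 33X figures are simply the example numbers from above. It shows what a subsampling fraction chosen under an even-coverage assumption would do to transcripts whose real short-read coverage differs.

```python
# Illustrative arithmetic only; NOT proovread's implementation.
# All names and coverage values below are assumptions made for this example.

def subsample_fraction(assumed_coverage, per_iteration_coverage):
    """Fraction of short reads kept per iteration if coverage is assumed even."""
    return min(1.0, per_iteration_coverage / assumed_coverage)


def effective_coverage(true_coverage, fraction):
    """Expected coverage of one long read after keeping `fraction` of the short reads."""
    return true_coverage * fraction


# Example numbers from the explanation above: 100X assumed, 33X per iteration.
fraction = subsample_fraction(assumed_coverage=100, per_iteration_coverage=33)

# Hypothetical transcripts with very uneven short-read coverage, as in Iso-Seq.
transcripts = {"high_expression": 500.0, "medium_expression": 100.0, "low_expression": 5.0}

for name, true_cov in transcripts.items():
    sampled = effective_coverage(true_cov, fraction)
    print(f"{name}: {true_cov:.0f}X with all reads, about {sampled:.1f}X after subsampling")
```

In this toy example the low-expression transcript ends up with under 2X per iteration after subsampling, while keeping all reads (the situation --no-sampling aims for) leaves it at its full 5X.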

xinwang-bio commented 7 years ago

Thank you very much. Have a good weekend.

BioInf-Wuerzburg / proovread