cschin / Peregrine

Peregrine: Fast Genome Assembler Using SHIMMER Index

assembly contiguity not good - coverage parameter? #38

Closed mgraeff closed 3 years ago

mgraeff commented 3 years ago

Hi,

I prepared subsamples with different coverages of my dataset and assembled each of them.

The command:

pg_run.py asm \
    seqdata.$ACC.$Q.$SET.$DET.lst 48 48 48 48 48 48 48 48 48 \
    --with-consensus --shimmer-r 8 --best_n_ovlp 16 \
    --output $outputTMP

The NGx plot of the assemblies: Peregrine-HiFi-coverage-NGx.pdf
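For readers unfamiliar with the metric, NGx can be computed from a list of contig lengths and an estimated genome size. The following is a minimal sketch and not the script used to produce the plot above; the function name `ngx` and the need to supply `genome_size` yourself are assumptions made for illustration.

```python
def ngx(contig_lengths, genome_size, x=50):
    """Return NGx for an assembly.

    Contigs are sorted longest first; NGx is the length of the contig at
    which the running total first reaches x percent of the (estimated)
    genome size. Returns 0 if the assembly never reaches that total.
    """
    threshold = genome_size * x / 100.0
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= threshold:
            return length
    return 0


def ngx_curve(contig_lengths, genome_size, step=10):
    """Return (x, NGx) pairs for x = step, 2*step, ..., 100."""
    return [(x, ngx(contig_lengths, genome_size, x)) for x in range(step, 101, step)]
```

Evaluating `ngx_curve` for each coverage subsample and plotting the pairs gives one NGx curve per assembly, which makes it easy to see whether contiguity follows coverage.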

The problem: in my experience with this dataset, at least the higher-coverage subsamples should yield more contiguous assemblies. What is especially concerning is that all subsamples produced similar but not identical contig lengths, and the results are not even ordered by coverage. In short, the behaviour looks random, as if a parameter were not set correctly. This leads to my question:

Is there some standard parameter limiting the maximum input coverage? Or another parameter I have to adjust that could explain this behaviour?

I would be grateful for any hint!

cschin commented 3 years ago

@mgraeff It is sort of a known effect that the contig size may not increase when the coverage increases. Peregrine does minimal error correction. Higher coverage can increase the total number of highly erroneous reads. Those additional erroneous reads may produce inconsistent branches in the assembly graph, which will break contigs. Peregrine does not have a specific way to limit the coverage. (It is rather easy to subsample the reads used as input.) BTW, thanks for the coverage analysis. It is actually useful for understanding the coverage vs. contiguity relationship in order to optimize the algorithm. It may be possible to derive a good coverage-based heuristic that produces better assemblies automatically. We will need more information from analyses like the one you did. Thanks for that.
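Since Peregrine has no coverage limit of its own, the subsampling has to happen before assembly. The sketch below shows one way to do it, assuming the read lengths are already known and a genome size estimate is available. The function `subsample_to_coverage` and its parameters are hypothetical and not part of Peregrine.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, target_coverage, seed=0):
    """Randomly select reads until the target coverage is reached.

    read_lengths:    dict mapping read name -> read length in bases
    genome_size:     estimated genome size in bases (an assumption you supply)
    target_coverage: desired fold coverage, e.g. 30 for 30x
    Returns the list of selected read names.
    """
    target_bases = target_coverage * genome_size
    names = sorted(read_lengths)      # fixed order so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(names)

    selected, total = [], 0
    for name in names:
        if total >= target_bases:
            break
        selected.append(name)
        total += read_lengths[name]
    return selected
```

The selected names can then be used to write a reduced read file, which is listed in the input file given to `pg_run.py asm`. Random selection keeps the read length and quality distribution of the full dataset; filtering by read quality instead would be a different strategy and is not shown here.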

mgraeff commented 3 years ago

@cschin Thanks for this information. It is good to know that this behaviour is just a matter of a relatively low coverage optimum and not a mistake in my parameters. I am glad if this also helps you.