Reduce coverage to the desired value if applicable.

chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Reduce coverage to the desired value if applicable. #177

Closed bagashe closed 4 years ago

bagashe commented 4 years ago

Desired coverage, as a number of raw bases, can be set using the new configuration parameter Reads.desiredCoverage. This value will be auto-generated by the GenerateConfig.py script (in a separate PR). Reads.minReadLength should now default to a potentially smaller value (from the current default value of 10,000).

If there isn't enough coverage available with reads longer than Reads.minReadLength, the program will abort.
If more than the desired coverage is available with reads longer than Reads.minReadLength, Shasta will increase the value of Reads.minReadLength to arrive close to the desired coverage.
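
The two rules above can be sketched roughly as follows. This is only an illustration in Python, not the code in this PR: the function name, the use of None to mean "no target set", treating the lower bound as inclusive, and keeping every read that ties with the cutoff are all assumptions.

```python
def choose_min_read_length(read_lengths, min_read_length, desired_coverage):
    """Pick a read length cutoff that keeps about desired_coverage raw bases.

    Hypothetical helper for illustration; not taken from the Shasta sources.
    """
    # Assumption: None means no desired coverage was specified.
    if desired_coverage is None:
        return min_read_length

    # Only reads at or above the configured lower bound are eligible
    # (inclusive bound is an assumption).
    eligible = sorted(
        (n for n in read_lengths if n >= min_read_length), reverse=True
    )

    # Rule 1: abort when the eligible reads cannot reach the target.
    available = sum(eligible)
    if available < desired_coverage:
        raise RuntimeError(
            f"Only {available} raw bases available, {desired_coverage} requested."
        )

    # Rule 2: walk from the longest read down and stop once the target is met.
    # The length of the last read kept becomes the new cutoff.
    kept = 0
    for length in eligible:
        kept += length
        if kept >= desired_coverage:
            return length

    return min_read_length
```

For example, with reads of 12000, 8000, 5000, 3000 and 2000 bases, a lower bound of 3000 and a target of 20000 raw bases, the sketch raises the cutoff to 8000, because the two longest reads already supply the target.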

paoloczi commented 4 years ago

Your initial comment on this PR says:

Reads.minReadLength should now default to a potentially smaller value (from the current default value of 10,000).

Why would this be the case? Can you clarify?

bagashe commented 4 years ago

"Your initial comment on this PR says: Reads.minReadLength should now default to a potentially smaller value (from the current default value of 10,000). Why would this be the case? Can you clarify?"

If desired coverage is specified, then minReadLength can be conceptually repurposed to specify the lower bound of what an acceptable read length is for Shasta, in general. For example, if Shasta's algorithm works best when read lengths are longer than 3000 bases, then that's what we can set it to. In other words, it can now be a function of something other than available coverage.

paoloczi commented 4 years ago

"if Shasta's algorithm works best when read lengths are longer than 3000 bases, then that's what we can set it to."

Oh I see. In practice, at least for human genomes, 10 Kb seems to be a good value, probably because of the LINE repeats in the human genome, which are 6 Kb long. In other words, if you only have 30X coverage in reads longer than 10 Kb, it is probably not a good idea to increase coverage by allowing shorter reads. What you really need is more coverage.

The more common usage pattern, at least for human genomes, will be the case where you have a lot of coverage - say 200X - and you want to reduce it to something like 70X by increasing minReadLength.
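
Since the desired coverage in this PR is expressed as a number of raw bases, a fold value such as 70X has to be multiplied by the genome size first. A minimal sketch of that arithmetic, assuming a human genome size of about 3.1 Gb (the constant and the function name are illustrative, not part of Shasta):

```python
# Assumption: approximate human genome size in bases.
HUMAN_GENOME_SIZE = 3_100_000_000


def fold_to_raw_bases(fold_coverage, genome_size=HUMAN_GENOME_SIZE):
    """Convert fold coverage (e.g. 70 for 70X) to a number of raw bases."""
    return round(fold_coverage * genome_size)


available = fold_to_raw_bases(200)  # raw bases in a 200X dataset
desired = fold_to_raw_bases(70)     # raw bases to keep for 70X

# Fraction of the available bases that would be kept.
fraction_kept = desired / available
```

Under these assumptions, 70X corresponds to 217,000,000,000 raw bases, which is 35% of what a 200X dataset provides.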

bagashe commented 4 years ago

Closing this in favor of https://github.com/chanzuckerberg/shasta/pull/178.