bagashe closed this pull request 4 years ago
Your initial comment on this PR says: "Reads.minReadLength should now default to a potentially smaller value (from the current default value of 10,000)."
Why would this be the case? Can you clarify?
If desired coverage is specified, then minReadLength can be conceptually repurposed to specify the lower bound of what an acceptable read length is for Shasta, in general. For example, if Shasta's algorithm works best when read lengths are longer than 3000 bases, then that's what we can set it to. In other words, it can now be a function of something other than available coverage.
"if Shasta's algorithm works best when read lengths are longer than 3000 bases, then that's what we can set it to."
Oh I see. In practice, at least for human genomes, 10 Kb seems a good value - probably due to the existence of LINE repeats in the human genome which are 6 Kb long. In other words, if you only have coverage 30X in reads longer than 10 Kb, it is probably not a good idea to increase coverage by allowing shorter reads. What you really need is more coverage.
The more common usage pattern, at least for human genomes, will be the case where you have a lot of coverage - say 200X - and you want to reduce it to something like 70X by increasing minReadLength.
Closing this in favor of https://github.com/chanzuckerberg/shasta/pull/178.
Desired coverage, as a number of raw bases, can be set using the new configuration parameter Reads.desiredCoverage. This value will be auto-generated by the GenerateConfig.py script (different PR).
Reads.minReadLength should now default to a potentially smaller value (from the current default value of 10,000).
[...] Reads.minReadLength, the program will abort.
[...] Reads.minReadLength, then Shasta will increase the value of minReadLength to arrive close to the desired coverage.
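
The selection rule described above can be sketched roughly as follows. This is only an illustration in Python, not Shasta's actual implementation: the function name `choose_min_read_length` and its arguments are hypothetical, and the abort condition (too few bases in reads at or above the configured minimum) is an interpretation of the description, since part of that text is missing.

```python
def choose_min_read_length(read_lengths, min_read_length, desired_coverage=None):
    """Pick a read-length cutoff that lands close to the desired coverage.

    read_lengths:     iterable of read lengths, in bases
    min_read_length:  configured lower bound (plays the role of Reads.minReadLength)
    desired_coverage: desired number of raw bases (plays the role of
                      Reads.desiredCoverage), or None if not specified
    """
    # Without a desired coverage, the configured minimum is used unchanged.
    if desired_coverage is None:
        return min_read_length

    # Only reads at or above the configured minimum are ever eligible.
    eligible = sorted((n for n in read_lengths if n >= min_read_length), reverse=True)

    # Assumed abort case: even using every eligible read is not enough.
    if sum(eligible) < desired_coverage:
        raise ValueError("available coverage is below the desired coverage")

    # Take the longest reads first and stop once the desired coverage is reached.
    # The length of the last read taken becomes the effective cutoff.
    total = 0
    for length in eligible:
        total += length
        if total >= desired_coverage:
            return length

    # Reached only when there are no eligible reads and desired_coverage <= 0.
    return min_read_length


if __name__ == "__main__":
    lengths = [12000, 8000, 15000, 3000, 20000, 5000]
    print(choose_min_read_length(lengths, 4000, 40000))
```

Because the desired coverage is expressed in raw bases rather than as a fold value, a target such as 70X on a genome of roughly 3.1 Gb would correspond to about 217 Gb. Taking the longest reads first means the cutoff can only move up from the configured minimum, which matches the "200X down to 70X" usage pattern discussed in the thread.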