andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License

Clarify difference between `--covcut` and `--depthcutoff` #212

Open tavareshugo opened 4 months ago

tavareshugo commented 4 months ago

If I understood the documentation correctly, --covcut is only used to calculate the genome coverage at a given depth (10x by default) and doesn't influence the abundance estimation. Is that correct?

Only --depthcutoff would affect abundance estimation, as it would exclude sites with depth < threshold (0 by default, i.e. all sites with at least 1 read are used).

If that is the case, what is the purpose of having --covcut? It would seem more intuitive to me that the "coverage" value output would be the fraction of the genome that was used for the demix inference step.

joshuailevy commented 4 months ago

Good question. --covcut only changes the depth threshold used to calculate the "coverage" output (10x by default), as there's no reason everyone needs to stick with that arbitrary choice of threshold. We're working on a more detailed version of our documentation now, and we'll make sure to describe the differences more clearly to users there.
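To make the distinction concrete, here is a minimal Python sketch of the two roles described above. It is not Freyja's actual code: the function names, the toy depth values, the use of `>=` for the comparison, and reporting coverage as a percentage are all assumptions made for illustration.

```python
def coverage_at_depth(depths, covcut=10):
    """Percent of genome positions whose depth is at least `covcut`.

    Hypothetical helper: only affects the reported coverage value,
    not which sites are used for abundance estimation.
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= covcut)
    return 100.0 * covered / len(depths)


def sites_kept_for_demix(site_depths, depthcutoff=0):
    """Sites retained for abundance estimation.

    Hypothetical helper: drops any site whose depth is below `depthcutoff`.
    """
    return {pos: d for pos, d in site_depths.items() if d >= depthcutoff}


# Toy per-position depths for an 8-position "genome" (made-up values).
depths = [0, 5, 12, 30, 8, 50, 10, 3]

# Toy depths at four informative sites (made-up values).
site_depths = {241: 5, 3037: 12, 14408: 30, 23403: 8}

cov_10x = coverage_at_depth(depths, covcut=10)
cov_5x = coverage_at_depth(depths, covcut=5)
kept_default = sites_kept_for_demix(site_depths, depthcutoff=0)
kept_10x = sites_kept_for_demix(site_depths, depthcutoff=10)
```

In this sketch, changing `covcut` only changes the number returned by `coverage_at_depth`, while changing `depthcutoff` changes which sites survive into the estimation step.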

tavareshugo commented 4 months ago

Thanks for clarifying. I guess describing the difference in the docs would indeed help, mostly to keep people from mistakenly using --covcut when they mean to use --depthcutoff. Both options are essentially depth thresholds, but they are used for different things: calculating coverage and estimating abundances, respectively.

Related to this, I wonder if it would be worth outputting the fraction of informative sites above --depthcutoff that were used for the demix step. For example, if there are 1000 total informative sites from UShER, what fraction of those were used by demix?
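As a rough illustration of the proposed output, the fraction could be computed along these lines. This is only a sketch of the suggestion, not an existing Freyja feature: `fraction_sites_used` and its arguments are hypothetical names, positions missing from the depth table are assumed to have depth 0, and the positions and depths are made up.

```python
def fraction_sites_used(informative_positions, depth_by_position, depthcutoff=0):
    """Fraction of informative sites whose depth is at least `depthcutoff`.

    Hypothetical helper; positions absent from `depth_by_position`
    are treated as having depth 0 (an assumption).
    """
    total = len(informative_positions)
    if total == 0:
        return 0.0
    used = sum(
        1
        for pos in informative_positions
        if depth_by_position.get(pos, 0) >= depthcutoff
    )
    return used / total


# Made-up informative sites and their depths.
informative_positions = [241, 3037, 14408, 23403]
depth_by_position = {241: 5, 3037: 12, 14408: 30, 23403: 8}

frac_default = fraction_sites_used(informative_positions, depth_by_position)
frac_10x = fraction_sites_used(informative_positions, depth_by_position, depthcutoff=10)
```

With the default cutoff of 0 every site counts as used, so the value is only informative once a non-zero --depthcutoff is set.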