DASL-Lab / provoc

PROportions of Variants of Concern using counts, coverage, and a variant matrix.
https://dasl-lab.github.io/provoc/
MIT License

Change the way the main `provoc` function works. #2

Closed: DBecker7 closed this issue 6 months ago

DBecker7 commented 8 months ago

The interface requires data frames with particular formatting and column names. This is not acceptable.

Ideally, the provoc function should take something like a formula argument similar to a binomial glm, e.g.,

```r
provoc(
    formula = cbind(count_column_name, coverage_column_name) ~ B.1.1.7 + B.1.617.2,
    data = mydata,
    mutation_defs = astronomize(),
    by = "sra"
)
```

where the user can specify the lineages however they want. If they supply a matrix with column names corresponding to their lineages, it should work; otherwise the function should use a built-in set of definitions (currently `astronomize()`, but this can change). The lineage definitions will be joined with `mydata` on the relevant mutations, potentially with a warning if there are few mutations in common. In practice, very few mutations are used in lineage definitions, so the warning cutoff may need to be quite low.
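The join-and-warn step could look roughly like the sketch below. It is only an illustration: `align_definitions()`, the `min_shared` cutoff, and the `mutation` column name are hypothetical, and it assumes the definition matrix has mutations as row names and lineages as column names.

```r
# Hypothetical helper, not part of provoc: keep only the mutations that
# appear both in the data and in the lineage definition matrix.
# Assumes `data` has a column named `mutation`, and that `mutation_defs`
# has mutations as row names and lineages as column names.
align_definitions <- function(data, mutation_defs, min_shared = 3) {
  shared <- intersect(unique(data$mutation), rownames(mutation_defs))
  if (length(shared) == 0) {
    stop("No mutations in common between the data and the lineage definitions.")
  }
  if (length(shared) < min_shared) {
    warning("Only ", length(shared),
            " mutation(s) shared with the lineage definitions.")
  }
  list(
    data = data[data$mutation %in% shared, , drop = FALSE],
    defs = mutation_defs[shared, , drop = FALSE]
  )
}

# Toy inputs (made-up values for illustration only)
defs <- matrix(
  c(1, 0,
    1, 1,
    0, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("m1", "m2", "m3"), c("B.1.1.7", "B.1.617.2"))
)
mydata <- data.frame(
  mutation = c("m1", "m2", "m4"),
  count    = c(10, 5, 2),
  coverage = c(100, 50, 40)
)

aligned <- align_definitions(mydata, defs, min_shared = 2)
```

With a low `min_shared`, the two shared mutations pass without a warning; raising the cutoff would trigger one.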

If `mutation_defs` is `NULL`, then the function should check the column names of `mydata`.

A formula such as `cbind(count, coverage) ~ .` should check for known lineage definition names among the columns of `mydata` and simply use those as the definitions. If the match is ambiguous, the function should return an error.
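One possible way to resolve the right-hand side is sketched below. `resolve_lineages()` and `known_lineages` are hypothetical names. Since "ambiguous" is not defined here, the sketch only treats the case of no known names being found as an error.

```r
# Hypothetical helper: work out which lineages the formula refers to.
# `known_lineages` stands in for whatever built-in list of names is used.
resolve_lineages <- function(formula, data, known_lineages) {
  rhs <- all.vars(formula[[3]])  # variable names on the right-hand side
  if (!identical(rhs, ".")) {
    return(rhs)                  # lineages were named explicitly
  }
  # `~ .` case: look for known lineage names among the data's columns
  found <- intersect(colnames(data), known_lineages)
  if (length(found) == 0) {
    stop("`~ .` was used, but no known lineage names were found in the data.")
  }
  found
}

known <- c("B.1.1.7", "B.1.617.2", "BA.1")
wide <- data.frame(
  count = 1:2, coverage = c(10, 20),
  B.1.1.7 = c(1, 0), BA.1 = c(0, 1)
)

lineages <- resolve_lineages(cbind(count, coverage) ~ ., wide, known)
explicit <- resolve_lineages(cbind(count, coverage) ~ B.1.1.7 + B.1.617.2,
                             wide, known)
```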

Note that `count_column_name` in `cbind(count_column_name, coverage_column_name)` is interpreted as a column in `mydata`; it doesn't need to be a vector in the user's environment. This will require some knowledge of parsing formulas in R.
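As a rough sketch of the formula parsing involved, the left-hand side can be inspected as an unevaluated call, so the names are looked up in the data rather than in the user's environment. `parse_response()` is a hypothetical helper, and it assumes the two `cbind()` arguments are plain column names.

```r
# Hypothetical helper: pull the count and coverage columns out of `data`
# using the names given in cbind(...) on the left-hand side of the formula.
parse_response <- function(formula, data) {
  lhs <- formula[[2]]  # unevaluated left-hand side, e.g. cbind(a, b)
  if (!is.call(lhs) || !identical(lhs[[1]], as.name("cbind")) ||
      length(lhs) != 3) {
    stop("The left-hand side must have the form cbind(count, coverage).")
  }
  # Turn the two arguments into character column names
  cols <- unname(vapply(as.list(lhs)[-1], deparse, character(1)))
  missing_cols <- setdiff(cols, colnames(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found in data: ",
         paste(missing_cols, collapse = ", "))
  }
  list(names = cols, count = data[[cols[1]]], coverage = data[[cols[2]]])
}

# `n_alt` and `depth` exist only as columns, not as variables
mydata <- data.frame(n_alt = c(3, 7), depth = c(30, 70))
resp <- parse_response(cbind(n_alt, depth) ~ B.1.1.7, mydata)
```

An alternative would be to evaluate the left-hand side with `eval(lhs, data, environment(formula))`, which is closer to how `model.frame()` behaves but also allows expressions other than bare column names.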

Currently, the function looks for a column labelled "sample", then applies the model to each sample separately. This should be replaced with a `by` argument, which allows the user to specify a column of their choosing (in this example, `mydata$sra`). `by` may also name multiple columns, such as location and date, which together uniquely define each sample.
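A minimal sketch of grouping on one or more `by` columns with base R's `split()` follows. `fit_by_group()` and `fit_one` are hypothetical, and the stand-in fitting function just computes a pooled proportion rather than the actual provoc model.

```r
# Hypothetical helper: apply a fitting function to each group defined by
# one or more columns named in `by`.
fit_by_group <- function(data, by, fit_one) {
  stopifnot(all(by %in% colnames(data)))
  # data[by] is a data frame (a list of columns), so split() groups on the
  # interaction of all `by` columns; drop = TRUE removes empty combinations.
  groups <- split(data, data[by], drop = TRUE)
  lapply(groups, fit_one)
}

# Toy data (made-up values); each location/date pair defines one sample
mydata <- data.frame(
  location = c("A", "A", "B", "B"),
  date     = c("d1", "d2", "d1", "d1"),
  count    = c(1, 2, 3, 4),
  coverage = c(10, 20, 30, 40)
)

# Stand-in for the real per-sample model fit
pooled <- function(d) sum(d$count) / sum(d$coverage)
res <- fit_by_group(mydata, by = c("location", "date"), fit_one = pooled)
```

The result is a list named by the combined group labels (for example `"B.d1"`), which could then be bound back together with the `by` columns attached.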

Ideally, the output should be an object of class `provoc`.
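For illustration, an S3 class along these lines would be enough to support custom `print()`, `summary()`, or `plot()` methods later. The constructor name `new_provoc()` and the fields stored in the object are assumptions, not an existing provoc API.

```r
# Hypothetical constructor: wrap results in a list with class "provoc".
new_provoc <- function(estimates, formula, by = NULL) {
  structure(
    list(estimates = estimates, formula = formula, by = by),
    class = "provoc"
  )
}

# S3 print method, dispatched automatically for objects of class "provoc"
print.provoc <- function(x, ...) {
  cat("<provoc> fit with", nrow(x$estimates), "lineage estimate(s)\n")
  print(x$estimates)
  invisible(x)
}

# Placeholder estimates (made-up numbers, for illustration only)
fit <- new_provoc(
  estimates = data.frame(lineage = c("B.1.1.7", "B.1.617.2"),
                         rho = c(0.6, 0.3)),
  formula = cbind(count, coverage) ~ B.1.1.7 + B.1.617.2,
  by = "sra"
)
print(fit)
```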

danerkestey commented 8 months ago

Hi @DBecker7, thank you for your suggestions. I agree with them. Here are some possible improvements to make the function more adaptable and user-friendly:

Let me know what you think about these improvements and whether I should alter my approach.