cytomining / cytominer-eval

Common Evaluation Metrics for DataFrames
BSD 3-Clause "New" or "Revised" License

Support multiple columns as replicate indicators #28

Open gwaybio opened 3 years ago

gwaybio commented 3 years ago

In grit() and mp_value() specifically, we can add support for a list of columns indicating replicates, rather than just a single string (i.e., one column).
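As a rough illustration of the idea (not the cytominer-eval implementation), a list of replicate columns could be collapsed into one replicate label per row using plain pandas. The helper name `replicate_key` and the column names `gene` and `cell_line` are hypothetical and used only for this sketch.

```python
import pandas as pd


def replicate_key(meta, replicate_cols):
    """Return one integer label per row identifying its replicate group.

    Hypothetical helper: accepts a single column name or a list of names.
    """
    # Normalize a single string into a one-element list.
    if isinstance(replicate_cols, str):
        replicate_cols = [replicate_cols]
    # Rows sharing the same values in all listed columns get the same label.
    # sort=False numbers the groups in order of first appearance.
    return meta.groupby(replicate_cols, sort=False).ngroup()


meta = pd.DataFrame(
    {
        "gene": ["A", "A", "B", "B", "A"],
        "cell_line": ["X", "X", "X", "Y", "Y"],
    }
)

single = replicate_key(meta, "gene")                # one column
multi = replicate_key(meta, ["gene", "cell_line"])  # list of columns
```

With a single column, all "A" rows count as replicates of each other; with both columns, an "A" profile in cell line "X" is no longer a replicate of an "A" profile in cell line "Y".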

gwaybio commented 3 years ago

We can also do this for group_id.

gwaybio commented 3 years ago

I decided today not to pursue multiple-column support for group_id. The difficulty arises when it comes time to define control perturbations. We currently formulate grit based on pairwise correlations between the target profiles and all other profiles. If we add multiple columns to group_id, we will also need to specify a column hierarchy indicating which group should be ignored when determining the control (reference).

In other words, specifying multiple groups would require us to specify which group should be ignored when specifying controls. For example, if I calculate grit on two plates of CRISPR profiles using "target gene" and "cell line" as the group_id, I only want to use the pairwise correlations to controls within the same cell line. Adding multiple groups would complicate things substantially. The current approach, calling evaluate twice with only one group each time, is our preferred method in version 0.1. Calling it separately per unique group also currently reduces the number of unnecessary pairwise correlation calculations.
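The "one call per group" pattern described here can be sketched with pandas alone. The score below is a toy stand-in (mean Pearson correlation of each target's profiles to the control profiles), not the actual grit formula, and the function name `mean_corr_to_controls`, the feature names, and the `"ctrl"` label are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mean_corr_to_controls(profiles, features, group_col, control):
    """Toy stand-in for a grit-like score (NOT the cytominer-eval formula).

    For each non-control group, average the Pearson correlations between
    its profiles and the control profiles found in `profiles`.
    """
    corr = np.corrcoef(profiles[features].to_numpy())  # profile x profile
    is_ctrl = (profiles[group_col] == control).to_numpy()
    scores = {}
    for name in profiles.loc[~is_ctrl, group_col].unique():
        is_target = (profiles[group_col] == name).to_numpy()
        # Only target-vs-control entries of the correlation matrix are used.
        scores[name] = corr[np.ix_(is_target, is_ctrl)].mean()
    return pd.Series(scores, name="score")


features = ["f1", "f2", "f3", "f4"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(8, 4)), columns=features)
df["gene"] = ["ctrl", "ctrl", "A", "B"] * 2
df["cell_line"] = ["X"] * 4 + ["Y"] * 4

# One call per cell line: controls are drawn only from the same cell line,
# and no cross-cell-line correlations are ever computed.
results = {
    cell_line: mean_corr_to_controls(sub, features, "gene", "ctrl")
    for cell_line, sub in df.groupby("cell_line")
}
per_line = pd.concat(results)  # indexed by (cell_line, gene)
```

Splitting first keeps the control definition unambiguous, since each call sees a single grouping column, and each correlation matrix covers only the profiles from one cell line.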