Issues with groupby columns

michaelbornholdt commented 2 years ago

Soo,

For prec_recall and Hitk we now have the case that the input groupby columns determine which columns the similarity df is sorted by. This has an important impact on your results. If you sort by something that is not unique, i.e. not unique in the input df, then you will get internal connections within the sub-dataframe that you are grouping.

Let's say you have, for example, a df with samples at different dosages. If you then set groupby_columns = Metadata_broad_sample, you will sort into subgroups that have several connections within each other (all the different doses). Your precision will then show the weird effects that @FloHu described in #62, for example. Similarly, hitk will give weird results because you are now looking at internal connections and not only the nearest neighbors of one sample.
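To make the dose example concrete, here is a minimal pandas sketch. It does not call cytominer-eval at all; the toy metadata, the `Metadata_dose` column name, and the `internal_pairs` helper are assumptions made up for illustration. It only counts how many pairs of rows end up in the same group for a given choice of groupby columns.

```python
import pandas as pd

# Hypothetical metadata: two compounds, each profiled at three doses.
meta = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A"] * 3 + ["BRD-B"] * 3,
        "Metadata_dose": [1, 2, 3, 1, 2, 3],  # assumed column name
    }
)


def internal_pairs(df, groupby_columns):
    """Count pairs of rows that fall into the same group (illustrative helper)."""
    sizes = df.groupby(groupby_columns).size()
    # A group of n rows contains n * (n - 1) / 2 within-group pairs.
    return int((sizes * (sizes - 1) // 2).sum())


# Grouping by sample only: each group holds all three doses.
pairs_by_sample = internal_pairs(meta, ["Metadata_broad_sample"])

# Grouping by sample and dose: every group is a single row.
pairs_by_sample_and_dose = internal_pairs(
    meta, ["Metadata_broad_sample", "Metadata_dose"]
)

print(pairs_by_sample, pairs_by_sample_and_dose)
```

With the non-unique grouping, each compound contributes three within-group pairs; these are the "internal connections" that leak into precision and hitk. With the unique grouping there are none.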

Either we keep it this way and make users aware of it, or we find some workaround. Maybe the solution is to not allow anything other than unique groupby_cols?
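If we went with the "only unique groupby_cols" option, the check could look roughly like the sketch below. This is a suggestion, not existing cytominer-eval code: the function name `check_unique_groupby` and the example data are hypothetical, and it relies only on the standard pandas `DataFrame.duplicated` call.

```python
import pandas as pd


def check_unique_groupby(df, groupby_columns):
    """Raise if groupby_columns do not identify each row uniquely.

    Hypothetical guard, not part of cytominer-eval.
    """
    # keep=False marks every member of a duplicated group, not just the repeats.
    duplicated = df.duplicated(subset=groupby_columns, keep=False)
    if duplicated.any():
        raise ValueError(
            f"groupby columns {groupby_columns} are not unique: "
            f"{int(duplicated.sum())} rows share a group with another row"
        )


# Hypothetical metadata: two compounds at two doses each.
meta = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-B", "BRD-B"],
        "Metadata_dose": [1, 2, 1, 2],  # assumed column name
    }
)

# Passes silently: sample plus dose identifies each row.
check_unique_groupby(meta, ["Metadata_broad_sample", "Metadata_dose"])

# Fails: sample alone groups the two doses together.
try:
    check_unique_groupby(meta, ["Metadata_broad_sample"])
except ValueError as err:
    print(err)
```

Whether this should be a hard error or just a warning is part of the open question above.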

michaelbornholdt commented 2 years ago

This is very hard to explain to someone who is not deep in the matter...

gwaybio commented 2 years ago

This to me seems like an important problem for us to solve

michaelbornholdt commented 2 years ago

Will be solved after the 11/12, after my thesis.

michaelbornholdt commented 2 years ago

@shntnu The text above gives an example. This should be discussed in the context of the architecture overhaul of Cyto eval.

cytomining / cytominer-eval

Issues with groupby columns #67