Using Weighted Data with Phik

Statgnome commented 3 years ago

If I understand things correctly, because of how phik solves for rho, using weighted data with phik requires supplying weighted contingency tables to the key functions. With continuous data, a weighted observation still counts as 1 within its uniform bin rather than contributing its weight. Has there been any consideration of supporting weighted data? Can you provide any guidance on how to use weighted data with phik, for any type of data?
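To make "weighted contingency table" concrete, here is a minimal sketch in plain Python. It is not part of phik: the helper names `uniform_bin` and `weighted_contingency`, the sample values, and the bin range are all hypothetical. The only idea it illustrates is that each cell holds the sum of the weights of the rows falling in it, rather than a row count.

```python
from collections import defaultdict


def uniform_bin(value, lo, hi, n_bins):
    """Return the index of the equal-width bin on [lo, hi] that holds value."""
    if not lo <= value <= hi:
        raise ValueError("value lies outside the binning range")
    idx = int((value - lo) / (hi - lo) * n_bins)
    # The upper edge itself belongs to the last bin.
    return min(idx, n_bins - 1)


def weighted_contingency(xs, ys, weights):
    """Sum the weights (instead of counting rows) for each (x, y) pair."""
    cells = defaultdict(float)
    for x, y, w in zip(xs, ys, weights):
        cells[(x, y)] += w
    row_labels = sorted(set(xs))
    col_labels = sorted(set(ys))
    table = [[cells[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Hypothetical sample: one continuous variable, one categorical, one weight per row.
ages = [23.0, 37.5, 41.0, 58.0, 64.0]
groups = ["a", "b", "a", "b", "b"]
weights = [0.5, 2.0, 1.5, 1.0, 3.0]

# Bin the continuous variable first, then accumulate weights per cell.
age_bins = [uniform_bin(a, 20.0, 70.0, 5) for a in ages]
rows, cols, table = weighted_contingency(age_bins, groups, weights)
```

With pandas, the same table can be built with `pd.crosstab(index, columns, values=weights, aggfunc="sum")`. Whether such a table can be passed to phik and give meaningful results is exactly the open question here.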

Since this is not really an issue with the code or implementation, if other channels of communication are preferred, please let me know. I am trying to integrate phik into some analytics work, but weighting is very important to how the data is understood where I work.

mbaak commented 3 years ago

Interesting question, happy to discuss it here.

To use weighted data correctly, there are a few places where the library would need to be updated: the calculation of phik, the significance evaluation, and the outlier significances would all be affected.

For (only) the calculation of phik there are three important things I can think of:

the out-of-the-box chi^2 value is no longer appropriate, b/c fluctuations in cells scale with the weight per cell. This can (most likely) be accounted for by using an updated chi^2 formula;
the d.o.f. will be different, no longer (r - 1) * (k - 1), although this could be determined experimentally. We only evaluate phik in case chi^2 > dof, to account for statistical noise, else phik = 0; and
(most importantly) the formula for the maximum chi^2 value, N * min(r-1, k-1), for which phik=1, is no longer valid and needs to be corrected. Finding the right formula for that is important. Perhaps it's (sum_i w_i) * min(r-1, k-1), but that needs investigation.
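The three quantities above can be sketched in plain Python to see where weights enter. This is only an illustration under stated assumptions, not phik's implementation: the function names are hypothetical, the chi^2 is the ordinary Pearson formula (which, per the first point, is not expected to be appropriate for weighted cells), the d.o.f. is the unweighted (r - 1) * (k - 1), and the maximum chi^2 uses the unverified guess (sum_i w_i) * min(r-1, k-1).

```python
def pearson_chi2(table):
    """Ordinary Pearson chi^2 of a 2-D table; returns (chi2, table total)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, total


def phik_ingredients(table):
    """Collect the quantities discussed above for a (weighted) table."""
    r, k = len(table), len(table[0])
    chi2, total = pearson_chi2(table)
    dof = (r - 1) * (k - 1)  # unweighted value; may differ for weighted data
    return {
        "chi2": chi2,
        "dof": dof,
        # Below this threshold phik is set to 0 (statistical noise).
        "above_noise": chi2 > dof,
        # ASSUMPTION: candidate maximum, with sum of weights replacing N.
        "chi2_max_guess": total * min(r - 1, k - 1),
    }


# Hypothetical weighted tables: fully dependent and fully independent.
dependent = phik_ingredients([[2.5, 0.0], [0.0, 1.5]])
independent = phik_ingredients([[1.0, 2.0], [2.0, 4.0]])
```

For the fully dependent table, the Pearson chi^2 equals the guessed maximum, which is what one would want for phik=1. That only shows the guess is consistent with this particular chi^2 formula; it says nothing about how cell fluctuations behave under weights, which is the part that still needs study.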

I'm quite sure the phik calculation can be made to work, but it needs a bit of effort/study to get it right (e.g. deriving the right max chi^2 formula). If you're interested in this, then I'm happy to pick it up together. Let me know!

Statgnome commented 3 years ago

It would be great to work on this, but at the moment I may have too much going on. I'll try to get back to you about it if I become more available. I'm a big fan of the work you've already done, and we are using phik now for unweighted analyses to inform model selection. Thanks!

mbaak commented 3 years ago

Glad to read that phik is useful for you. If you have time/interest later on then don't hesitate to reach out, and let's have a go at it.

KaveIO / PhiK

Using Weighted Data with Phik #26