akelleh / causality

Tools for causal analysis
MIT License

Improve Test Coverage For Independence Testing #45

Closed frmsaul closed 6 years ago

frmsaul commented 6 years ago

Hi @akelleh,

I wrote some tests to improve the coverage for the independence testing module. With better coverage, I can make sure that the behavior doesn't change when we transition from pymc2 to pymc3. In the process of testing, I found a couple of bugs in the MixedChiSquaredTest function and added the fixes to this pull request. Do these look good? If not, what other tests should I write?

Wanted to get your feedback before I write tests for the MixedMutualInformationTest function.

Saul

frmsaul commented 6 years ago

BTW, I'm sorry about the weird number of commits in this PR. I will fix it before merging with the main branch.

frmsaul commented 6 years ago

Hi @akelleh, can you please take a look? Do these look good? If not, what other tests should I write?

frmsaul commented 6 years ago

Hi @akelleh! I squashed the change list, but have no permission to merge it with master. Please merge it when you get a chance.

PS: Do you know where I can find some reading material about MixedChiSquaredTest and the theory behind it?

akelleh commented 6 years ago

I made up the Mixed Chi Squared Test. The basic idea is that discretization loses information, but we don't know how much. We'd like to have a critical value for the chi2 test based on the same distribution except (1) discretized, and (2) where the variables are independent. To do that, I fit the joint distribution, then take the product of conditional marginals to get the independent distribution. Next, I sample from that distribution, and then discretize the sample using the same discretization procedure as with the real data. I do the process repeatedly to get a sampling distribution for the chi2 statistic, then run a chi2 test on the discretized data, comparing it to the new critical value (based on the quantiles of the sampling distribution). Make sense?
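The procedure described above can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: it handles only the unconditional case (two variables, no conditioning set), so the "independent distribution" is just the product of the two marginals. It also stands in a Gaussian kernel density estimate with Scott's bandwidth for the fitted distribution, and uses quantile bins as the discretization. The names `mixed_chi2_sketch`, `sample_kde`, and `chi2_stat` are hypothetical.

```python
import numpy as np


def chi2_stat(x_bins, y_bins, n_levels):
    """Pearson chi-squared statistic for two integer-coded variables."""
    table = np.zeros((n_levels, n_levels))
    np.add.at(table, (x_bins, y_bins), 1)  # build the contingency table
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0  # skip cells belonging to empty rows or columns
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())


def sample_kde(data, size, rng):
    """Draw from a 1-D Gaussian KDE fitted to `data` (Scott's bandwidth).

    Sampling from a Gaussian KDE is the same as picking data points with
    replacement and adding Gaussian noise with the kernel bandwidth.
    """
    bandwidth = data.std(ddof=1) * len(data) ** (-1 / 5)
    return rng.choice(data, size=size) + rng.normal(0.0, bandwidth, size=size)


def mixed_chi2_sketch(x, y, n_bins=4, n_sim=200, alpha=0.05, seed=0):
    """Compare the observed chi2 statistic to a simulated critical value."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    # Discretization procedure: quantile bin edges taken from the real data.
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    x_edges, y_edges = np.quantile(x, qs), np.quantile(y, qs)

    observed = chi2_stat(np.digitize(x, x_edges), np.digitize(y, y_edges), n_bins)

    # Sampling distribution of the statistic under independence:
    # sample each marginal separately, then discretize with the same edges.
    null_stats = np.empty(n_sim)
    for i in range(n_sim):
        xs = sample_kde(x, len(x), rng)
        ys = sample_kde(y, len(y), rng)
        null_stats[i] = chi2_stat(
            np.digitize(xs, x_edges), np.digitize(ys, y_edges), n_bins
        )

    critical = float(np.quantile(null_stats, 1 - alpha))
    return observed, critical, bool(observed > critical)
```

Calling `mixed_chi2_sketch(x, y)` returns the observed statistic, the simulated critical value, and whether independence is rejected at level `alpha`. The conditional version described in the comment would replace the two marginal samplers with samplers for the conditional marginals given the conditioning variables.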

akelleh commented 6 years ago

There are probably more efficient procedures, but that's the one I came up with at the time. In particular, I bet there's a clever way to get conditional marginals without having to fit the distribution, e.g. by restricting down the data and bootstrapping samples. There's a lot of stats work to do here.
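One possible reading of the "restrict down the data and bootstrap" idea is sketched below, purely as an assumption about what such a procedure could look like. To test X against Y given Z, the data is restricted to strata of Z, and within each stratum X and Y are resampled independently of each other. That keeps the empirical conditional marginals while breaking any X-Y dependence, with no density fit. The function names, the quantile-based strata, and the summed per-stratum statistic are all illustrative choices, and coarse strata leave some residual dependence on Z that this sketch ignores.

```python
import numpy as np


def chi2_stat(x_bins, y_bins, n_levels):
    """Pearson chi-squared statistic for two integer-coded variables."""
    table = np.zeros((n_levels, n_levels))
    np.add.at(table, (x_bins, y_bins), 1)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0  # skip cells belonging to empty rows or columns
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())


def stratified_bootstrap_sketch(x, y, z, n_bins=3, n_strata=4,
                                n_sim=200, alpha=0.05, seed=0):
    """Bootstrap a critical value for a chi2 test of X vs. Y within Z strata."""
    rng = np.random.default_rng(seed)
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))

    # Discretize X and Y with quantile bins. Resampling the binned values is
    # equivalent to resampling raw values and binning with the same edges.
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    x_bins = np.digitize(x, np.quantile(x, qs))
    y_bins = np.digitize(y, np.quantile(y, qs))

    # "Restrict down" the data: group row indices by quantile stratum of Z.
    z_qs = np.linspace(0, 1, n_strata + 1)[1:-1]
    strata = np.digitize(z, np.quantile(z, z_qs))
    groups = [np.flatnonzero(strata == s) for s in range(n_strata)]
    groups = [g for g in groups if len(g) > 0]

    def total_stat(xb, yb):
        # Sum the per-stratum chi2 statistics.
        return sum(chi2_stat(xb[g], yb[g], n_bins) for g in groups)

    observed = total_stat(x_bins, y_bins)

    null_stats = np.empty(n_sim)
    for i in range(n_sim):
        xb, yb = x_bins.copy(), y_bins.copy()
        for g in groups:
            # Independent resampling of X and Y inside the stratum.
            xb[g] = x_bins[rng.choice(g, size=len(g))]
            yb[g] = y_bins[rng.choice(g, size=len(g))]
        null_stats[i] = total_stat(xb, yb)

    critical = float(np.quantile(null_stats, 1 - alpha))
    return observed, critical, bool(observed > critical)
```

The return value has the same shape as before: observed statistic, bootstrapped critical value, and the rejection decision at level `alpha`.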