FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.
https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
MIT License

load_dataset in cdt.data is "random" #25

Closed ArnoVel closed 5 years ago

ArnoVel commented 5 years ago

Hello. I am currently working with multiple pairwise algorithms and ran into the following problem. Whenever I load the TCEP data as follows:

data, labels = load_dataset('tuebingen')

the apparent order of the pairs is different. I ran into this because, to test threshold-dependent algorithms (such as ANM and GNN), I had a single Jupyter notebook in which I loaded one instance of data, labels, then thresholded different pre-recorded scores and computed metrics on the resulting predictions.

But each time I called data, labels = load_dataset('tuebingen') and then compute_metrics(preds, labels), the metrics would all change! I was quite worried when the accuracy of the cdt implementations of RCC, ANM and IGCI was as low as 40% on TCEP...

Thank you for this wonderful work by the way!

diviyank commented 5 years ago

Hi, thank you for the kind words!

Oh, there is an error in the default options, my bad. Actually, there is a hidden argument, shuffle, whose default is wrongly set to random.

Please try: load_dataset('tuebingen', shuffle=False)
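For readers who want to see why a randomized loader breaks metrics computed on pre-recorded predictions, here is a minimal toy sketch. It does not use cdt: toy_load, its made-up labels and its label-flipping behaviour are all hypothetical stand-ins, since this thread does not state exactly how the shuffle option alters the data. It only illustrates that predictions recorded against one load stop matching labels that change from call to call.

```python
import random


def toy_load(shuffle=False, rng=None):
    """Hypothetical stand-in for a dataset loader; NOT cdt's implementation."""
    labels = [1, -1, 1, 1, -1, 1, -1, 1]  # made-up causal directions
    if shuffle:
        rng = rng or random.Random()
        # Assumption for illustration only: randomly flip each label.
        labels = [-y if rng.random() < 0.5 else y for y in labels]
    return labels


def accuracy(preds, labels):
    """Fraction of predictions that match the labels position by position."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Predictions recorded once, against one particular (deterministic) load.
preds = list(toy_load(shuffle=False))

# Re-loading deterministically keeps the metric stable across calls.
stable = [accuracy(preds, toy_load(shuffle=False)) for _ in range(3)]

# Re-loading with randomization makes the metric depend on the call.
unstable = [
    accuracy(preds, toy_load(shuffle=True, rng=random.Random(seed)))
    for seed in range(3)
]

print("stable:", stable)
print("unstable:", unstable)
```

With the real package, the equivalent precaution is to call load_dataset('tuebingen', shuffle=False) once and reuse the same data, labels for every metric computed on stored scores.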

I will fix that in the next version. Thanks a lot for the feedback. Best, Diviyan

diviyank commented 5 years ago

Hi again,

This bug should be fixed in 0.5.3.

Best regards, Diviyan

diviyank commented 5 years ago

I will be closing this issue, as it should be solved. Don't hesitate to reopen it if the bug still persists in the latest version. Best, Diviyan