load_dataset in cdt.data is "random"

ArnoVel commented 5 years ago

Hello. I am currently working on several pairwise algorithms and ran into the following problem. Whenever I load the TCEP data with

data, labels = load_dataset('tuebingen')

the order of the pairs appears to be different each time. I noticed this because, to test threshold-dependent algorithms (such as ANM and GNN), I had a single Jupyter notebook in which I loaded one instance of data, labels, then thresholded the pre-recorded scores of the different algorithms and computed metrics on the resulting predictions.

But each time I called data, labels = load_dataset('tuebingen') and then compute_metrics(preds, labels), the results would all change! I was quite worried when the accuracy of the cdt implementations of RCC, ANM and IGCI came out as low as 40% on TCEP...
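To illustrate why a changing pair order produces such numbers, here is a minimal sketch that does not use cdt at all. The labels, the "perfect" predictor and the accuracy helper are all hypothetical stand-ins; the second load is simulated by shuffling a copy of the labels.

```python
import random


def accuracy(preds, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


rng = random.Random(0)

# Hypothetical ground truth for 100 pairs: +1 (X -> Y) or -1 (Y -> X).
labels = [rng.choice([1, -1]) for _ in range(100)]

# A hypothetical perfect predictor, recorded against the original order.
preds = list(labels)

# Simulate a second load that returns the same labels in another order.
reloaded = list(labels)
rng.shuffle(reloaded)

aligned_acc = accuracy(preds, labels)       # same order on both sides
misaligned_acc = accuracy(preds, reloaded)  # order no longer matches
print(aligned_acc, misaligned_acc)
```

Even a perfect predictor scores close to chance level once the labels are reordered independently of the predictions, which would be consistent with the low accuracies reported above.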

Thank you for this wonderful work by the way!

diviyank commented 5 years ago

Hi, thank you for the kind words!

Oh, there is an error in the default options, my bad. There is actually a hidden argument, shuffle, that is wrongly set to random by default.

Please try: load_dataset('tuebingen', shuffle=False)
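A possible way to apply this in a notebook is to load the labels once, with shuffle=False, and score every algorithm against that same object. This is only a sketch: load_tuebingen_fixed_order and evaluate_all are hypothetical helper names, and metric stands for whatever scoring function is in use (compute_metrics in the report above).

```python
def load_tuebingen_fixed_order():
    """Load the Tuebingen pairs once, in a fixed order.

    The import lives inside the function so the helper can be defined
    even in an environment where cdt is not installed yet.
    """
    from cdt.data import load_dataset  # requires the cdt package

    # shuffle=False is the argument suggested in this thread.
    return load_dataset('tuebingen', shuffle=False)


def evaluate_all(preds_by_algo, labels, metric):
    """Apply one metric to every algorithm against the same labels object."""
    return {name: metric(preds, labels)
            for name, preds in preds_by_algo.items()}
```

Usage would look like data, labels = load_tuebingen_fixed_order() followed by evaluate_all({'ANM': anm_preds, 'IGCI': igci_preds}, labels, compute_metrics), where anm_preds and igci_preds are placeholders for predictions computed in the same pair order.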

I will fix that in the next version. Thanks a lot for the feedback. Best, Diviyan

diviyank commented 5 years ago

Hi again,

This bug should be fixed in 0.5.3.

Best regards, Diviyan

diviyank commented 5 years ago

I will be closing this issue, as it should be solved. Don't hesitate to reopen it if the bug still persists in the latest version. Best, Diviyan

FenTechSolutions / CausalDiscoveryToolbox, issue #25