FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.
https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
MIT License

TCEP dataset incoherent with 'official' version? #52

Open ArnoVel opened 4 years ago

ArnoVel commented 4 years ago

Hi, after I opened the issue about the labels all being set to 1, I checked the TCEP reference website to identify pairs that had been permuted, and so on.

I stumbled across something strange: in the original dataset, some of the variables are multivariate. This can be seen, for example, in pair 54 or pair 71.

However, in the current version of cdt, when checking these two pairs, one finds 1D variables.

> data.iloc[53]
A    [43.51, 41.33, 36.78, -8.82, 34.61, 40.11, 12....
B    [42.0, 75.0, 69.0, 42.0, 76.0, 72.0, 77.0, 81....
Name: pair54, dtype: object
> data.iloc[53]['A']
array([ 43.51,  41.33,  36.78,  -8.82,  34.61,  40.11,  12.52, -35.18,
        48.12,  40.24,  25.4 ,  26.19,  23.71,  13.09,  53.97,  50.83,
        17.25,   6.48,  27.44, -16.3 ,  43.86, -24.66, -15.78,   4.94,
        42.71,  12.35,  -3.38,  11.54,   3.86,  45.42,   4.36,  12.11,
        49.42, -33.45,  31.14, -11.7 ,  -4.25,  -4.33,   9.92,   5.33,
        45.8 ,  23.  ,  35.17,  50.08,  55.68,  11.58,  18.48,  -0.23,
        30.06,  13.7 ,   3.75,  15.33,  59.43,   9.  ,   6.92, -18.14,
        60.17,  48.86,   4.93, -17.54,   0.39,  13.44,  41.7 ,  52.52,
         5.54,  37.97,  12.05,  16.  ,  13.47,  14.62,   9.54,  11.86,
         6.8 ,  18.54,  14.08,  22.3 ,  47.5 ,  64.14,  28.63,  -6.19,
        35.71,  33.32,  53.34,  31.79,  41.9 ,  18.  ,  35.68,  31.94,
        51.18,  -1.28,  39.02,  37.51,  29.37,  42.87,  17.97,  56.95,
        33.89, -29.3 ,   6.31,  32.88,  54.69,  49.61,  22.18,  42.  ,
       -18.92, -13.99,   3.15,   4.17,  12.65,  35.9 ,  14.6 ,  18.07,
       -20.16,  19.42,  47.91,  42.46,  33.99, -25.97,  19.74, -22.57,
        27.71,  52.37,  12.1 , -22.28, -41.29,  12.15,  13.52,   9.06,
        59.91,  23.61,  33.68,  31.88,   8.99,  -9.47, -25.3 , -12.09,
        14.58,  52.22,  38.71,  18.45,  25.29,  47.01, -20.87,  44.45,
        55.76,  27.15,  14.  , -13.83,   0.34,  24.67,  14.7 ,  44.8 ,
         8.47,   1.29,  48.21,  46.05,  -9.43,   2.04, -25.75,  40.42,
         6.92,  13.2 ,  15.63,   5.82, -26.32,  59.33,  46.95,  33.52,
        38.57,  -6.17,  13.76,  -8.57,   6.12, -21.14,  10.66,  36.81,
        39.94,  37.95,   0.31,  50.44,  24.48,  51.5 ,  38.89,  18.34,
       -34.89,  41.32, -17.74,  10.5 ,  21.03,  15.36, -15.41, -17.82])

The same can be seen for pair 71. Is this a mistake, or just a shuffling of the data? I made sure to set shuffle=False before checking the two pairs. If the basic (non-shuffled) dataset is already shuffled, or has been pre-processed in some way to reduce dimensionality, could we have some explanation of how the two datasets relate to each other?
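A quick way to see which pairs hold multivariate variables is to inspect the shape of each array. The following is a minimal sketch, not part of cdt: the helper names `n_components` and `multivariate_pairs` are hypothetical, and it assumes each variable is stored either as a 1D array of samples or as a 2D array of shape (samples, components). The toy data stands in for the real pairs.

```python
import numpy as np


def n_components(values):
    """Return 1 for a 1D array of samples, or k for a 2D (samples, k) array."""
    arr = np.asarray(values)
    return 1 if arr.ndim <= 1 else arr.shape[1]


def multivariate_pairs(pairs):
    """Return the names of pairs where A or B has more than one component.

    `pairs` is an iterable of (name, A, B) triples (a hypothetical layout).
    """
    return [
        name
        for name, a, b in pairs
        if n_components(a) > 1 or n_components(b) > 1
    ]


# Toy stand-in data: pair1 is univariate, pair2 has a 2-component A.
toy_pairs = [
    ("pair1", np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])),
    (
        "pair2",
        np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
        np.array([0.1, 0.2, 0.3]),
    ),
]

flagged = multivariate_pairs(toy_pairs)
print(flagged)
```

For a DataFrame with one array per cell, as in the output above, the triples could presumably be built with something like `((name, row["A"], row["B"]) for name, row in data.iterrows())`.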

Any amount of information would help, Thanks

diviyank commented 4 years ago

Hi, this is concerning. I will look into how I obtained this version of TCEP and get back to you.

Best, Diviyan

ArnoVel commented 4 years ago

Hi, Would it be possible to have some kind of update? Thanks!

ArnoVel commented 4 years ago

Hi, I checked the official website in detail. It is highly likely that the current CDT TCEP version is simply the current TCEP (with 108 pairs) with the multivariate ones removed. This is plausible because it leaves 99 pairs, which is the current length of the CDT TCEP. However, I did not take the time to check whether the two match. Regards, A.V
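The hypothesis above could be checked along the following lines. This is a rough sketch under stated assumptions, not the procedure used by either participant: the helper names are hypothetical, both datasets are represented as lists of (A, B) array tuples, and the relative order of the remaining pairs is assumed to be preserved, which the thread does not confirm. Toy arrays stand in for the real data.

```python
import numpy as np


def is_univariate(values):
    """True for a 1D array, or a 2D array with a single column."""
    arr = np.asarray(values)
    return arr.ndim == 1 or (arr.ndim == 2 and arr.shape[1] == 1)


def same_values(x, y):
    """Compare two variables after flattening, with a floating-point tolerance."""
    x = np.ravel(np.asarray(x, dtype=float))
    y = np.ravel(np.asarray(y, dtype=float))
    return x.shape == y.shape and np.allclose(x, y)


def filtered_mismatches(official, packaged):
    """Drop multivariate pairs from `official`, then compare with `packaged`.

    Returns the positions (in the filtered list) where the pairs differ.
    Raises ValueError if the two lists do not have the same length.
    """
    kept = [(a, b) for a, b in official if is_univariate(a) and is_univariate(b)]
    if len(kept) != len(packaged):
        raise ValueError(f"{len(kept)} filtered pairs vs {len(packaged)} packaged pairs")
    return [
        i
        for i, ((a, b), (pa, pb)) in enumerate(zip(kept, packaged))
        if not (same_values(a, pa) and same_values(b, pb))
    ]


# Toy stand-in data: the second "official" pair is multivariate.
official = [
    (np.array([1.0, 2.0]), np.array([3.0, 4.0])),
    (np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([5.0, 6.0])),
    (np.array([7.0, 8.0]), np.array([9.0, 10.0])),
]
packaged = [
    (np.array([1.0, 2.0]), np.array([3.0, 4.0])),
    (np.array([7.0, 8.0]), np.array([9.0, 10.0])),
]

mismatches = filtered_mismatches(official, packaged)
print(mismatches)
```

An empty list means every remaining pair matches. If the order were not preserved, a position-by-position comparison like this would report false mismatches, and the pairs would need to be matched by content instead.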

diviyank commented 4 years ago

Hi, sorry for the delay, I have been quite busy lately. Thanks for looking into it; that indeed seems to be the case. I just checked all the pairs and they match. I will add another dataset containing all the pairs (including the multivariate ones) because most of the algorithms do not support multivariate variables. Best regards, Diviyan