FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.
MIT License

CGNN results question #63

Closed sAviOr287 closed 4 years ago

sAviOr287 commented 4 years ago


I have rerun the CGNN pairwise experiments.

I can confirm that I get the same AUPR results on the Multi, Gauss, Net, and Tueb datasets (using an ensemble of 12 runs):
AUPR: 0.95 MULTI
AUPR: 0.80 GAUSS
AUPR: 0.90 NET

However, when I look at the accuracy, i.e. predicting the actual direction, I get 0.43, 0.46, and 0.49 respectively.

I compute the accuracy from the scores as follows:

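from numpy import genfromtxt
from cdt.data import load_dataset
from cdt.metrics import precision_recall

for dataset_name in ['multi', 'gauss', 'net']:
    data, labels = load_dataset(dataset_name)
    # Column 0 holds the SampleID, column 1 the predicted score; skip the header row.
    res = genfromtxt('results/res2_{}.csv'.format(dataset_name), delimiter=',', skip_header=1)
    idx = 0
    acc = 0
    labels = labels.to_numpy()
    for data_ in res[:, 1]:
        # A prediction is correct when its sign matches the label.
        if data_ < 0 and labels[idx] == -1:
            acc += 1
        elif data_ > 0 and labels[idx] == 1:
            acc += 1
        idx += 1  # EDIT: move to the next label on every iteration

    # The header is already skipped, so every row of res is a prediction.
    acc /= res.shape[0]
    print('{} ACC : {}'.format(dataset_name, acc))
    aupr, curve = precision_recall(labels[:res.shape[0]], res[:, 1])
    print('AUPR: {}'.format(aupr))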

This method also gives me around 74% (unweighted) on the Tueb dataset.

So my question is: is this expected, should I be computing the accuracy differently, or does the accuracy perhaps not matter at all?

Thanks for the clarification in advance.


diviyank commented 4 years ago

Hello, the results look good, so the accuracy should follow. Could you attach a sample of data_? I think the predictions are not in the expected shape.
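
One way to check this, sketched with made-up data: join the predictions and the targets on their sample ID before comparing them, so that a missing or reordered row shows up immediately. The column names SampleID, Predictions, and Target are assumptions about the files, and all values below are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the prediction and target files (values are made up).
preds = pd.DataFrame({"SampleID": ["pair3", "pair1", "pair2"],
                      "Predictions": [0.7, -0.2, 0.4]})
labels = pd.DataFrame({"SampleID": ["pair1", "pair2", "pair3"],
                       "Target": [-1, 1, 1]})

# An inner join keeps only the IDs present in both tables and pairs each
# prediction with the label of the same sample, whatever the row order.
aligned = labels.merge(preds, on="SampleID", how="inner", validate="one_to_one")

print(aligned.shape)                  # one row per matched pair
print(len(aligned) == len(labels))    # False would mean missing predictions
```

If the shapes differ or the join drops rows, the accuracy loop is comparing scores against the wrong labels.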


sAviOr287 commented 4 years ago

res2_gauss.csv.zip

Hi,

I have added the csv file that comes out after training the model.

Thanks for your help!


sAviOr287 commented 4 years ago

Here is also how I loaded the data. I added this to cdt/data/loader.py:

def load_ce_gauss(shuffle=False):
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Gauss_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Gauss_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels

def load_ce_multi(shuffle=False):
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Multi_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Multi_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels

def load_ce_net(shuffle=False):
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Net_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Net_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels
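
A note on the shuffle branch in these loaders: it writes -1 for every swapped pair, which is only correct if every target starts at 1. Below is a simplified sketch of a swap that negates the label instead. It works on a plain list of (x, y) arrays rather than the DataFrame returned by read_causal_pairs, and shuffle_pairs is a hypothetical helper, not part of cdt; the toy data is made up.

```python
import numpy as np

def shuffle_pairs(pairs, labels, seed=0):
    """Randomly swap (x, y) -> (y, x) and negate the matching label."""
    rng = np.random.default_rng(seed)
    swap = rng.random(len(pairs)) < 0.5          # one coin flip per pair
    new_pairs = [(y, x) if s else (x, y) for (x, y), s in zip(pairs, swap)]
    labels = np.asarray(labels)
    new_labels = np.where(swap, -labels, labels)  # flipped pairs get the opposite label
    return new_pairs, new_labels, swap

# Toy data: four pairs with mixed directions (invented for illustration).
pairs = [(np.arange(3) + i, np.arange(3) * i) for i in range(4)]
labels = np.array([1, -1, 1, 1])
new_pairs, new_labels, swap = shuffle_pairs(pairs, labels)
```

Negating rather than overwriting keeps the labels consistent even when the targets file already contains both directions.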

diviyank commented 4 years ago

Hello, whoops, I forgot to ask whether you have the labels as well?

sAviOr287 commented 4 years ago

Oh yes, I have them. I downloaded them from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3757KX

Thanks for the reply


diviyank commented 4 years ago

Thanks for getting back to me quickly,

There seems to be an issue with your accuracy computation; I got an accuracy of 0.72 on this dataset:

import pandas as pd
import numpy as np
from sklearn.metrics import average_precision_score, accuracy_score

preds = pd.read_csv('res2_gauss.csv')
labels = pd.read_csv('CE-Gauss_targets.csv')

print(labels.shape, preds.shape)
print(labels.columns, preds.columns)

# Returns :(300, 2) (300, 2)
# Returns : Index(['SampleID', 'Target'], dtype='object') Index(['SampleID', 'Predictions'], dtype='object')

average_precision_score(labels.Target, preds.Predictions)  # Equivalent to AUPR

# Returns :0.8027886920926466

# Threshold the scores into hard direction predictions
preds.loc[preds.Predictions > 0, 'Predictions'] = 1
preds.loc[preds.Predictions < 0, 'Predictions'] = -1

accuracy_score(labels.Target, preds.Predictions)

# Returns :  0.7233333333333334

From my point of view, however, accuracy might not be the best metric for evaluating causal algorithms: the confidence of an algorithm has to be taken into account, which gives it the possibility of not committing to a prediction when it is uncertain (not answering is better than giving a wrong causal direction).
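
To make that concrete, here is a minimal sketch of accuracy computed only on the pairs where the score is confident enough, reported alongside the fraction of pairs answered. The scores, targets, and thresholds are invented for illustration, and accuracy_with_abstention is a hypothetical helper, not a cdt function.

```python
import numpy as np

def accuracy_with_abstention(scores, targets, threshold):
    """Accuracy on pairs with |score| >= threshold, plus the decision rate."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets)
    decided = np.abs(scores) >= threshold      # pairs the model commits to
    if not decided.any():
        return float("nan"), 0.0               # nothing answered at all
    correct = np.sign(scores[decided]) == targets[decided]
    return float(correct.mean()), float(decided.mean())

# Made-up scores and targets; the sign of a score is the predicted direction.
scores = np.array([0.9, -0.8, 0.1, -0.05, 0.6, -0.2])
targets = np.array([1, -1, -1, 1, 1, -1])

print(accuracy_with_abstention(scores, targets, threshold=0.0))
print(accuracy_with_abstention(scores, targets, threshold=0.3))
```

Raising the threshold trades coverage for accuracy, which a single accuracy figure at threshold 0 hides.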

Best regards, Diviyan

sAviOr287 commented 4 years ago

Thanks a lot. Sorry, I was an idiot... I forgot to increment the idx variable.
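
For anyone landing here with the same symptom, a small sketch on synthetic data of what that bug does: with idx stuck at 0, every score is compared against the first label only, so the result no longer reflects how good the predictions are. The data is randomly generated (not the CGNN output), and the accuracy function is only a stand-in for the loop in my first post.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=300)
# Synthetic scores whose sign agrees with the label about 75% of the time.
agree = rng.random(300) < 0.75
scores = np.where(agree, labels, -labels) * rng.uniform(0.1, 1.0, size=300)

def accuracy(scores, labels, increment=True):
    idx, hits = 0, 0
    for s in scores:
        if (s < 0 and labels[idx] == -1) or (s > 0 and labels[idx] == 1):
            hits += 1
        if increment:
            idx += 1            # the line that was missing
    return hits / len(scores)

print("stuck index:", accuracy(scores, labels, increment=False))
print("fixed loop :", accuracy(scores, labels))
print("vectorized :", np.mean(np.sign(scores) == labels))
```

The fixed loop and the vectorized one-liner agree, while the stuck-index version only measures how many scores share the sign of the first label.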

Thanks for your help

Sorry for the inconvenience

diviyank commented 4 years ago

No issues, glad I could help you! I'll be closing this issue. Have a good day!