FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise setting. Tools for graph structure recovery and dependency analysis are included.
https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
MIT License

CGNN results question #63

Closed: sAviOr287 closed 4 years ago

sAviOr287 commented 4 years ago

Hi,

So I have tried to rerun the CGNN pairwise experiments.

I can confirm that I get the same results for the Multi, Gauss, Net, and Tueb datasets in terms of AUPR (ensembling 12 different runs): Multi: 0.95, Gauss: 0.80, Net: 0.90.

However, when I look at the accuracy, i.e. predicting the actual direction, I get 0.43, 0.46, and 0.49 respectively.

I compute the accuracy as follows:

from numpy import genfromtxt
from cdt.data import load_dataset  # patched with the CE-* loaders shown further down
from cdt.metrics import precision_recall

for dataset_name in ['multi', 'gauss', 'net']:
    data, labels = load_dataset(dataset_name)
    res = genfromtxt('results/res2_{}.csv'.format(dataset_name), delimiter=',', skip_header=1)
    labels = labels.to_numpy()
    idx = 0
    acc = 0
    for data_ in res[:, 1]:
        if data_ < 0 and labels[idx] == -1:
            acc += 1
        elif data_ > 0 and labels[idx] == 1:
            acc += 1
        idx += 1  # EDIT: this increment was missing from the first two branches

    acc /= res.shape[0]  # res contains no header row thanks to skip_header
    print(res.shape[0])
    print('{} ACC: {}'.format(dataset_name, acc))
    aupr, curve = precision_recall(labels[:res.shape[0]], res[:, 1])
    print('AUPR: {}'.format(aupr))
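
For reference, the same number can also be computed without the manual index bookkeeping (a sketch, assuming the labels are +/-1 and that column 1 of res holds the signed scores):

import numpy as np

acc_vec = np.mean(np.sign(res[:, 1]) == labels[:res.shape[0]].ravel())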

This method also gives me around 74% unweighted accuracy on the Tueb dataset.

So my question is: is this expected, should I be computing the accuracy differently, or does the accuracy perhaps not matter at all?

Thanks for the clarification in advance.

Best

diviyank commented 4 years ago

Hello, the results look good, so the accuracy should follow. Could you attach a sample of data_? I think the predictions are not in the expected shape.
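
For instance, a quick check along these lines (assuming the same file layout as in your snippet) should show an (n_pairs, 2) array with the signed scores in column 1:

import numpy as np

res = np.genfromtxt('results/res2_gauss.csv', delimiter=',', skip_header=1)
print(res.shape)   # expected: (n_pairs, 2) -> SampleID, signed score
print(res[:5, 1])  # a few raw scores; the sign encodes the predicted direction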

Best,

sAviOr287 commented 4 years ago

Hi,

I have added the csv file that comes out after training the model: res2_gauss.csv.zip

Thanks for your help!

Best

sAviOr287 commented 4 years ago

Here is also how I loaded the data. I added the following to cdt/data/loader.py:

def load_ce_gauss(shuffle=False):
    """Load the CE-Gauss pairs and their direction labels (1 or -1)."""
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Gauss_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Gauss_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        # Randomly reverse about half of the pairs: swapping the two columns
        # flips the causal direction, so the corresponding label is set to -1.
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels

def load_ce_multi(shuffle=False):
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Multi_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Multi_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels

def load_ce_net(shuffle=False):
    dirname = os.path.dirname(os.path.realpath(__file__))

    data = read_causal_pairs('{}/resources/CE-Net_pairs.csv'.format(dirname), scale=False)
    labels = pd.read_csv('{}/resources/CE-Net_targets.csv'.format(dirname)).set_index('SampleID')

    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -1
                buffer = data.iloc[i, 0]
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels
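
Side note: the three loaders differ only in the dataset name, so they could be collapsed into a single helper. A sketch, assuming it lives in the same cdt/data/loader.py (so os, random, pd, and read_causal_pairs are already imported there); load_ce_dataset is a hypothetical name, and it flips the existing label rather than hard-coding -1:

def load_ce_dataset(name, shuffle=False):
    """Load one of the CE-* pairwise datasets ('Gauss', 'Multi', 'Net')."""
    dirname = os.path.dirname(os.path.realpath(__file__))
    data = read_causal_pairs('{}/resources/CE-{}_pairs.csv'.format(dirname, name), scale=False)
    labels = pd.read_csv('{}/resources/CE-{}_targets.csv'.format(dirname, name)).set_index('SampleID')
    if shuffle:
        for i in range(len(data)):
            if random.choice([True, False]):
                labels.iloc[i, 0] = -labels.iloc[i, 0]  # flip the direction label
                buffer = data.iloc[i, 0]                # swap cause and effect columns
                data.iloc[i, 0] = data.iloc[i, 1]
                data.iloc[i, 1] = buffer
    return data, labels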

diviyank commented 4 years ago

Hello, whoops, I forgot to ask: do you have the labels as well?

sAviOr287 commented 4 years ago

Oh yeah, I have them:

Archive.zip

which I downloaded from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3757KX

Thanks for the reply

Best

diviyank commented 4 years ago

Thanks for getting back to me quickly,

There seems to be an issue with your accuracy computation; I got an accuracy of 0.72 on this dataset:

import pandas as pd
import numpy as np
from sklearn.metrics import average_precision_score, accuracy_score

preds = pd.read_csv('res2_gauss.csv')
labels = pd.read_csv('CE-Gauss_targets.csv')

print(labels.shape, preds.shape)
print(labels.columns, preds.columns)

# Returns: (300, 2) (300, 2)
# Returns: Index(['SampleID', 'Target'], dtype='object') Index(['SampleID', 'Predictions'], dtype='object')

average_precision_score(labels.Target, preds.Predictions)  # equals the AUPR

# Returns: 0.8027886920926466

# Threshold the signed scores into hard -1/+1 decisions
preds.loc[preds.Predictions > 0, 'Predictions'] = 1
preds.loc[preds.Predictions < 0, 'Predictions'] = -1
accuracy_score(labels.Target, preds.Predictions)

# Returns: 0.7233333333333334

From my point of view, however, accuracy might not be the best metric for evaluating causal algorithms: the confidence of an algorithm has to be taken into account, giving it the possibility of not committing to a prediction when it is not certain (not answering is better than giving a wrong causal direction). This is what AUPR captures, since it ranks predictions by their confidence scores.
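
To make that concrete, here is a sketch of a confidence-thresholded accuracy (not part of the toolbox; file names taken from above, and the 50% cutoff is arbitrary), where the model only commits on the pairs it is most confident about:

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

preds = pd.read_csv('res2_gauss.csv')
labels = pd.read_csv('CE-Gauss_targets.csv')

# Keep only the 50% of pairs with the largest absolute score
conf = preds.Predictions.abs()
keep = conf >= conf.quantile(0.5)
acc = accuracy_score(labels.Target[keep], np.sign(preds.Predictions[keep]))
print('Accuracy at a 50% decision rate: {:.3f}'.format(acc))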

Best regards, Diviyan

sAviOr287 commented 4 years ago

Thanks a lot. Sorry, I was an idiot ... I forgot to increment the idx variable.

Thanks for your help

Sorry for the inconvenience

diviyank commented 4 years ago

No worries, glad I could help! I'll be closing this issue. Have a good day!