F-LIDM / CFA4DD

Low AUROC Value - BBB_Martins #4

Open drorhunvural opened 6 months ago

drorhunvural commented 6 months ago

Hi,

First of all, congratulations on the work you have done.

I am attempting to achieve an AUROC value of 0.920 for the BBB_Martins dataset, as reported in the paper.

When I run the code below, the highest value I get is 0.564. Could you point out any errors or tell me where my approach goes wrong?

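import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from tdc.benchmark_group import admet_group

import cfafunctions  # module from this repository; import path assumed

group = admet_group(path='data/')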

# Specify the dataset name
name = 'BBB_Martins'

# Load the BBB_Martins dataset
benchmark = group.get(name)
train_val, test = benchmark['train_val'], benchmark['test']
y_test = np.array(test.Y)

# Create dictionaries for y_train and y_valid for different seeds
y_train_dict = {}
y_valid_dict = {}

for seed in [1, 2, 3, 4, 5]:
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    y_train_dict[seed] = train.Y
    y_valid_dict[seed] = valid.Y

predictions_val_xgb = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]  # Replace with actual predictions
predictions_val_rf = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]   # Replace with actual predictions
predictions_val_svm = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]  # Replace with actual predictions

# Convert validation predictions for each model into a DataFrame
# Assuming predictions_val_xgb, predictions_val_rf, predictions_val_svm are lists of arrays for each seed
df_val_xgb = pd.DataFrame(predictions_val_xgb).transpose()  # Each column represents predictions for one seed
df_val_rf = pd.DataFrame(predictions_val_rf).transpose()
df_val_svm = pd.DataFrame(predictions_val_svm).transpose()

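# Convert test predictions for each model into a DataFrame
# Test predictions are single arrays since there's only one test set
# (predictions_test_xgb, predictions_test_rf, predictions_test_svm are not defined in this snippet)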
df_test_xgb = pd.DataFrame(predictions_test_xgb, columns=['test'])
df_test_rf = pd.DataFrame(predictions_test_rf, columns=['test'])
df_test_svm = pd.DataFrame(predictions_test_svm, columns=['test'])

# Now, create val_dfs_list and test_dfs_list with these DataFrames
val_dfs_list = [df_val_xgb, df_val_rf, df_val_svm]
test_dfs_list = [df_test_xgb, df_test_rf, df_test_svm]    

model_names = ['xgb', 'rf', 'svm']  # model names
preds = cfafunctions.model_predictions(
    len(model_names),
    model_names,
    val_dfs_list=val_dfs_list,
    test_dfs_list=test_dfs_list
)

# Accessing the second element of the preds tuple for test predictions
test_predictions_dict = preds[1]

xgb_test_predictions = test_predictions_dict['predictions_test_xgb']
rf_test_predictions = test_predictions_dict['predictions_test_rf']
svm_test_predictions = test_predictions_dict['predictions_test_svm']

xgb_test_prob_positive = np.array(xgb_test_predictions[0])
rf_test_prob_positive = np.array(rf_test_predictions[0])
svm_test_prob_positive = np.array(svm_test_predictions[0])

# Calculating AUC for XGBoost
auc_xgb = roc_auc_score(y_test, xgb_test_prob_positive)
print(f'AUC for XGBoost: {auc_xgb:.3f}')

# Calculating AUC for Random Forest
auc_rf = roc_auc_score(y_test, rf_test_prob_positive)
print(f'AUC for Random Forest: {auc_rf:.3f}')

# Calculating AUC for SVM
auc_svm = roc_auc_score(y_test, svm_test_prob_positive)
print(f'AUC for SVM: {auc_svm:.3f}')

nathan-jiang commented 6 months ago

"predictions_val_xgb = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]  # Replace with actual predictions
predictions_val_rf = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]  # Replace with actual predictions
predictions_val_svm = [np.random.rand(len(y_valid_dict[seed])) for seed in range(1, 6)]  # Replace with actual predictions"

Be sure to replace these placeholders with your actual predictions, loaded from your prediction CSV files. For example, to reproduce the 0.920 AUROC, you need to encode the original data with RDKit 2D descriptors, then train the models and generate predictions, so that your prediction results are ready as inputs for 'cfafunctions'.
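
To illustrate why the placeholders matter: uniformly random scores carry no information about the labels, so their AUROC is expected to sit near 0.5, which is consistent with the 0.564 reported above. The sketch below is a standalone illustration, not code from this repository. The `auroc` helper, the label vector, the sample size of 400, and the 70% positive rate are all hypothetical stand-ins chosen for the demonstration; they are not the BBB_Martins data.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity, with tied scores
    given their average rank. Hypothetical helper for illustration only."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Synthetic binary labels (assumed size and class balance, not real data).
label_rng = random.Random(0)
labels = [1 if label_rng.random() < 0.7 else 0 for _ in range(400)]

# Random "predictions", one set per seed, mirroring the placeholder lines.
random_aucs = []
for seed in range(1, 6):
    seed_rng = random.Random(seed)
    scores = [seed_rng.random() for _ in labels]
    random_aucs.append(auroc(labels, scores))

# Scores that actually depend on the label, standing in for a trained model.
informative_scores = [label + label_rng.gauss(0, 0.5) for label in labels]
informative_auc = auroc(labels, informative_scores)

print("AUROC with random scores per seed:", [round(a, 3) for a in random_aucs])
print("AUROC with informative scores:", round(informative_auc, 3))
```

The random-score AUROCs cluster around 0.5 regardless of seed, while the label-dependent scores score far higher. Only predictions from models actually trained on the encoded data can move the result toward the value reported in the paper.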