ATOMScience-org / AMPL

The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
MIT License

ValueError: y has more than n_class unique elements. #173

Closed: mmagithub closed this issue 2 years ago

mmagithub commented 2 years ago

Hi, I am trying to run AMPL on a sample dataset for classification. The NN classification model keeps giving me this weird "ValueError: y has more than n_class unique elements." even though the RF & XGBOOST classification models finish without issues.

Any clue for the possible reasons for this error?

Changing deepchem versions did not help.

Thanks, Marawan

stewarthe6 commented 2 years ago

Hi Marawan,

Does your dataset have more than two classes? We've been talking about adding more classes, but AMPL currently only supports binary classification.

Stewart
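(For context, the wording of the error suggests a one-hot encoding step with a hard-coded class count somewhere in the binary-only code path. The snippet below is a minimal standalone sketch of that failure mode, not AMPL's actual code:)

```python
import numpy as np

def to_one_hot(y, n_class=2):
    """Simplified sketch of a one-hot encoder with a hard-coded class count.

    Illustrates the kind of check that raises the reported error when a
    3- or 4-class label column hits a binary-only code path.
    """
    y = np.asarray(y, dtype=int)
    if len(np.unique(y)) > n_class:
        raise ValueError("y has more than n_class unique elements.")
    return np.eye(n_class)[y]

print(to_one_hot([0, 1, 1, 0]))    # fine: two classes
try:
    to_one_hot([0, 1, 2, 3])       # four classes vs. n_class=2
except ValueError as e:
    print("ValueError:", e)
```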

mmagithub commented 2 years ago

I see, yes, my dataset has more than two classes (3-4). But now I am confused about why XGBOOST and RF finished without errors, or do you mean it is only an issue in the NN classification? Indeed, I could see the number of classes hard-coded to 2 as a parameter.

Is there any workaround to force AMPL to work in multi-class classification mode?

paulsonak commented 2 years ago

Hi Marawan, This is something we recently discussed adding but haven't had time to do. We'd welcome a contribution from you if you are interested! I think we'd need at minimum:

- Add the class_number parameter to the DeepChem classifier call (model_pipeline.py?, model_wrapper.py).
- Update the metrics to work with multiclass classification and/or skip those that are binary-only (perf_data.py).

There might be more changes required, but those are the ones I can think of right now. Thanks! Amanda

paulsonak commented 2 years ago

@stewarthe6 and I dug a little deeper - I think you should be able to use class_number in your config.json file and get the NN models to run, but the metrics will still be inaccurate. Then, an error might be thrown when the functions attempt to compute the binary classification metrics on >2 classes. This part would require updating no matter what.
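(For illustration of why the binary metric calls would need updating: scikit-learn's own `roc_auc_score`, which the thread later notes is where most of AMPL's metrics come from, shows the same pattern. The binary-style call raises on three classes until an explicit multiclass averaging scheme is specified. A toy example with made-up probabilities:)

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy 3-class labels and invented predicted class probabilities
y_true = np.array([0, 1, 2, 1, 0, 2])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.5, 0.2],
                  [0.7, 0.2, 0.1],
                  [0.2, 0.2, 0.6]])

try:
    roc_auc_score(y_true, proba)        # binary-style call on 3 classes
except ValueError as e:
    print("binary-style call fails:", e)

# Works once a multiclass averaging scheme is requested
auc = roc_auc_score(y_true, proba, multi_class="ovr")
print("one-vs-rest ROC AUC:", auc)
```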

mmagithub commented 2 years ago

Thanks paulsonak, I will give it a try. If the issue is only in the metrics, which I believe are standard anyway, can pulling in another metrics.py script, for example the one shipped with DeepChem, solve this? Or are the metrics calculations tied to other parts of the code in ways that would mess things up?

paulsonak commented 2 years ago

I'd start with double checking the error after defining class_number in your config file. Then, tracing the code through perf_data.py. Most of our metrics are pulled directly from sklearn and computed during training and at the end of training. It's definitely possible to get the predictions & compute your own metrics after training the model, but if you're not able to complete training then there would have to be some troubleshooting.

Feel free to post a minimally reproducible example and I will take a closer look too. Thanks!
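(As a sketch of the compute-your-own-metrics route mentioned above, with toy labels standing in for real model predictions; the multiclass-capable scorers in scikit-learn work directly:)

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy 4-class ground truth and model predictions
y_true = [0, 1, 2, 3, 1, 0, 2, 3]
y_pred = [0, 1, 2, 2, 1, 0, 2, 3]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "mcc": matthews_corrcoef(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```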

mmagithub commented 2 years ago

Hi paulsonak, here is a sample of the data and the script I am using.

data_ready_to_model.csv

code:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import getpass
import os

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

import atomsci.ddm.utils.data_curation_functions as dcf
import atomsci.ddm.utils.curate_data as curate_data
import atomsci.ddm.utils.struct_utils as struct_utils
import atomsci.ddm.pipeline.diversity_plots as dp
import atomsci.ddm.pipeline.chem_diversity as cd
import atomsci.ddm.pipeline.model_pipeline as mp
import atomsci.ddm.pipeline.parameter_parser as parse
from atomsci.ddm.pipeline import perf_plots as pp
from atomsci.ddm.pipeline import predict_from_model as pfm

response_col = "label"
compound_id = "smile_id"
smiles_col = "smiles"

train_config_rf = {
    "verbose": "True",
    "system": "LC",
    "lc_account": "None",
    "datastore": "False",
    "save_results": "False",
    "data_owner": "username",
    "prediction_type": "classification",
    "dataset_key": "./data_ready_to_model.csv",
    "id_col": compound_id,
    "smiles_col": smiles_col,
    "response_cols": response_col,
    "previously_split": "False",
    "split_only": "False",
    "featurizer": "ecfp",
    "model_type": "RF",
    "transformers": "True",
    "max_epochs": "70",
    "result_dir": "./content/RF",
    "splitter": "scaffold",
    "split_valid_frac": "0.15",
    "split_test_frac": "0.15",
}

train_params_rf = parse.wrapper(train_config_rf)
train_model_rf = mp.ModelPipeline(train_params_rf)
train_model_rf.train_model()

print('Now I finished RF, starting XGBOOST')

# Same config, but with model_type "xgboost"
train_config_xgboost = dict(train_config_rf,
                            model_type="xgboost",
                            result_dir="./content/xgboost")

train_params_xgboost = parse.wrapper(train_config_xgboost)
train_model_xgboost = mp.ModelPipeline(train_params_xgboost)
train_model_xgboost.train_model()

print('Now I finished XGBOOST, starting NN')

# Same config again, but with model_type "NN"
train_config_NN = dict(train_config_rf,
                       dataset_key="./gpcr_input_df.csv",
                       model_type="NN",
                       result_dir="./content/NN")

train_params_NN = parse.wrapper(train_config_NN)
train_model_NN = mp.ModelPipeline(train_params_NN)
train_model_NN.train_model()
```

paulsonak commented 2 years ago

Hi @mmagithub I was able to reproduce your error. I added the line "class_number":4 to your NN config dict and was able to successfully train a model:

```python
train_config_NN = {
    "verbose": "True",
    "system": "LC",
    "lc_account": "None",
    "datastore": "False",
    "save_results": "False",
    "data_owner": "username",
    "prediction_type": "classification",
    "dataset_key": "./data_ready_to_model.csv",
    "id_col": compound_id,
    "smiles_col": smiles_col,
    "response_cols": response_col,
    "previously_split": "False",
    "split_only": "False",
    "featurizer": "ecfp",
    "model_type": "NN",
    "transformers": "True",
    "max_epochs": "70",
    "result_dir": "./content/NN",
    "splitter": "scaffold",
    "split_valid_frac": "0.15",
    "split_test_frac": "0.15",
    "class_number": 4,
}
```

mmagithub commented 2 years ago

Interesting, thanks paulsonak. Now, will the resulting performance measures (AUC-ROC, F-scores, MCC scores, etc.) make sense, or is the current implementation correct only for binary classification problems?

paulsonak commented 2 years ago

It looks like they should all be calculated correctly, with the exception of negative predictive value, which is only defined for binary models. See here for the code that calculates the scores. I checked with some models based on your code and it looks reasonable; npv is indeed not calculated.

(screenshot of the computed metrics table)

Let us know if you have more questions!
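(As a side note on why NPV is binary-only: it is defined from the 2x2 confusion matrix as TN / (TN + FN), and a k-class confusion matrix has no single "negative" cell. A toy scikit-learn sketch, not AMPL's code:)

```python
from sklearn.metrics import confusion_matrix

# Binary case: NPV falls directly out of the 2x2 confusion matrix
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
npv = tn / (tn + fn)
print("binary NPV:", npv)

# Multiclass case: the confusion matrix is k x k, so the same 4-way
# unpacking fails; there is no single TN/FN pair to build NPV from
try:
    tn, fp, fn, tp = confusion_matrix([0, 1, 2], [0, 2, 1]).ravel()
except ValueError as e:
    print("multiclass:", e)
```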

mmagithub commented 2 years ago

Did you manage to run the NN code to the end? I got a checkpoint read error with this config. I posted a new thread.