Duplicate parameters between improvelib and additional definitions - expected behavior?

jonesse3 commented 3 weeks ago

Tested with LGBM (note that DEBUG is the default log level):

DEBUG Config 2024-08-21 09:26:08,500:   Setting output directory
DEBUG CLI 2024-08-21 09:26:08,500:      Setting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,500:      Group: IMPROVE options
DEBUG CLI 2024-08-21 09:26:08,500:      Setting Group to <argparse._ArgumentGroup object at 0x7f1cf18d3ed0>
DEBUG CLI 2024-08-21 09:26:08,500:      Setting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,500:      Group: Train stage options
DEBUG CLI 2024-08-21 09:26:08,500:      Setting Group to <argparse._ArgumentGroup object at 0x7f1cf18df590>
DEBUG CLI 2024-08-21 09:26:08,501:      Setting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,501:      Group: Drug Response Prediction Training
DEBUG CLI 2024-08-21 09:26:08,501:      Setting Group to <argparse._ArgumentGroup object at 0x7f1cf18dff50>
additional_definitions(stage): [{'name': 'use_lincs', 'type': <function str2bool at 0x7f1cf1996c20>, 'default': True, 'help': 'Flag to indicate if landmark genes are used for gene selection.'}, {'name': 'scaling', 'type': <class 'str'>, 'default': 'std', 'choice': ['std', 'minmax', 'miabs', 'robust'], 'help': 'Scaler for gene expression and Mordred descriptors data.'}, {'name': 'ge_scaler_fname', 'type': <class 'str'>, 'default': 'x_data_gene_expression_scaler.gz', 'help': 'File name to save the gene expression scaler object.'}, {'name': 'md_scaler_fname', 'type': <class 'str'>, 'default': 'x_data_mordred_scaler.gz', 'help': 'File name to save the Mordred scaler object.'}, {'name': 'learning_rate', 'type': <class 'float'>, 'default': 0.1, 'help': 'learning rate test'}, {'name': 'n_estimators', 'type': <class 'int'>, 'default': 1000, 'help': 'Number of estimators.'}, {'name': 'max_depth', 'type': <class 'int'>, 'default': -1, 'help': 'Max depth.'}, {'name': 'num_leaves', 'type': <class 'int'>, 'default': 31, 'help': 'Number of leaves.'}]
DEBUG CLI 2024-08-21 09:26:08,501:      Setting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,501:      Group: Additional Parameters options
WARNING CLI 2024-08-21 09:26:08,501:    Found learning_rate in options. This option is predefined and can not be overwritten.
DEBUG CLI 2024-08-21 09:26:08,501:      Removing learning_rate from options
DEBUG CLI 2024-08-21 09:26:08,501:      Setting Group to <argparse._ArgumentGroup object at 0x7f1d6c24d510>
DEBUG Train 2024-08-21 09:26:08,501:    Initializing parameters for Train
DEBUG CLI 2024-08-21 09:26:08,501:      Setting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,501:      Group: None
WARNING CLI 2024-08-21 09:26:08,501:    No options provided. Ignoring.
DEBUG CLI 2024-08-21 09:26:08,501:      Getting Command Line Options
DEBUG CLI 2024-08-21 09:26:08,501:      Determining Options set by the Command Line
explicit cli: {'log_level': False, 'input_dir': False, 'output_dir': False, 'config_file': False, 'param_log_file': False, 'data_format': False, 'model_file_name': False, 'model_file_format': False, 'epochs': False, 'learning_rate': False, 'batch_size': False, 'val_batch': False, 'loss': False, 'early_stop_metric': False, 'patience': False, 'y_data_preds_suffix': False, 'json_scores_suffix': False, 'pred_col_name_suffix': False, 'y_col_name': False, 'y_data_suffix': False, 'use_lincs': False, 'scaling': False, 'ge_scaler_fname': False, 'md_scaler_fname': False, 'n_estimators': False, 'max_depth': False, 'num_leaves': False}
DEBUG Train 2024-08-21 09:26:08,502:    Loading configuration file
DEBUG Train 2024-08-21 09:26:08,502:    No config file provided. Using default: /nfs/lambda_stor_01/homes/ac.sejones/test_improve_library/LGBM/lgbm_params.txt
INFO Train 2024-08-21 09:26:08,503:     Loading config from /nfs/lambda_stor_01/homes/ac.sejones/test_improve_library/LGBM/lgbm_params.txt
DEBUG Train 2024-08-21 09:26:08,503:    Default parameters: {'log_level': 'DEBUG', 'input_dir': './', 'output_dir': './', 'config_file': None, 'param_log_file': 'param_log_file.txt', 'data_format': '.parquet', 'model_file_name': 'model', 'model_file_format': '.pt', 'epochs': 7, 'learning_rate': 7, 'batch_size': 7, 'val_batch': 64, 'loss': 'mse', 'early_stop_metric': 'mse', 'patience': 20, 'y_data_preds_suffix': 'predicted', 'json_scores_suffix': 'scores', 'pred_col_name_suffix': '_pred', 'y_col_name': 'auc', 'y_data_suffix': 'y_data', 'use_lincs': True, 'scaling': 'std', 'ge_scaler_fname': 'x_data_gene_expression_scaler.gz', 'md_scaler_fname': 'x_data_mordred_scaler.gz', 'n_estimators': 1000, 'max_depth': -1, 'num_leaves': 31}
DEBUG Train 2024-08-21 09:26:08,503:    Current section: Train
DEBUG Train 2024-08-21 09:26:08,504:    Current section config parameters: {'model_file_name': 'model', 'model_file_format': '.txt', 'n_estimators': '800', 'input_dir': './', 'output_dir': './'}
DEBUG Train 2024-08-21 09:26:08,504:    Updating config
INFO Train 2024-08-21 09:26:08,504:     Overriding model_file_name default with config value of model
INFO Train 2024-08-21 09:26:08,504:     Overriding model_file_format default with config value of .txt
INFO Train 2024-08-21 09:26:08,504:     Overriding n_estimators default with config value of 800
INFO Train 2024-08-21 09:26:08,504:     Overriding input_dir default with config value of ./
INFO Train 2024-08-21 09:26:08,504:     Overriding output_dir default with config value of ./
DEBUG Train 2024-08-21 09:26:08,504:    Current section CLI set parameters: {}
DEBUG Train 2024-08-21 09:26:08,504:    Final parameters: {'log_level': 'DEBUG', 'input_dir': './', 'output_dir': './', 'config_file': None, 'param_log_file': 'param_log_file.txt', 'data_format': '.parquet', 'model_file_name': 'model', 'model_file_format': '.txt', 'epochs': 7, 'learning_rate': 7, 'batch_size': 7, 'val_batch': 64, 'loss': 'mse', 'early_stop_metric': 'mse', 'patience': 20, 'y_data_preds_suffix': 'predicted', 'json_scores_suffix': 'scores', 'pred_col_name_suffix': '_pred', 'y_col_name': 'auc', 'y_data_suffix': 'y_data', 'use_lincs': True, 'scaling': 'std', 'ge_scaler_fname': 'x_data_gene_expression_scaler.gz', 'md_scaler_fname': 'x_data_mordred_scaler.gz', 'n_estimators': 800, 'max_depth': -1, 'num_leaves': 31}
DEBUG Train 2024-08-21 09:26:08,504:    Final parameters set.
DEBUG Train 2024-08-21 09:26:08,504:    Current log level is DEBUG
DEBUG Train 2024-08-21 09:26:08,504:    Saving final parameters to file.
xtr: (7616, 2571)
ytr: (7616, 1)
xvl: (952, 2571)
yvl: (952, 1)

IMPROVE_RESULT val_loss:        0.6393238524871944

Validation scores:
        {'mse': 0.6393238524871944, 'rmse': 0.7995772961303956, 'pcc': 0.8251719768910251, 'scc': 0.662553909807486, 'r2': -25.809588131160382, 'val_loss': 0.6393238524871944}

Finished model training.

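What should we do if learning_rate, which is already part of improvelib, is also added as a model-specific parameter? In the run above, the additional definition (default 0.1) was dropped with only a warning, and the final parameters kept the improvelib value (learning_rate: 7). @nkoussa, @adpartin, @priyanka9991, @wilke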

Options:

1. Throw an error and keep running the code (with the message shown even when not in DEBUG mode).
2. Throw an error and stop the code.
3. Have users run a unit test first to confirm that their code is compliant.
4. Print the error at the very end of stderr/stdout so that it stands out.
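
To make the "throw an error and stop" behavior concrete, here is a minimal sketch of what a strict check could look like. It is not the improvelib implementation: `DuplicateParameterError`, `find_duplicates`, and `merge_strict` are hypothetical names, and the predefined list is only a small subset of the names that appear in the log above.

```python
from typing import Iterable, List, Mapping


class DuplicateParameterError(ValueError):
    """Hypothetical error: a model-specific parameter reuses a predefined name."""


def find_duplicates(predefined_names: Iterable[str],
                    additional_definitions: Iterable[Mapping]) -> List[str]:
    """Return the sorted names that appear in both parameter sets."""
    predefined = set(predefined_names)
    return sorted({d["name"] for d in additional_definitions
                   if d["name"] in predefined})


def merge_strict(predefined_names: Iterable[str],
                 additional_definitions: Iterable[Mapping]) -> List[str]:
    """Combine both sets of names, stopping on the first sign of overlap."""
    predefined = list(predefined_names)
    additional = list(additional_definitions)
    duplicates = find_duplicates(predefined, additional)
    if duplicates:
        # Stop instead of silently dropping the model-specific definition.
        raise DuplicateParameterError(
            "Predefined parameter(s) redefined in additional definitions: "
            + ", ".join(duplicates)
        )
    return predefined + [d["name"] for d in additional]


# Illustrative subset of predefined names taken from the log above.
predefined = ["epochs", "learning_rate", "batch_size"]
additional = [
    {"name": "learning_rate", "type": float, "default": 0.1},
    {"name": "n_estimators", "type": int, "default": 1000},
]

try:
    merge_strict(predefined, additional)
except DuplicateParameterError as err:
    print(f"ERROR: {err}")
```

With a check like this, the run would end before training starts, rather than continuing with learning_rate set to the predefined default.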

nkoussa commented 3 weeks ago

I vote 2. It's the safest.

rajeeja commented 3 weeks ago

> I vote 2. It's the safest.

+1

adpartin commented 3 weeks ago

Let's go with option 2. This will ensure more robust models. I think option 3 does not contradict option 2, so we should include it in the near future. Basically, we'll need to create unit tests that inform curators of any issues in their models.
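
As a rough idea of what such a unit test might look like for curators, here is a sketch using the standard `unittest` module. Everything in it is an assumption for illustration: `RESERVED_NAMES` is a hand-copied subset of the predefined names from the log, and `duplicated_names` and the test class are hypothetical, not part of improvelib.

```python
import unittest

# Assumption: a small, hand-copied subset of the names improvelib predefines.
RESERVED_NAMES = frozenset({"epochs", "learning_rate", "batch_size", "loss"})


def duplicated_names(additional_definitions, reserved=RESERVED_NAMES):
    """Return model-specific parameter names that collide with reserved ones."""
    return sorted(d["name"] for d in additional_definitions
                  if d["name"] in reserved)


class TestAdditionalDefinitions(unittest.TestCase):
    # Hypothetical model-specific definitions a curator would supply.
    model_definitions = [
        {"name": "n_estimators", "type": int, "default": 1000},
        {"name": "max_depth", "type": int, "default": -1},
        {"name": "num_leaves", "type": int, "default": 31},
    ]

    def test_no_reserved_names(self):
        # An empty list means no definition reuses a predefined name.
        self.assertEqual(duplicated_names(self.model_definitions), [])
```

Saved in a test module, this could be run with `python -m unittest` before a model is submitted. Adding `learning_rate` to `model_definitions` would make the test fail and name the offending parameter.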

JDACS4C-IMPROVE / IMPROVE

Duplicate parameters between improvelib and additional definitions - expected behavior? #59