Closed GemmaTuron closed 12 months ago
I am tagging @ZakiaYahya here because she was also experiencing issues with protobuf versions
@GemmaTuron, We have a problem. While run.sh returns consistent values for the same file on different runs, the values returned using run.sh are different from the values when I serve the model within ersilia. output_run.csv 9f6t_cli_output.csv
Also, I have just tested the model using Colab. The Colab output is different. eos9f6t_colab_output.csv
I understand this model saves features every time before making predictions. But if the smiles in the file are the same, I don't get why the output values are different.
I wanted to check the original code, but the site is not loading http://chemprop.csail.mit.edu/checkpoints
Hi @HellenNamulinda
Before going into this with detail, can you explain the changes in the predict.py
file? Why did you add the removal of saved features, which in principle is created in a temporal directoly only?
And then, instead of trying the original code, could you try the code before refactoring to see what we get then? Just clone the repo at the latest commit in history before changing it - you can do the necessary changes in the dockerfile though
Hi @HellenNamulinda
Before going into this with detail, can you explain the changes in the
predict.py
file? Why did you add the removal of saved features, which in principle is created in a temporal directoly only?And then, instead of trying the original code, could you try the code before refactoring to see what we get then? Just clone the repo at the latest commit in history before changing it - you can do the necessary changes in the dockerfile though
Hi @GemmaTuron Sorry I hadn't explained this in detail.
save_features.py expects save_path: str # Path to .npz file where features will be saved as a compressed numpy archive
For the temp_dir
; this is for storing temporary .npz files containing features for each molecule, and it is deleted after generating the features for all the molecules.
So, the final features for all molecules will be saved in the feat_file
provided as an argument(save_path
), and it is this feat_file that is used to make predictions by predict.py(--features_path {4}
)
From the previous code in service.py, save_features.py saves features for the data_file in feat_file(sav_path). And predict.py takes this feat_file to make predictions on the data_file.
lines += [
"python {0}/save_features.py --data_path {1} --save_path {2} --features_generator rdkit_2d_normalized".format(
self.framework_dir, data_file, feat_file
)
]
lines += [
"python {0}/predict.py --test_path {1} --checkpoint_dir {2} --preds_path {3} --features_path {4} --no_features_scaling".format(
self.framework_dir,
data_file,
self.checkpoints_dir,
pred_file,
feat_file,
)
]
The problem here is; we either have to keep changing the save_path for each run, or features generated from the previous data will be used. If we don't remove the feat_file for the first data _file used and we run the command for the second time, it will give a value error; ValueError: "features.npz" already exists and args.restart is False
, and therefore features generated from the previous data will be used.
Saving predictions to output_run2.csv
Elapsed time = 0:00:02
(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ bash run.sh . ~/eml_50_1.csv ~/Outreachy/eos9f6t/model/checkpoints/SARSBalanced eml_output_run2.csv
now getting features
Traceback (most recent call last):
File "./code/save_features.py", line 117, in <module>
generate_and_save_features(Args().parse_args())
File "./code/save_features.py", line 76, in generate_and_save_features
raise ValueError(f'"{args.save_path}" already exists and args.restart is False.')
ValueError: "features.npz" already exists and args.restart is False.
Also, if the next file has a different number of smiles, predict.py will raise an index error; forexample test.csv has 10 smiles and eml_50.csv has 50 smiles. This causes an index error; IndexError: index 10 is out of bounds for axis 0 with size 10
because the data has more number of sample/smiles than there are in the features.
ValueError: "features.npz" already exists and args.restart is False.
Loading training args
Loading data
10it [00:00, 37349.10it/s]
Traceback (most recent call last):
File "./code/predict.py", line 10, in <module>
chemprop_predict()
File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/make_predictions.py", line 176, in chemprop_predict
make_predictions(args=PredictArgs().parse_args())
File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/utils.py", line 437, in wrap
result = func(*args, **kwargs)
File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/make_predictions.py", line 53, in make_predictions
skip_invalid_smiles=False, args=args, store_row=not args.drop_extra_columns)
File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/data/utils.py", line 245, in get_data
all_features.append(features_data[i])
IndexError: index 10 is out of bounds for axis 0 with size 10
Changing the save_path for the features each time means creating multiple features files and hence consuming storage.
Hi @GemmaTuron, I at first confused the temporary directory you were referring to.
After resetting the code to a version before the new changes, I was finally able to identify the cause of the variations(which were coming from service.py) and resolved it.
The code before was working with ersilia because it uses temporary directories which are different every run time, even though the name of the features_file remains the same. 9f6t_cli__original_output.csv
def __init__(self):
self.DATA_FILE = "data.csv"
self.FEAT_FILE = "features.npz"
self.PRED_FILE = "pred.csv"
self.RUN_FILE = "run.sh"
..
def predict(self, smiles_list):
tmp_folder = tempfile.mkdtemp()
data_file = os.path.join(tmp_folder, self.DATA_FILE)
feat_file = os.path.join(tmp_folder, self.FEAT_FILE)
pred_file = os.path.join(tmp_folder, self.PRED_FILE)
However, running the code directly in the model folder still required every run to change the --save_path
, where features are saved or first delete the previous file if the same name. otherwise, we would get the value error; ValueError: "features.npz" already exists and args.restart is False.
Previously, I had made the feature file(features.npz
) static in run.sh(because I thought reducing the number of args would make running code easier), but apparently, this led to variations in the ersilia output. (This needed to be dynamic, and provided by ersilia).
Passing it as an argument works fine and the results are consistent both when using run.sh and within ersilia.
def run(self, input_list):
tmp_folder = tempfile.mkdtemp(prefix="eos-")
data_file = os.path.join(tmp_folder, self.DATA_FILE)
feat_file = os.path.join(tmp_folder, self.FEAT_FILE)
output_file = os.path.join(tmp_folder, self.OUTPUT_FILE)
log_file = os.path.join(tmp_folder, self.LOG_FILE)
...
with open(run_file, "w") as f:
lines = [
"bash {0}/run.sh {0} {1} {2} {3} {4}".format(
self.framework_dir,
data_file,
self.checkpoints_dir,
output_file,
feat_file
)
]
So, using run.sh will require the 5 args, as specified in service.py
python $1/code/save_features.py --data_path $2 --save_path $5 --features_generator rdkit_2d_normalized
python $1/code/predict.py --no_features_scaling --test_path $2 --checkpoint_dir $3 --preds_path $4 --features_path $5
The removal of saved features after making predictions in predict.py solves the ValueError
when using ru.sh and passing the same path for saving features(this is an intermediate file). This doesn't affect the model performance or lead to inconsistent output.
Input file: test.csv Run.sh: test_output_run.csv CLI: 9f6t_cli_output2.csv
EML Dataset (only smiles column) Input file: eml_canonical_1.csv Run.sh: eml_output_run.csv CLI: 9f6t_cli_eml_output.csv
The PR was created for these changes
Hi @HellenNamulinda !!
Great work, sorry if my indications were confusing. I'll look at the PR!
Hello @GemmaTuron, The temporary directory used by ersilia is created by service.py I had confused it with the temp directory created by this model's code(save_features.py)
It's all good now. Thank you for the guidance always :clap:
@HellenNamulinda
The workflows are updated but they fail at fetching the model, please check, thanks!
Hi @GemmaTuron,
The error was descriptor '__init__' requires a 'Exception' object but received a 'str'
, which is the same in model eos43at, also Feb got the same error with models eos85a3 and eos1af5
I hadn't set a specific version of rdkit. And it was installed twice in the same environment. at step 362 and at 632
Specifying the rdkit version to 2022.9.5
didn't help, two versions were being installed in the same environment: Downloading rdkit_pypi-2022.9.5 and Downloading rdkit-2023.3.2. I noticed the second version was being installed after descriptastorus
On checking the repository for descriptastorus, it was updated 4 days ago.
So, installing the version before the recent changes solved the issue(locally)
From RUN pip install git+https://github.com/bp-kelley/descriptastorus
to RUN pip install git+https://github.com/bp-kelley/descriptastorus.git@86eedc60546abe6f59cdbcb12025a61157ba178d
With the new changes, the model works well locally, but The Model Test on PR failed while checking the metadata
Good catch @HellenNamulinda ! and an issue because descriptastorus is used by other models, I hope they have the version fixed as well In principle the updated workflows work, I've merged the PR
Hi @GemmaTuron, This model was not working initially. For fetching, the error was not good enough to identify the problem. eos9f6t_fetch_repo.log
After running one of the Python files specified service.py, the error was related to the incompatibility of tensorbodX and protobuf. Bentoml depends on protobuf<3.19,>=3.8.0, but the initial requirements were installing tensorboardx 2.6.1 while installing
chemprop==1.3.0
, which caused a dependency errortensorboardx 2.6.1 requires protobuf>=4.22.3, but you have protobuf 3.18.3 which is incompatible.
So, the tensorboardX version for chemprop1.3 was specified (
pip install tensorboardX==2.0
) and it resolved the issue.test: test.csv output: out.csv
Creating the run.sh file Because of this code that was in service.py, the run.sh file has two commands.
For the two python files(save_features.py and predict.py,
--data_path
and--test_path
take the same argument,data_file
. Also,--save_path
and--features_path
take the same file;feat_file
.Run.sh file;
Testing this works well when I use both run.sh and fetch within ersilia Run.sh: output_run.csv
(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ bash run.sh . ~/test.csv ~/Outreachy/eos9f6t/model/checkpoints/SARSBalanced output_run.csv
CLI 9f6t_cli_output.csv
The PR was created.