ersilia-os / eos9f6t

Predicts the inhibitory activity of small molecules on SARS-CoV-1 3CL protease
GNU General Public License v3.0

Clean UP & Dockerization eos9f6t #1

Closed GemmaTuron closed 12 months ago

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, This model was not working initially. For fetching, the error message was not informative enough to identify the problem. eos9f6t_fetch_repo.log

Model API eos9f6t:predict did not produce an output
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 1: Computes and saves molecular features for a dataset.: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 3: from: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 4: import: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 5: import: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 6: import: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 7: from: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 9: from: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 10: from: command not found
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 12: syntax error near unexpected token `os.path.dirname'
/home/hellenah/eos/repository/eos9f6t/20230713134942_47AFAA/eos9f6t/artifacts/framework/save_features.py: line 12: `sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))'
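The `command not found` and `syntax error` messages above are characteristic of a Python file being handed to the shell (e.g. `bash save_features.py`) instead of the Python interpreter, so bash tries to run `from` and `import` as commands. A minimal sketch of a robust invocation from Python, using `sys.executable` (the `demo.py` stand-in script is hypothetical, standing in for save_features.py):

```python
import os
import subprocess
import sys
import tempfile

# Write a tiny stand-in script; the real file would be save_features.py.
script = os.path.join(tempfile.mkdtemp(), "demo.py")
with open(script, "w") as f:
    f.write("print('features saved')\n")

# Invoking through the Python interpreter (rather than bash) avoids the shell
# trying to execute "from"/"import" lines as shell commands.
result = subprocess.run([sys.executable, script], capture_output=True, text=True)
print(result.stdout.strip())  # features saved
```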

After running one of the Python files specified in service.py, the error was related to an incompatibility between tensorboardX and protobuf. Bentoml depends on protobuf<3.19,>=3.8.0, but the initial requirements installed tensorboardX 2.6.1 (pulled in by chemprop==1.3.0), which caused a dependency error: tensorboardx 2.6.1 requires protobuf>=4.22.3, but you have protobuf 3.18.3 which is incompatible.

(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ python save_features.py --data_path ~/test.csv --save_path feats --features_generator rdkit_2d_normalized
Traceback (most recent call last):
  File "save_features.py", line 14, in <module>
    from chemprop.data import get_smiles
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/__init__.py", line 4, in <module>
    import chemprop.train
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/__init__.py", line 1, in <module>
    from .cross_validate import chemprop_train, cross_validate, TRAIN_LOGGER_NAME
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/cross_validate.py", line 12, in <module>
    from .run_training import run_training
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/run_training.py", line 8, in <module>
    from tensorboardX import SummaryWriter
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/torchvis.py", line 10, in <module>
    from .writer import SummaryWriter
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/writer.py", line 16, in <module>
    from .comet_utils import CometLogger
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/comet_utils.py", line 7, in <module>
    from .summary import _clean_tag
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/summary.py", line 12, in <module>
    from .proto.summary_pb2 import Summary
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/tensorboardX/proto/summary_pb2.py", line 5, in <module>
    from google.protobuf.internal import builder as _builder
ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/google/protobuf/internal/__init__.py)

So, the tensorboardX version compatible with chemprop 1.3.0 was pinned (pip install tensorboardX==2.0), which resolved the issue.
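The conflict the pin resolves can be reduced to a pure version-bound check. This is an illustrative sketch (the `satisfies` helper is hypothetical, not part of pip or bentoml), showing that no single protobuf version can satisfy both constraints:

```python
def satisfies(version, lower=None, upper=None):
    """Check a (major, minor, micro) version tuple against an inclusive
    lower bound and an exclusive upper bound."""
    if lower is not None and version < lower:
        return False
    if upper is not None and version >= upper:
        return False
    return True

installed = (3, 18, 3)  # the protobuf version bentoml's pin resolves to

# bentoml requires protobuf>=3.8.0,<3.19
ok_for_bentoml = satisfies(installed, lower=(3, 8, 0), upper=(3, 19, 0))
# tensorboardX 2.6.1 requires protobuf>=4.22.3
ok_for_tensorboardx = satisfies(installed, lower=(4, 22, 3))

print(ok_for_bentoml, ok_for_tensorboardx)  # True False: no version satisfies both
```

Pinning tensorboardX==2.0 sidesteps the conflict because that release predates the protobuf>=4 requirement.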

(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ python save_features.py --data_path ~/test.csv --save_path feats --features_generator rdkit_2d_normalized
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 33.55it/s]
(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ python predict.py --test_path ~/test.csv --checkpoint_dir ~/Outreachy/eos9f6t/model/SARSBalanced --preds_path out.csv --features_path feats.npz --no_features_scaling

test: test.csv output: out.csv

Creating the run.sh file: because of this code in service.py, the run.sh file has two commands.

with open(run_file, "w") as f:
    lines = []
    lines += [
        "python {0}/save_features.py --data_path {1} --save_path {2} --features_generator rdkit_2d_normalized".format(
            self.framework_dir, data_file, feat_file
        )
    ]
    lines += [
        "python {0}/predict.py --test_path {1} --checkpoint_dir {2} --preds_path {3} --features_path {4} --no_features_scaling".format(
            self.framework_dir,
            data_file,
            self.checkpoints_dir,
            pred_file,
            feat_file,
        )
    ]
    f.write(os.linesep.join(lines))
cmd = "bash {0}".format(run_file)

For the two Python files (save_features.py and predict.py), --data_path and --test_path take the same argument, data_file. Also, --save_path and --features_path take the same file, feat_file.

run.sh file:

python $1/code/save_features.py --data_path $2 --save_path 'features.npz' --features_generator rdkit_2d_normalized
python $1/code/predict.py --no_features_scaling --features_path 'features.npz' --test_path $2 --checkpoint_dir $3 --preds_path $4
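The two-step pipeline above only produces sensible predictions when both steps agree on the input file and the features file. A minimal sketch of that invariant (all paths here are hypothetical placeholders):

```python
# Hypothetical paths, for illustration only.
framework_dir = "."
data_file = "test.csv"
feat_file = "features.npz"
checkpoints_dir = "checkpoints/SARSBalanced"
pred_file = "out.csv"

# Step 1 writes features for data_file into feat_file...
save_cmd = (
    f"python {framework_dir}/save_features.py --data_path {data_file} "
    f"--save_path {feat_file} --features_generator rdkit_2d_normalized"
)
# ...and step 2 reads the same data_file and feat_file to produce pred_file.
predict_cmd = (
    f"python {framework_dir}/predict.py --test_path {data_file} "
    f"--checkpoint_dir {checkpoints_dir} --preds_path {pred_file} "
    f"--features_path {feat_file} --no_features_scaling"
)

# The pipeline is only consistent if both steps share data_file and feat_file.
assert f"--data_path {data_file}" in save_cmd
assert f"--test_path {data_file}" in predict_cmd
assert feat_file in save_cmd and feat_file in predict_cmd
print("commands agree on shared arguments")
```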

Testing: this works well both when I use run.sh directly and when I fetch within ersilia.

run.sh: output_run.csv
(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ bash run.sh . ~/test.csv ~/Outreachy/eos9f6t/model/checkpoints/SARSBalanced output_run.csv

CLI 9f6t_cli_output.csv

🚀 Serving model eos9f6t: chemprop-sars-cov-inhibition

   URL: http://127.0.0.1:47443
   PID: 23375
   SRV: conda

👉 To run model:
   - run

💁 Information:
   - info
(ersilia) hellenah@hellenah-elitebook:~$ ersilia run -i "FC(F)Oc1ccc(-c2nnc3cncc(Oc4ccc5ccsc5c4)n23)cc1"
{
    "input": {
        "key": "MLBNXJTXHVBPEC-UHFFFAOYSA-N",
        "input": "FC(F)Oc1ccc(-c2nnc3cncc(Oc4ccc5ccsc5c4)n23)cc1",
        "text": "FC(F)Oc1ccc(-c2nnc3cncc(Oc4ccc5ccsc5c4)n23)cc1"
    },
    "output": {
        "activity": 0.3765662431716919
    }
}

The PR was created.

GemmaTuron commented 1 year ago

I am tagging @ZakiaYahya here because she was also experiencing issues with protobuf versions

HellenNamulinda commented 1 year ago

@GemmaTuron, We have a problem. While run.sh returns consistent values for the same file on different runs, the values returned using run.sh are different from the values when I serve the model within ersilia. output_run.csv 9f6t_cli_output.csv

Also, I have just tested the model using Colab. The Colab output is different. eos9f6t_colab_output.csv

I understand this model saves features every time before making predictions. But if the SMILES in the file are the same, I don't see why the output values differ.

I wanted to check the original code, but the site is not loading http://chemprop.csail.mit.edu/checkpoints

GemmaTuron commented 1 year ago

Hi @HellenNamulinda

Before going into this in detail, can you explain the changes in the predict.py file? Why did you add the removal of saved features, which in principle is created in a temporary directory only?

And then, instead of trying the original code, could you try the code before refactoring to see what we get then? Just clone the repo at the latest commit in history before changing it - you can make the necessary changes in the Dockerfile though

HellenNamulinda commented 1 year ago

> Hi @HellenNamulinda
>
> Before going into this in detail, can you explain the changes in the predict.py file? Why did you add the removal of saved features, which in principle is created in a temporary directory only?
>
> And then, instead of trying the original code, could you try the code before refactoring to see what we get then? Just clone the repo at the latest commit in history before changing it - you can make the necessary changes in the Dockerfile though

Hi @GemmaTuron Sorry I hadn't explained this in detail.

save_features.py expects save_path: str, the path to a .npz file where features will be saved as a compressed numpy archive. The temp_dir is for storing temporary .npz files containing features for each molecule, and it is deleted after the features for all molecules have been generated. So the final features for all molecules are saved in the feat_file provided as an argument (save_path), and it is this feat_file that predict.py uses to make predictions (--features_path {4}).

From the previous code in service.py, save_features.py saves features for the data_file in feat_file (save_path), and predict.py takes this feat_file to make predictions on the data_file.

lines += [
    "python {0}/save_features.py --data_path {1} --save_path {2} --features_generator rdkit_2d_normalized".format(
        self.framework_dir, data_file, feat_file
    )
]
lines += [
    "python {0}/predict.py --test_path {1} --checkpoint_dir {2} --preds_path {3} --features_path {4} --no_features_scaling".format(
        self.framework_dir,
        data_file,
        self.checkpoints_dir,
        pred_file,
        feat_file,
    )
]

The problem here is that we either have to keep changing the save_path for each run, or features generated from the previous data will be used. If we don't remove the feat_file from the first data_file and run the command a second time, it raises ValueError: "features.npz" already exists and args.restart is False.

Saving predictions to output_run2.csv                                                                                            
Elapsed time = 0:00:02
(eos9f6t) hellenah@hellenah-elitebook:~/Outreachy/eos9f6t/model/framework$ bash run.sh . ~/eml_50_1.csv  ~/Outreachy/eos9f6t/model/checkpoints/SARSBalanced eml_output_run2.csv
now getting features
Traceback (most recent call last):
  File "./code/save_features.py", line 117, in <module>
    generate_and_save_features(Args().parse_args())
  File "./code/save_features.py", line 76, in generate_and_save_features
    raise ValueError(f'"{args.save_path}" already exists and args.restart is False.')
ValueError: "features.npz" already exists and args.restart is False.
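The guard that produces this error can be sketched as follows. This is an illustrative reconstruction of the check in save_features.py, not chemprop's actual code:

```python
import os
import tempfile

def save_features(save_path, restart=False):
    # Illustrative sketch of the guard: refuse to overwrite an existing
    # features file unless a restart is explicitly requested.
    if os.path.exists(save_path) and not restart:
        raise ValueError(f'"{save_path}" already exists and args.restart is False.')
    with open(save_path, "wb") as f:
        f.write(b"")  # stands in for the compressed .npz archive

path = os.path.join(tempfile.mkdtemp(), "features.npz")
save_features(path)       # first run: file is created
try:
    save_features(path)   # second run, same path: guard fires
except ValueError as e:
    print(type(e).__name__)  # ValueError
```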

Also, if the next file has a different number of SMILES, predict.py raises an index error. For example, test.csv has 10 SMILES and eml_50.csv has 50 SMILES; this causes IndexError: index 10 is out of bounds for axis 0 with size 10 because the data has more samples/SMILES than there are rows in the features file.

ValueError: "features.npz" already exists and args.restart is False.
Loading training args
Loading data
10it [00:00, 37349.10it/s]
Traceback (most recent call last):
  File "./code/predict.py", line 10, in <module>
    chemprop_predict() 
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/make_predictions.py", line 176, in chemprop_predict
    make_predictions(args=PredictArgs().parse_args())
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/utils.py", line 437, in wrap
    result = func(*args, **kwargs)
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/train/make_predictions.py", line 53, in make_predictions
    skip_invalid_smiles=False, args=args, store_row=not args.drop_extra_columns)
  File "/home/hellenah/anaconda3/envs/eos9f6t/lib/python3.7/site-packages/chemprop/data/utils.py", line 245, in get_data
    all_features.append(features_data[i])
IndexError: index 10 is out of bounds for axis 0 with size 10
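The mismatch can be caught with a simple length check before any indexing happens. A minimal sketch with made-up data (stale features from a 10-molecule run against a new 50-molecule input):

```python
# Illustrative data only: stale features for 10 molecules, new input of 50.
features_data = [[0.1, 0.2]] * 10
smiles = ["C"] * 50

# A length check catches the problem before chemprop ever indexes
# features_data[i] past the end of the array.
if len(features_data) != len(smiles):
    print(f"feature/SMILES mismatch: {len(features_data)} features vs {len(smiles)} molecules")
```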

Changing the save_path for the features on each run means creating multiple feature files and hence consuming storage.

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, At first I was confused about which temporary directory you were referring to.

After resetting the code to a version before the new changes, I was finally able to identify the cause of the variations (which were coming from service.py) and resolve it.

The previous code worked with ersilia because it uses temporary directories, which are different on every run, even though the name of the features file stays the same. 9f6t_cli__original_output.csv

    def __init__(self):
        self.DATA_FILE = "data.csv"
        self.FEAT_FILE = "features.npz"
        self.PRED_FILE = "pred.csv"
        self.RUN_FILE = "run.sh"
..
    def predict(self, smiles_list):
        tmp_folder = tempfile.mkdtemp()
        data_file = os.path.join(tmp_folder, self.DATA_FILE)
        feat_file = os.path.join(tmp_folder, self.FEAT_FILE)
        pred_file = os.path.join(tmp_folder, self.PRED_FILE)

However, running the code directly in the model folder still required changing the --save_path on every run, or first deleting the previous file of the same name; otherwise, we would get ValueError: "features.npz" already exists and args.restart is False.

Previously, I had made the feature file (features.npz) static in run.sh (because I thought reducing the number of arguments would make running the code easier), but apparently this led to variations in the ersilia output. This path needed to be dynamic and provided by ersilia. Passing it as an argument works fine, and the results are consistent both when using run.sh and within ersilia.

 def run(self, input_list):
        tmp_folder = tempfile.mkdtemp(prefix="eos-")
        data_file = os.path.join(tmp_folder, self.DATA_FILE)
        feat_file = os.path.join(tmp_folder, self.FEAT_FILE)
        output_file = os.path.join(tmp_folder, self.OUTPUT_FILE)
        log_file = os.path.join(tmp_folder, self.LOG_FILE)
...

with open(run_file, "w") as f:
            lines = [
                "bash {0}/run.sh {0} {1} {2} {3} {4}".format(
                        self.framework_dir,
                        data_file,
                        self.checkpoints_dir,
                        output_file,
                        feat_file
                    )
                ] 

So, using run.sh requires the 5 arguments, as specified in service.py:

python $1/code/save_features.py --data_path $2 --save_path $5 --features_generator rdkit_2d_normalized
python $1/code/predict.py --no_features_scaling --test_path $2 --checkpoint_dir $3 --preds_path $4 --features_path $5
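The reason this fixes the collisions is that `tempfile.mkdtemp` returns a fresh directory on every call, so a file named features.npz inside it can never clash with the one from a previous run. A minimal sketch of that property, mirroring the service.py pattern:

```python
import os
import tempfile

# Each mkdtemp call returns a new unique directory, so "features.npz"
# inside it never collides with the file from a previous run.
first = os.path.join(tempfile.mkdtemp(prefix="eos-"), "features.npz")
second = os.path.join(tempfile.mkdtemp(prefix="eos-"), "features.npz")

print(first != second)                                      # True: full paths differ
print(os.path.basename(first) == os.path.basename(second))  # True: same filename
```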

The removal of saved features after making predictions in predict.py solves the ValueError when using run.sh and passing the same path for saving features (this is an intermediate file). This doesn't affect the model's performance or lead to inconsistent output.
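The cleanup described above can be sketched as a try/finally around the prediction step. This is an illustrative reconstruction (the `predict_and_cleanup` function is hypothetical; the real change lives inside predict.py):

```python
import os
import tempfile

def predict_and_cleanup(feat_file):
    # Sketch of the cleanup: whatever happens during prediction, remove the
    # intermediate features file afterwards so the same path can be reused
    # on the next run without hitting the "already exists" ValueError.
    try:
        pass  # predictions would be made here using feat_file
    finally:
        if os.path.exists(feat_file):
            os.remove(feat_file)

feat = os.path.join(tempfile.mkdtemp(), "features.npz")
with open(feat, "wb"):
    pass  # simulate a features file left behind by save_features.py
predict_and_cleanup(feat)
print(os.path.exists(feat))  # False
```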

Input file: test.csv Run.sh: test_output_run.csv CLI: 9f6t_cli_output2.csv

EML Dataset (only smiles column) Input file: eml_canonical_1.csv Run.sh: eml_output_run.csv CLI: 9f6t_cli_eml_output.csv

The PR was created for these changes

GemmaTuron commented 1 year ago

Hi @HellenNamulinda !!

Great work, sorry if my indications were confusing. I'll look at the PR!

HellenNamulinda commented 1 year ago

Hello @GemmaTuron, The temporary directory used by ersilia is created by service.py; I had confused it with the temp directory created by this model's code (save_features.py).

It's all good now. Thank you for the guidance always :clap:

GemmaTuron commented 12 months ago

@HellenNamulinda

The workflows are updated but they fail at fetching the model, please check, thanks!

HellenNamulinda commented 12 months ago

Hi @GemmaTuron, The error was descriptor '__init__' requires a 'Exception' object but received a 'str', which is the same error as in model eos43at; Feb also got the same error with models eos85a3 and eos1af5.

I hadn't set a specific version of rdkit, and it was being installed twice in the same environment: at step 362 and at step 632.

Pinning rdkit to 2022.9.5 didn't help; two versions were still being installed in the same environment (Downloading rdkit_pypi-2022.9.5 and Downloading rdkit-2023.3.2). I noticed the second version was being installed after descriptastorus.

On checking the descriptastorus repository, it had been updated 4 days ago. So, installing the version before the recent changes solved the issue (locally): from RUN pip install git+https://github.com/bp-kelley/descriptastorus to RUN pip install git+https://github.com/bp-kelley/descriptastorus.git@86eedc60546abe6f59cdbcb12025a61157ba178d

With the new changes, the model works well locally, but the Model Test on the PR failed while checking the metadata.

GemmaTuron commented 12 months ago

Good catch @HellenNamulinda! This is also an issue because descriptastorus is used by other models; I hope they have the version pinned as well. In principle the updated workflows work, and I've merged the PR.