ersilia-os / eos8fth

GNU General Public License v3.0

Clean UP & Dockerization eos8fth #1

Closed GemmaTuron closed 11 months ago

simrantan commented 1 year ago

@GemmaTuron The model failed to be fetched on my device because of missing information in the metadata.json throwing an error - when I investigated the file I saw that it says the model is "in progress", not "ready" - is it just that the metadata needs to be updated?

GemmaTuron commented 1 year ago

@simrantan

This model was never completed. Would you take it up and work on it? It relates to this issue: https://github.com/ersilia-os/ersilia/issues/657

simrantan commented 1 year ago

Yes! I will look into the issue and begin working on the model.

simrantan commented 1 year ago

The model appears to be just the template with no added functionality. I have spent a while reading the source code of the model as well as the model incorporation guide to understand how I can add functionality to this model, since it is the first one I will be working on. I am still working out whether I should link the Ersilia model to the source code or replicate the source code's functionality in the model. I am planning to continue working on this issue tomorrow.

GemmaTuron commented 1 year ago

Hi @simrantan

The steps for model incorporation are detailed here. The code for the model is referenced in the Model Request Issue: https://github.com/sirimullalab/redial-2020/tree/v1.0

It seems to be well documented. I'd suggest starting with the "Manual Start": install the model locally, then transfer the necessary files to the eos repository.

simrantan commented 1 year ago

@GemmaTuron

I have installed the model and been working on running it to see how it works and what the necessary files are. I have a few questions I wanted to check in about before I transfer the files to the eos repository.

First, there are two files that could be added as the "model" from Redial-2020. One is run_predictions.py in the batch_screen folder, which takes an input csv file of SMILES and runs it through the numerous models in this repo. Based on the instructions in the original repo, I have been able to download the whole repo and run this file. However, the output of this file is about 15 different csv files, since it runs multiple models. If we were to transfer this file to the eos repository, we would also need to transfer all the models it uses, since this file cannot run properly without all the necessary models and their requirements.

The other file that could be used is run_script.py, which takes in one SMILES string and returns a similarity dict and a prediction dict. I am trying to test this file in isolation (I created a conda environment and individually downloaded files as necessary, to see what is needed, as instructed in the model incorporation workflow), but I am running into an issue. When running this script from the command line, the import statements appear not to take effect before the main function runs, which causes module-not-found or "x is not defined" errors even though the modules are imported/defined at the top of the script. So, when a module imported at the top is called in main, it throws an error and I can't see if the script runs. I have looked online for solutions, and the only suggestions appear to be to edit the source code or to run it a different way. If I move this file to the eos repository I could see if it works there, but I wanted to check whether this file or run_predictions.py is the one we want for eos8fth.

I am ready to incorporate either of these files (have found the dependencies and know the libraries needed in the dockerfile) as soon as we know which we should focus on!

GemmaTuron commented 1 year ago

Can you show me which outputs it gives in the 15 .csv files and the similarity search with prediction dict so I can follow?

The models will all need to be added to the repo, of course; if I am not wrong, they are the .pkl files in models-tuned-best. For the output, what we would do is merge the 15 files into a single one, if we think they are all relevant.

Also, please paste the error output for the last bit here so I can better understand; it is probably an issue with specifying the path to the packages.

simrantan commented 1 year ago

Yes absolutely!

For run_predictions.py in batch_screen, these are the output files:

These models are of the fingerprint, rdkit descriptor, and pharmacophore variety. The output files are:

- 3CL-sample_data-consensus.csv
- ACE2-sample_data-consensus.csv
- AlphaLISA-sample_data-consensus.csv
- CoV1-PPE_cs-sample_data-consensus.csv
- CoV1-PPE-sample_data-consensus.csv
- CPE-sample_data-consensus.csv
- hCYTOX-sample_data-consensus.csv
- cytotox-sample_data-consensus.csv
- MERS-PPE_cs-sample_data-consensus.csv
- MERS-PPE-sample_data-consensus.csv
- TruHit-sample_data-consensus.csv

For run_script.py, I have not received an output because of the error (which occurs when downloading the whole repo and when run in an environment I created in isolation from the repo)

This is the error:

Traceback (most recent call last):
  File "run_script.py", line 768, in <module>
    main()
  File "run_script.py", line 742, in main
    start_time = time()
TypeError: 'module' object is not callable

It comes from this part of the code:

def main():

    # Calculate start time
    start_time = time()

    parser = argparse.ArgumentParser()
    parser.add_argument('--smiles', action='store', dest='smiles', required=False, type=str, help='SMILES string')
    args = parser.parse_args()
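
This TypeError usually means that time was imported as a module (import time) while the code calls time() as if the function itself had been imported; a minimal illustration using only the standard library (not the actual run_script.py code):

import time

# With "import time", the name "time" refers to the module, so calling time()
# raises TypeError: 'module' object is not callable.
start_time = time.time()   # call the function inside the module instead

# Alternatively, import the function directly, which is what the line
# "start_time = time()" expects:
# from time import time
# start_time = time()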

I followed the run instructions exactly, so I am wondering if this file maybe does not work? These are the run instructions from the file:

#########################----DOCUMENTATION OF THIS .PY FILE GIVEN BELOW-------###################################

'''

  1. Example command: python3 run_script.py --smiles "CCCCO"
  2. Takes one argument (SMILES string) and returns a top_n_smi_similarity dict and prediction_dict

Since run_script.py is not working for now, I have worked on forking and moving the files for run_predictions.py to the eos template and changing the paths to reflect the new folders the files are in.

GemmaTuron commented 1 year ago

@simrantan just to be clear

When you run, as they indicate, python3 run_predictions.py --csvfile <PATH_TO_CSV_FILE> --results <PATH_TO_SAVE_RESULTS>, do you get the 15 files separately, or are they collated into a single one?

I think what we want is the consensus column of each file, which tells you whether the molecule is a 3CL inhibitor, cytotoxic, etc. (each activity is indicated by the header of the file). So we should just parse the output of these files into a single one where each column name indicates the activity. Does this make sense? If they provide a single file as output when running the above command, it will just be a matter of copying the piece of code that does it; otherwise we'll have to do it manually.

simrantan commented 1 year ago

I get 11 separate files (I miscounted 15 earlier - there are more models in their models_best_tuned folder than are actually used, and the number of csv output files I get is 11). The <PATH_TO_SAVE_RESULTS> must be a directory for the command to work, and the directory is populated with all of the output files.

That makes sense to me - I will manually implement the parsing into a single file today!

simrantan commented 1 year ago

I have worked on the input and output adapter today. I also realized I needed to make path changes in more than just run_predictions.py, so I updated those paths. While doing so, I found I was missing some more files (a folder called "scalers" and another file), so I added them to the eos model and updated the corresponding paths that rely on them. I spent some time researching how to best collate the numerous output files into one csv file, and I found a method that should work, but it involves editing the source code - is this okay, or should I try to find a solution that adapts from main.py? I went with editing the source code because I wasn't sure how to get around the "path to save results" directory argument (maybe a temp directory?). I have begun implementing it today and will finish tomorrow, but I can switch my approach if preferred.

GemmaTuron commented 1 year ago

Hi @simrantan !

Maybe you can prepare to show this piece of code you are modifying in the 1:1 meeting today, so I can guide you better. It is preferred not to modify the source code; we could create a temporary file in which to save the results and then take them from there and collate them. That would perhaps be the easiest, specifying something like:

import tempfile
import os

# Create a temporary directory using the context manager
with tempfile.TemporaryDirectory() as tmp_dir:
    print(f'Temporary directory: {tmp_dir}')
    # Now we can create a file inside this directory
    temp_file_path = os.path.join(tmp_dir, 'temp_file.txt')
    # Write something to the file first...
    with open(temp_file_path, 'w') as temp_file:
        temp_file.write('intermediate results go here')
    # ...then open the file and read the contents back
    with open(temp_file_path, 'r') as temp_file:
        print(f'File content: {temp_file.read()}')
# The directory and its contents are removed automatically when the block exits

Sorry, this is a quick and dirty paste from ChatGPT, but I hope you get the idea. Let me know if this is helpful.
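
Adapted to this model, the idea would be to point the results argument at a temporary directory, collate the per-assay consensus files from it, and let everything be cleaned up automatically. A rough sketch, assuming pandas is available; run_batch_predictions is only a placeholder for whatever call actually writes the *-consensus.csv files, not the real function name:

import glob
import os
import tempfile

import pandas as pd

def predict_and_collate(input_csv, combined_output_file):
    with tempfile.TemporaryDirectory() as results_dir:
        # Placeholder for the call that populates results_dir with *-consensus.csv files
        run_batch_predictions(input_csv, results_dir)
        consensus = {}
        for path in sorted(glob.glob(os.path.join(results_dir, "*-consensus.csv"))):
            assay = os.path.basename(path).split("-")[0]  # e.g. "3CL", "ACE2"
            consensus[assay] = pd.read_csv(path)["Consensus"]
        pd.DataFrame(consensus).to_csv(combined_output_file, index=False)
    # results_dir and the intermediate files are deleted when the block exits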

simrantan commented 1 year ago

This is the code I've written so far (in run_predictions.py), but it is easily moved into main.py (I haven't edited any source code, just added this function):

import pandas as pd

def combine_consensus_results(output_files, combined_output_file):

    # empty list for consensus columns
    consensus_columns = []

    # extract the consensus column
    for file in output_files:
        df = pd.read_csv(file)
        consensus_columns.append(df['Consensus'])

    # Create new DataFrame w consensus
    combined_df = pd.concat(consensus_columns, axis=1)

    # Rename the columns
    combined_df.columns = [file.split('-')[0] for file in output_files]

    # Write to a new CSV file
    combined_df.to_csv(combined_output_file, index=False)
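
A possible way to call it over the populated results folder (the folder name is illustrative):

import glob
import os

output_files = sorted(glob.glob(os.path.join("temp_results", "*-consensus.csv")))
combine_consensus_results(output_files, "combined_consensus.csv")
# Note: with full paths, file.split('-')[0] keeps the directory prefix in the
# column names; os.path.basename(file).split('-')[0] would avoid that.
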
GemmaTuron commented 1 year ago

That looks good but we do not want the output files remaining there, so they should be removed eventually

simrantan commented 1 year ago

@GemmaTuron I ended up switching from working in the source code to working in main.py - it seems more straightforward, though a little less efficient. After looking around, it does seem like the most important column is "consensus", and I have added some functions to main.py to combine the columns from the different files (it was pretty simple with pandas!).

One question - run_predictions.py takes in a file containing SMILES, like Ersilia does, so I wasn't sure about writing an input adapter since the input seems fine as is? Let me know if I should write one anyway.

I have included a link to my edited main.py (I left the base input adapter function in, just in case, but am not using it currently). I am planning to test this file now to make sure it works and will continue debugging as needed!

Main.py

simrantan commented 1 year ago

@GemmaTuron I am currently working on fixing a bug I am running into with LFS and I have two questions (and will add some details on the bug, in case there is any advice)

  1. When running the model locally (using bash) I constantly run into errors about missing modules, because the environment doesn't have one package or another installed and this model requires a lot of modules to run. I am currently just creating a conda environment and manually installing the modules, but I was wondering if this means there is something wrong with the model (the example in the model incorporation workflow does not mention working in an environment where rdkit is installed, etc.).
  2. How do you classify the "task" of a model? I have found this link that describes what redial does (which is how I figured out that the consensus columns were most important), but I am unsure what terms Ersilia typically uses for tasks. For now I have put in "Prediction" as placeholder text.

Also, I followed the instructions for tracking the checkpoint files with git lfs, but I am getting this error after implementation: "Encountered 78 files that should have been pointers, but weren't". I have been looking online to see if there is some step I am missing; some websites suggest hard-resetting git lfs, while others suggest "migrating" the files one by one using this command: git lfs migrate import --yes --no-rewrite "FILENAME". However, I received this error when trying it on a file: Could not rewrite "TopologicalPharmacophoreAtomTripletsFingerprints.pl": unable to find entry TopologicalPharmacophoreAtomTripletsFingerprints.pl in tree

I am currently trying to see if there are other solutions but otherwise I may try resetting git lfs in my terminal and hope it works. If anyone has dealt with a similar issue I would love to hear about their solutions!

simrantan commented 1 year ago

I have created the environment with all the modules so that I can test the model, and I have found some bugs that I have worked on debugging. I worked on a bug relating to a missing argument for the get_predictions function; after some digging, I found that I needed to create a temporary directory for this function to run, since it is one of the expected arguments. I also had an error that eventually led me to find that my input adapter needed an extra step (adding a column header, 'SMILES'), which I implemented as well. I have also been working on the git lfs issue, deleting and reinstalling git lfs and attempting the lfs steps from the model incorporation workflow again. I managed to fix the git lfs issue for most files, but for some reason 32 files are still receiving the error. I am investigating why this issue persists and will experiment with adding the remaining files individually tomorrow.
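
A minimal sketch of that input-adapter step, assuming the incoming file is a plain list of SMILES (one per line) and the batch script wants a csv with a SMILES header (file names are illustrative):

import csv

def write_smiles_csv(smiles_file, csv_file):
    # Read one SMILES per line and re-save as a one-column CSV with the
    # "SMILES" header expected downstream.
    with open(smiles_file) as fh:
        smiles = [line.strip() for line in fh if line.strip()]
    with open(csv_file, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["SMILES"])
        for smi in smiles:
            writer.writerow([smi])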

I have gotten stuck on one part - While testing the model, I have run into this issue -

Traceback (most recent call last):
  File "eos8fth/model/framework/code/main.py", line 71, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "eos8fth/model/framework/code/main.py", line 45, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 80, in get_predictions
    features_dictn = automate(temp_dir, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 35, in automate
    dictn = json.load(open('dictn_models_fp.json', 'r'))
FileNotFoundError: [Errno 2] No such file or directory: 'dictn_models_fp.json'

This is odd, since dictn_models_fp.json is in the same directory as run_predictions.py:

image

It should be the same relative path as the source code, since the two files are also in the same folder there:

image

And the source code worked with no errors when I ran it. I have been looking into this issue for almost two hours now, looking at similar issues on the internet and trying different paths, with little success. I tried copying the relative path (according to VS Code), which still gave me the issue shown below:

  File "eos8fth/model/framework/code/main.py", line 71, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "eos8fth/model/framework/code/main.py", line 45, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 80, in get_predictions
    features_dictn = automate(temp_dir, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 35, in automate
    dictn = json.load(open("model/framework/code/dictn_models_fp.json", 'r'))
FileNotFoundError: [Errno 2] No such file or directory: 'model/framework/code/dictn_models_fp.json'

I am feeling a little stuck on this issue and would appreciate any advice. I am going to try using the absolute path just to see if it works, but that will not be sustainable, since the absolute path is from my desktop and is not universally available.
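
A more portable option than a machine-specific absolute path is to resolve the file relative to the script itself rather than the current working directory, which is usually why a bare 'dictn_models_fp.json' fails when the model is launched from elsewhere. A sketch, assuming the JSON sits next to run_predictions.py as in the screenshots:

import json
import os

# Directory that contains this script, independent of where it is invoked from
CODE_DIR = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(CODE_DIR, "dictn_models_fp.json"), "r") as fh:
    dictn = json.load(fh)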

simrantan commented 1 year ago

With Zakia's help, I've explored the FileNotFoundError, and we found that when she used the absolute path for dictn_models_fp.json, it did not throw the error. I implemented this in the code by adding file_path = os.path.abspath('eos8fth/model/framework/code/dictn_models_fp.json') to get the absolute path and then passing it into the code as the path for the file. This worked, and I'm now working on this error I received:

Traceback (most recent call last):
  File "/home/simran/eos8fth/model/framework/code/get_features.py", line 34, in get_fingerprints
    fp = fpFunc_dict[fp_name](m)
  File "/home/simran/eos8fth/model/framework/code/config.py", line 81, in <lambda>
    fpFunc_dict['tpatf'] = lambda m: get_tpatf(m)
  File "/home/simran/eos8fth/model/framework/code/config.py", line 54, in get_tpatf
    tpatf_arr = tpatf_arr.reshape(1, tpatf_arr.shape[0])
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eos8fth/model/framework/code/main.py", line 71, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "eos8fth/model/framework/code/main.py", line 45, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 82, in get_predictions
    features_dictn = automate(temp_dir, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 39, in automate
    pharmacophore = fg.get_fingerprints(stand_df, 'dummy_name', 'tpatf', 'dummy_split', 'dummpy_numpy_folder')
  File "/home/simran/eos8fth/model/framework/code/get_features.py", line 41, in get_fingerprints
    add = [np.nan for i in range(self.fingerprints[0].shape[1])]

My current belief is that this error is likely caused by some missing data that this function is attempting to access, which is present in the source code but not in the eos template. I am planning to look into this issue and find the missing information. Apologies for the time it is taking to debug the model; it is somewhat complex, and I am trying to get through the bug-fixing as quickly as possible.

I am also working on the fact that 32 of the 78 lfs files are not being tracked as pointers. I have tried to add these files individually, but ls-files shows they are not being added to the "staged to commit" category. I am unsure why these files are not being added, as the command I used was git add models_tuned_best, which worked on half the files in that folder but somehow failed on the other half. I may go through the process of disabling and re-enabling lfs to see if I can fix it by starting over.
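
As a quick diagnostic, it is possible to check from Python which files in the working copy are currently stored as git-lfs pointer stubs (pointer files start with the text header version https://git-lfs.github.com/spec/v1) versus full binaries; this only inspects the working tree, so it is a rough check rather than a fix, and the folder name is illustrative:

import os

def split_pointers_from_binaries(folder):
    # Return (pointer_files, binary_files) based on the git-lfs pointer header.
    pointers, binaries = [], []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            head = fh.read(len(b"version https://git-lfs"))
        (pointers if head == b"version https://git-lfs" else binaries).append(name)
    return pointers, binaries

pointers, binaries = split_pointers_from_binaries("models_tuned_best")
print(len(pointers), "pointer stubs;", len(binaries), "full binaries")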

simrantan commented 12 months ago

After a lot of hunting, I found that the root of the error I was receiving was that the file TopologicalPharmacophoreAtomTripletsFingerprints.pl could not work because it was missing a Perl environment. I spent some time reading the source code to see how they set up the Perl environment and found that it lives in the Mayachemtools directory, from which the file accesses it. While looking into this, I also found that this file requires numerous other files that I had not transferred. Given all these dependencies, and that the dependencies have their own dependencies within Mayachemtools, I decided to port the whole library over. I tested it and, for some reason, got the same error, then realized the cp command had failed to include the "lib" directory that contains some of the necessary files, because of a gitignore rule on it. I added the missing folder, removed the gitignore entry, and plan to test again on Monday.

GemmaTuron commented 12 months ago

Hi @simrantan

Did you check again? If adding the dependencies worked, please open a PR

simrantan commented 12 months ago

I have fixed all of the filepath errors and gotten the model to run successfully! I also managed to remove the text being printed to the terminal by suppressing the subprocess output from one of the models. I then found that the output function was not working correctly and that there was also an extra file being created (an interim file called consensus_files.csv).

I was a little stumped by this issue, since I was unsure where the error was. After looking at consensus_files.csv, I found that this file was the output we were looking for, and that the error was not in how I was processing the output of the source code but in my output adapter function. I spent some time debugging the output adapter and finally got the desired output to the right destination! I am now looking at the sizes of the files in the checkpoints folder, as Miquel recommended that if the files are smaller than 50 or 25 MB I do not need to deal with lfs. I will also conduct final tests (right now I am using bash; I will soon fetch and test the model locally to make sure everything is good to go).

simrantan commented 12 months ago

I tested the model using ersilia fetch, and ran into a model_package install error. After looking through the error log, I found that it was an issue with rdkit:

rdkit==2020.03.1 (from versions: 2022.3.3, 2022.3.4, 2022.3.5, 2022.9.1, 2022.9.2, 2022.9.3, 2022.9.4, 2022.9.5, 2023.3.1b1, 2023.3.1, 2023.3.2)
ERROR: No matching distribution found for rdkit==2020.03.1
Collecting numpy==1.19.2
  Using cached numpy-1.19.2-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
Successfully installed numpy-1.19.2
Collecting hypopt==1.0.9
  Using cached hypopt-1.0.9-py2.py3-none-any.whl (13 kB)
Collecting scikit-learn>=0.18
  Using cached scikit_learn-1.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.8 MB)

So, I looked up compatible rdkit versions and found that version 2023.3.1 seemed to work. I then received this error:


Traceback (most recent call last):
  File "pack.py", line 2, in <module>
    from src.service import load_model
  File "/home/simran/eos/dest/eos8fth/src/service.py", line 3, in <module>
    from bentoml import BentoService, api, artifacts
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/bentoml/__init__.py", line 28, in <module>
    from bentoml.service import (  # noqa: E402
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/bentoml/service/__init__.py", line 38, in <module>
    from bentoml.service.inference_api import InferenceAPI
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/bentoml/service/inference_api.py", line 24, in <module>
    import flask
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/flask/__init__.py", line 14, in <module>
    from jinja2 import escape
ImportError: cannot import name 'escape' from 'jinja2' (/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/jinja2/__init__.py)

02:17:09 | DEBUG    | Activation done
02:17:09 | DEBUG    | Previous command successfully run inside eos8fth conda environment
02:17:09 | DEBUG    | Now trying to establish symlinks
02:17:09 | DEBUG    | BentoML location is None
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

I looked through the error log further back, and I found that this is likely the root:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
docker 6.1.3 requires requests>=2.26.0, but you have requests 2.24.0 which is incompatible.
docker 6.1.3 requires urllib3>=1.26.0, but you have urllib3 1.25.11 which is incompatible.
Successfully installed chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.11
Collecting pubchempy==1.0.4
  Using cached PubChemPy-1.0.4-py3-none-any.whl
Installing collected packages: pubchempy
Successfully installed pubchempy-1.0.4
Collecting func_timeout==4.3.5
  Using cached func_timeout-4.3.5-py3-none-any.whl
Installing collected packages: func_timeout
Successfully installed func_timeout-4.3.5
Collecting xgboost==1.0.2
  Using cached xgboost-1.0.2-py3-none-manylinux1_x86_64.whl (109.7 MB)
Requirement already satisfied: scipy in /home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages (from xgboost==1.0.2) (1.7.3)
Requirement already satisfied: numpy in /home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages (from xgboost==1.0.2) (1.19.2)
Installing collected packages: xgboost
Successfully installed xgboost-1.0.2
Collecting scikit-learn==0.22.1
  Using cached scikit_learn-0.22.1-cp37-cp37m-manylinux1_x86_64.whl (7.0 MB)
Requirement already satisfied: joblib>=0.11 in /home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages (from scikit-learn==0.22.1) (1.3.1)
Requirement already satisfied: numpy>=1.11.0 in /home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages (from scikit-learn==0.22.1) (1.19.2)
Requirement already satisfied: scipy>=0.17.0 in /home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages (from scikit-learn==0.22.1) (1.7.3)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
Successfully installed scikit-learn-0.22.1

The original source code uses a yml file, which looks like this:

name: redial-2020
channels:
  - rdkit
  - conda-forge
dependencies:
  - rdkit::rdkit==2020.03.1
  - python=3.7.9
  - pip=20.2.3
  - pip:
    - -r requirements.txt

The dockerfile is the rdkit install plus all of the statements in requirements.txt, and I am working on finding out why environment.yml works for the source code while this dockerfile causes install errors - I will update with more findings soon.

simrantan commented 11 months ago

@GemmaTuron

Once this dockerfile issue is wrapped up, I will be ready to submit a PR. Right now, I am dealing with this issue:

docker 6.1.3 requires requests>=2.26.0, but you'll have requests 2.24.0 which is incompatible.
docker 6.1.3 requires urllib3>=1.26.0, but you'll have urllib3 1.25.11 which is incompatible.

I have also included the whole error log: eos8fth_dependency_log.txt

It seems that some of the packages the model requires conflict with the ones docker needs - is there a known workaround for this? Once I have it, this model will be complete!

simrantan commented 11 months ago

Thanks for all the tips in the meeting today! I have been working on fixing the dependency issues, and removing the pinned version of requests helped. Another issue came up with flask and jinja2, so I removed those version pins as well. The dockerfile now looks like this:


RUN pip install rdkit==2023.3.1
RUN pip install numpy==1.19.2
RUN pip install hypopt==1.0.9
RUN pip install argparse==1.4.0
RUN pip install tqdm==4.49.0
RUN pip install flask
RUN pip install cairosvg==2.4.2
RUN pip install requests
RUN pip install pubchempy==1.0.4
RUN pip install func_timeout==4.3.5
RUN pip install xgboost==1.0.2
RUN pip install scikit-learn==0.22.1
RUN pip install pandas==1.1.2

WORKDIR /repo
COPY . /repo

I am now getting this error: modulelog.txt

  File "/home/simran/miniconda3/envs/ersilia/bin/ersilia", line 33, in <module>
    sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 138, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 115, in wrapper
    return_value = func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/ersilia/ersilia/cli/commands/fetch.py", line 73, in fetch
    _fetch(mf, model_id)
  File "/home/simran/ersilia/ersilia/cli/commands/fetch.py", line 12, in _fetch
    mf.fetch(model_id)
  File "/home/simran/ersilia/ersilia/hub/fetch/fetch.py", line 348, in fetch
    self._fetch_not_from_dockerhub(model_id=model_id)
  File "/home/simran/ersilia/ersilia/hub/fetch/fetch.py", line 302, in _fetch_not_from_dockerhub
    self._sniff()
  File "/home/simran/ersilia/ersilia/hub/fetch/fetch.py", line 226, in _sniff
    sn = ModelSniffer(self.model_id, self.config_json)
  File "/home/simran/ersilia/ersilia/hub/fetch/actions/sniff.py", line 69, in __init__
    eg = ExampleGenerator(model_id, config_json=config_json)
  File "/home/simran/ersilia/ersilia/io/input.py", line 180, in __init__
    self.IO = BaseIOGetter(config_json=config_json).get(model_id)
  File "/home/simran/ersilia/ersilia/io/input.py", line 68, in get
    return self._get_from_model(model_id=model_id)
  File "/home/simran/ersilia/ersilia/io/input.py", line 54, in _get_from_model
    return importlib.import_module(module, package="ersilia.io").IO(
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ersilia.io.types.smiles'

This now indicates an issue with ersilia not finding a module.

I have looked in ersilia, and this module does indeed not exist:

image

I am trying to find the source of where the command looking for ersilia.io.types.smiles is, and wanted to know if anyone had an idea of why it might be looking for this nonexistent file.

simrantan commented 11 months ago

I resolved that error! The model now runs; it is just producing an empty-output error, which I am looking into.

EmptyOutputError

Detailed error:
Model API eos8fth:run did not produce an output
Traceback (most recent call last):
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/main.py", line 75, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/main.py", line 49, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/run_predictions.py", line 82, in get_predictions
    features_dictn = automate(temp_dir, csv_file)
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/run_predictions.py", line 44, in automate
    features_rdkit = fg.get_fingerprints(stand_df, k, 'rdkDes', 'dummy_split', 'dummpy_numpy_folder')
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/get_features.py", line 68, in get_fingerprints
    X = rdkDes_scaler.transform(X)
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 414, in transform
    X *= self.scale_
ValueError: operands could not be broadcast together with shapes (2,209) (200,) (2,209)

Hints:

I am working on finding why the input shapes might be incorrect.

GemmaTuron commented 11 months ago

Hi @simrantan

It seems the transform is not getting the right X - could you add print statements so that we can see the different outputs? Also, are we sure the changes in the rdkit version are not causing a different number of features? I have updated the workflows on this model.
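
A quick way to check both at once could be to compare what the installed rdkit exposes with what the pre-fitted scaler expects (a sketch; the scaler path is illustrative, and the scale_ attribute is taken from the traceback above):

import pickle

from rdkit.Chem import Descriptors

# How many descriptors the installed rdkit version exposes
print("rdkit descriptors available:", len(Descriptors.descList))

# How many features the pre-fitted scaler was trained on (path is illustrative)
with open("model/framework/code/scalers/rdkDes_scaler.pkl", "rb") as fh:
    rdkDes_scaler = pickle.load(fh)
print("features the scaler expects:", rdkDes_scaler.scale_.shape[0])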

GemmaTuron commented 11 months ago

Hi @simrantan When you can, please update the issue with the current work you are doing. Were you able to pin down the rdkit descriptor list that works for this model?

simrantan commented 11 months ago

Hi @GemmaTuron

I prioritized testing and other models last week, since communication with Zakia about this issue was slowed by time zones, but I have now looked at the number of descriptors available in every pip-installable version of rdkit and rdkit-pypi on Python 3.7, and none of them have 200 descriptors. I am working on the suggested solution (manually restricting a version with 208/209 descriptors to the 200 that are needed), but I also wanted to ask about an alternate solution:

One is copying the environment created by the yml file in the source code and using commands in the dockerfile to recreate this conda environment. I created the conda environment using the yml and exported it to a file called environment-spec.yml using this command: conda env export --no-builds > environment-spec.yml

Now, in the dockerfile I should be able to use this:

COPY environment-spec.yml .
RUN conda env create -f environment-spec.yml
RUN echo "source activate redial-2020" > ~/.bashrc

to make the environment the source code uses. However, I wasn't sure if this was compatible with the ersilia format for the dockerfile and wanted to check before I implement this.

The other solution is using rdkit with more descriptors than needed and finding a way to specify the 200 that this code needs.

GemmaTuron commented 11 months ago

I think the second option ("using rdkit with more descriptors than needed and finding a way to specify the 200 that this code needs") will be the more definitive solution.
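
One way to pin an exact descriptor list (and its order) is rdkit's MolecularDescriptorCalculator, which computes only the named descriptors in the order they are given; a sketch, where selected_descriptors_rdkDes stands for the 200-name list used at training time:

from rdkit import Chem
from rdkit.ML.Descriptors import MoleculeDescriptors

# selected_descriptors_rdkDes: the 200 descriptor names the scalers/models expect,
# in the original training order (defined elsewhere).
calc = MoleculeDescriptors.MolecularDescriptorCalculator(selected_descriptors_rdkDes)

mol = Chem.MolFromSmiles("CCO")
values = calc.CalcDescriptors(mol)  # tuple of 200 values, in the requested order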

simrantan commented 11 months ago

I implemented the restriction to 200 descriptors and found a version of rdkit with the same descriptors (plus 8 more that get ignored) as the original version the code uses. I've been working on debugging this error:

Ersilia exception class:
EmptyOutputError

Detailed error:
Model API eos8fth:run did not produce an output
Traceback (most recent call last):
  File "/home/simran/eos/repository/eos8fth/20230810214658_64FD17/eos8fth/artifacts/framework/code/main.py", line 75, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "/home/simran/eos/repository/eos8fth/20230810214658_64FD17/eos8fth/artifacts/framework/code/main.py", line 49, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos/repository/eos8fth/20230810214658_64FD17/eos8fth/artifacts/framework/code/run_predictions.py", line 98, in get_predictions
    model = pickle.load(open(models_tuned_dir + '/' + fp_name + '-' + m + '-balanced_randomsplit7_70_15_15.pkl', 'rb'))
_pickle.UnpicklingError: invalid load key, 'v'.

Error log: eos8ftherror.txt

I have experimented and found that when I use a hardcoded path in pickle.load() there is no unpickling error, and the lines of code after this step run, which suggests that it works (a different error appears after running the next few lines, but that is because the hardcoded single model is incorrect for the purpose of the function). I am trying to figure out what is wrong with my dynamic path that makes it malfunction while the hardcoded path works just fine. I am currently working on this, adding print statements to see where the expected path and the constructed path start to differ.
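
One way to pin down the difference could be to print and sanity-check the constructed path right before pickle.load (a sketch; models_tuned_dir, fp_name and m are the names that appear in the traceback and are defined in run_predictions.py). As a side note, "invalid load key, 'v'" is also the classic symptom of unpickling a git-lfs pointer stub, whose content starts with the text "version https://git-lfs...", so a tiny file size here would point back at the lfs issue:

import os
import pickle

model_path = os.path.join(
    models_tuned_dir,  # taken from the traceback; defined in run_predictions.py
    fp_name + '-' + m + '-balanced_randomsplit7_70_15_15.pkl',
)
print("loading:", repr(model_path),
      "exists:", os.path.exists(model_path),
      "size:", os.path.getsize(model_path) if os.path.exists(model_path) else None)
with open(model_path, "rb") as fh:
    model = pickle.load(fh)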

GemmaTuron commented 11 months ago

Hi @simrantan

Can you clarify what you mean when you say "+ 8 more that get ignored"? Can you show me the piece of code where you are restricting the descriptors to 200? Regarding the paths, please remember to always use absolute paths.

simrantan commented 11 months ago

Yes, this is the code:

 def select_descriptors(self, mol, selected_descriptor_names):
        descriptor_names = [desc[0] for desc in Descriptors.descList]
        selected_indices = [descriptor_names.index(name) for name in selected_descriptor_names]
        descriptors = [Descriptors.descList[i][1](mol) for i in selected_indices]
        return descriptors

    def get_fingerprints(self, df, model, fp_name, split, numpy_folder):

        smiles_list = df['SMILES_stand'].to_list()

        not_found = []
        selected_descriptors_rdkDes = ['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']

        for smi in smiles_list:
            try: 
                m = Chem.MolFromSmiles(smi)

                can_smi = Chem.MolToSmiles(m, True)
                if fp_name == 'rdkDes':
                    fp = self.select_descriptors(m, selected_descriptors_rdkDes)
                else:
                    fp = fpFunc_dict[fp_name](m)
                bit_array = np.asarray(fp)
                self.fingerprints.append(bit_array)
            except:
                not_found.append(smi)

                if fp_name == 'tpatf':
                    add = [np.nan for i in range(self.fingerprints[0].shape[1])]
                elif fp_name == 'rdkDes':
                    add = [np.nan for i in range(len(selected_descriptors_rdkDes))]
                else:
                    add = [np.nan for i in range(len(self.fingerprints[0]))]
                tpatf_arr = np.array(add, dtype=np.float32)
                self.fingerprints.append(tpatf_arr) 

                pass

"select_descriptors" takes a list of descriptors and returns only the selected ones. The rest of the code is additions in the exisiting functions. selected_descriptors_rdkDes is the list of descriptors in the original version of rdkit, and I run the function if fp_name=rdkDes to get the descriptors as fp instead of fpFunc_dict to override the current version of rdkit's descriptors being used.

Thank you for the reminder; I am using absolute paths! I think it is the later part of the path (the file name) that is the root of the issue, since if the path were not found I would receive a file-not-found error instead of the invalid load key error.

GemmaTuron commented 11 months ago

Hi @simrantan

To be sure, once you solve the path issue, please add print statements and check the descriptors that are being generated - it is very easy to mess up the order, for example. Paste the results here, thanks!

simrantan commented 11 months ago

Hi, I have solved the pickle.load issue! I decided to debug this by creating the original environment from the source code (to rule out dependency issues as the cause) and found the error was still occurring. After some trial and error with the paths, I took a look at the descriptors, since they were the only major change before this error appeared. They turned out to be the source: I think the problem was that my list of descriptors was stored as plain strings while Descriptors.descList pairs each name with a function object, so comparing or searching between the two the way I had written it did not work. I ended up changing my approach to getting the descriptors, since multiple issues were arising from the previous version (I think because the if fp_name == 'rdkDes' check was used in several places and disrupted some of the original functionality), so I started from scratch with a simpler, more compact solution:


descriptor_names =  ['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']
descriptor_funcs = {desc_name: getattr(Descriptors, desc_name) for desc_name in descriptor_names}

class FeaturesGeneration:
    def __init__(self):
        self.fingerprints = []
    def get_fingerprints(self, df, model, fp_name, split, numpy_folder):

        smiles_list = df['SMILES_stand'].to_list()

        not_found = []
        for smi in smiles_list:
            try: 
                m = Chem.MolFromSmiles(smi)

                can_smi = Chem.MolToSmiles(m, True)

                # get the 200 descriptos
                descriptor_values = []
                for descriptor_name in descriptor_names:
                    descriptor_func = descriptor_funcs.get(descriptor_name)
                    if descriptor_func:
                        descriptor_value = descriptor_func(m)
                        descriptor_values.append(descriptor_value)
                    else:
                        descriptor_values.append(np.nan)
                print("descriptor_names:")
                print("Descriptor_values:")
                self.fingerprints.append(descriptor_values)
            except:
                self.fingerprints.append([np.nan] * len(descriptor_names))

                if fp_name == 'tpatf':
                    add = [np.nan for i in range(self.fingerprints[0].shape[1])]
                elif fp_name == 'rdkDes':
                    add = [np.nan for i in range(len(self.fingerprints[0]))]
                else:
                    add = [np.nan for i in range(len(self.fingerprints[0]))]
                tpatf_arr = np.array(add, dtype=np.float32)
                self.fingerprints.append(tpatf_arr) 

                pass

        self.fingerprints = np.array(self.fingerprints)  

After debugging a bit, the pickle.load error stopped happening and the descriptor names were being found (after some module-not-found errors). I am now receiving this error:

Traceback (most recent call last):
  File "eos8fth/model/framework/code/main.py", line 75, in <module>
    output_consensus = my_model("tmp_input.smi")
  File "eos8fth/model/framework/code/main.py", line 49, in my_model
    get_predictions(temp_dir, temp_results_folder, csv_file)
  File "/home/simran/eos8fth/model/framework/code/run_predictions.py", line 100, in get_predictions
    y_pred = model.predict(X_true)
  File "/home/simran/miniconda3/envs/redial-2020/lib/python3.7/site-packages/hypopt/model_selection.py", line 400, in predict
    return self.model.predict(X)
  File "/home/simran/miniconda3/envs/redial-2020/lib/python3.7/site-packages/sklearn/svm/_base.py", line 594, in predict
    y = super().predict(X)
  File "/home/simran/miniconda3/envs/redial-2020/lib/python3.7/site-packages/sklearn/svm/_base.py", line 315, in predict
    X = self._validate_for_predict(X)
  File "/home/simran/miniconda3/envs/redial-2020/lib/python3.7/site-packages/sklearn/svm/_base.py", line 467, in _validate_for_predict
    (n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 200 should be equal to 1024, the number of features at training time

Unfortunately, this is all the error log I have, since I am running this using bash in the original environment (once I know the edits that make the necessary descriptors work in the source code environment, I can run it in ersilia; the dependency issues there make it more difficult to debug properly). I may need to revisit my approach to getting the 200 descriptors to solve this new issue and will update more soon!

GemmaTuron commented 11 months ago

Hi @simrantan

This is pointing to the descriptors: they used 1024 features, so 1024 descriptor values per molecule as X. Where did you get that there were only 200? Maybe they are using more than one set of descriptors?

GemmaTuron commented 11 months ago

Indeed with a quick look I can see here at least more descriptors being used from MayaTools: https://github.com/sirimullalab/redial-2020/blob/v1.0/config.py

@simrantan make sure to revise and understand all the steps taken in the original code to train and run the models

simrantan commented 11 months ago

Hi, I got the number 200 from here:

  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/run_predictions.py", line 44, in automate
    features_rdkit = fg.get_fingerprints(stand_df, k, 'rdkDes', 'dummy_split', 'dummpy_numpy_folder')
  File "/home/simran/eos/repository/eos8fth/20230728000937_87271D/eos8fth/artifacts/framework/code/get_features.py", line 68, in get_fingerprints
    X = rdkDes_scaler.transform(X)
  File "/home/simran/miniconda3/envs/eos8fth/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 414, in transform
    X *= self.scale_
ValueError: operands could not be broadcast together with shapes (2,209) (200,) (2,209)

And from running this script:

from rdkit import Chem
from rdkit.Chem import Descriptors

keys = [x[0] for x in Descriptors.descList if x[1] is not None]  # Filter out None entries
with open("RDkit_official.txt", "w") as output:
    output.write(str(keys))

desc = dict(Descriptors.descList)

print(len(desc))

keys.sort()
with open("RDkit_sorted_new.txt", "w") as output:
    output.write(str(keys))

This returns a list of 200 descriptors for the rdkit in the environment. I think the issue is that 200 are required for prediction, but 1024 are required for training. Using rdkit unedited, there seems to be no issue, so I will revise my method of limiting descriptors so that it does not interfere with the training setup.

GemmaTuron commented 11 months ago

Hi @simrantan

The same number of features used at training time will be required for prediction. When using rdkit unedited, you were probably collating the rdkit fps with the rest of the features to get to 1024, and maybe you modified that part of the code? We are not training any model here, just predicting, so it should not cause issues with training methods.

simrantan commented 11 months ago

Hi,

That makes sense - I am going through the original file and my changes to see if my code has made any major functionality changes like that. I am running and printing outputs from the original version and my changed one to see where discrepancies start. I spent today focusing on eos1vms (and finally fixed it!) and have just started working on this - I will likely have more updates to provide tomorrow!

simrantan commented 11 months ago

@GemmaTuron

I found the root of this error - I should have made the descriptors change in config.py, not get_features. I have returned get_features to the original code, and added the descriptor selection to config.py.

While testing, I ran into pickle issues again. I have been working on debugging this and found, by printing out the file names, that a specific file had an issue: several pkl files would work, then one would fail and the program would abort. I wrote an independent script to check whether the file loaded in isolation (to check if it was an issue with the model or the code, since after some research I found the file could have been corrupted during transfer). I found that the file also threw an error in isolation, which means it is not an issue with the eos8fth model code but with the file itself being corrupted. I deleted and re-transferred the file, and then that pickle file loaded successfully! But another file then threw an error. It appears that some pkl files have somehow been corrupted, so I am identifying and replacing these corrupted files. I will update on how the model is working once I have fixed this!
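
A sketch of the kind of standalone check described above, loading every .pkl in a folder and reporting the ones that fail (the folder name is illustrative):

import glob
import os
import pickle

def check_pickles(folder):
    # Try to unpickle every .pkl file and report the ones that raise an error.
    bad = []
    for path in sorted(glob.glob(os.path.join(folder, "*.pkl"))):
        try:
            with open(path, "rb") as fh:
                pickle.load(fh)
        except Exception as exc:
            bad.append((os.path.basename(path), repr(exc)))
    return bad

for name, err in check_pickles("models_tuned_best"):
    print(name, err)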

simrantan commented 11 months ago

@GemmaTuron

I fixed the corrupted files and the model works! It fetched, served, and ran successfully. This is the output on 20 SMILES: outputformat.csv. I also tested it on eml_canonical to make sure it works on varied inputs and have submitted a PR.

GemmaTuron commented 11 months ago

Hi @simrantan

I have worked on the model. I've done a few changes, the most important are:

simrantan commented 11 months ago

@GemmaTuron Thank you! I had no idea the results were inconsistent. I have looked through the changes in the code and tested it twice as well; the results look consistent and it is very functional. Thank you so much!