Clean UP & Dockerization eos3ae7

emmakodes commented 1 year ago

Hello @GemmaTuron, I started with this model. I first changed conda install -c conda-forge rdkit=2021.03.4 to pip install rdkit==2023.3.1 to avoid any dependency conflict. The model then raised a ModuleNotFoundError for the following packages: pandas, yaml, keras, tensorflow and matplotlib. I discovered that the model needs a specific version of keras to work properly; currently, I am testing with keras==2.0.6. I removed the matplotlib import from the code and didn't install it, since Ersilia doesn't need to plot graphs.
Subsequently, the model raised the following error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0). I had to replace exp.json and zinc.json with the ones from the model's original repo, since the copies here seem to be corrupt.
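
A quick way to confirm which files fail to parse is sketched below. It is only an illustration: find_invalid_json is a hypothetical helper, not part of the model code, and the file names passed to it are whatever paths need checking.

```python
import json


def find_invalid_json(paths):
    """Return a {path: error message} dict for files that cannot be parsed as JSON."""
    problems = {}
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, json.JSONDecodeError) as err:
            # Record the parser's message, which includes the failing position
            problems[str(path)] = str(err)
    return problems
```

For example, find_invalid_json(["exp.json", "zinc.json"]) returns an empty dict when both files parse. An empty file is reported with the same "Expecting value: line 1 column 1 (char 0)" message as above.
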
Another issue arose: AttributeError: 'str' object has no attribute 'decode'. I had to install h5py==2.10.0 to resolve it.
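
A likely explanation, assuming the error comes from code that calls .decode() on values read from an HDF5 file: h5py 3.x returns string attributes as str, whereas h5py 2.x returned bytes, so code written for 2.x breaks. Pinning h5py==2.10.0 avoids this. The sketch below shows the alternative of tolerating both types; as_text is a hypothetical helper, not something in the model or in Keras.

```python
def as_text(value, encoding="utf-8"):
    """Return value as str whether it arrives as bytes or as an already-decoded str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Calling .decode() on a str reproduces the error message from the traceback
try:
    "model_config".decode("utf-8")
except AttributeError as err:
    message = str(err)
```
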
The model then raised the following errors: NameError: name 'num_noise_dict' is not defined, NameError: name 'num_molecules_dict' is not defined, and NameError: name 'iteration' is not defined. I discovered that num_noise_dict, num_molecules_dict and iteration were used in the code but never defined, so I removed the lines where they were used.
Then the model fetched successfully. When I serve it and make a prediction with the command ersilia -v api predict -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]", the model produces the following output:

{
    "input": {
        "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
        "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
        "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
    },
    "output": {
        "outcome": [
            0.0,
            null,
            null
        ]
    }
}

Notice that the outcome is 0.0 instead of generated molecules. When I run bash run_predict.sh . test.csv output.csv, I do get generated molecules (output.csv). I guess Ersilia is somehow not picking up the generated molecules. I am working on finding a fix.

GemmaTuron commented 1 year ago

@emmakodes

Could you point me to the code that you removed? Perhaps these lines were necessary to run the model properly.

emmakodes commented 1 year ago

Hello @GemmaTuron

In main.py I removed num_noise_dict[smi] = [] (line 38), num_molecules_dict[smi] = [] (line 39), and the (iteration < MAX_ITER) condition in line 40.

In summary, I edited the following code (from line 37 to 40) in main.py:

for smi in smiles_list:
    num_noise_dict[smi] = []
    num_molecules_dict[smi] = []
    while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):

to:

for smi in smiles_list:
    while(len(smi_gen_dict[smi][0]) < num_molecules_gen):

num_noise_dict, num_molecules_dict and iteration were used but never defined.

GemmaTuron commented 1 year ago

Hi @emmakodes

We will need to define a stopper to avoid entering an infinite loop, so we need to add the iteration parameter rather than remove it. Could you work on that? For the dicts, I think it is fine to remove them for the moment.

Could you also add print statements so that, when we run it with Ersilia, we can see whether molecules are being generated but simply not passed to the output?

emmakodes commented 1 year ago

Okay, @GemmaTuron thanks. Let me work on your suggestions.

emmakodes commented 1 year ago

Hello @GemmaTuron

Molecules were actually being generated in Ersilia, but service.py was selecting the index column instead of the column that holds the generated molecules. Also, service.py was saving each generated molecule as a float instead of a string.

So I changed line 91 in service.py from R += [{"outcome": [Float(x) for x in r]}] to R += [{"generated_molecules": [str(r[2])]}]
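
The effect of that change can be illustrated on a single row. This is a sketch under stated assumptions, not the actual service.py: the row layout (index, input SMILES, generated molecules) is inferred from the use of r[2], and to_float_or_none is a hypothetical stand-in for the Float helper, assumed to return None for values that cannot be converted.

```python
def to_float_or_none(x):
    # Hypothetical stand-in for the Float helper in service.py (assumed behaviour)
    try:
        return float(x)
    except (TypeError, ValueError):
        return None


def old_outcome(row):
    # Converts every column to float: only the index column survives
    return {"outcome": [to_float_or_none(x) for x in row]}


def new_outcome(row):
    # Keeps the third column (the generated molecules) as a string
    return {"generated_molecules": [str(row[2])]}


# Hypothetical row: index, input SMILES, generated molecules
row = ["0", "CCO", "['CCN', 'CCC']"]
```

Under these assumptions, old_outcome(row) gives an outcome of [0.0, None, None], which serializes to the [0.0, null, null] seen in the earlier prediction, while new_outcome(row) returns the generated molecules as text.
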

I now set the iteration parameter to 0 for each input molecule and increment it after each pass of the inner for loop, so the while loop only continues while its value is less than MAX_ITER. The code is as follows:

for smi in smiles_list:
    iteration = 0
    while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):
        for i in range(len(noise_list)):
            smi_canon = mu.canon_smiles(smi)
            smi_X = vae.smiles_to_hot(smi_canon, canonize_smiles=True)
            smi_z = vae.encode(smi_X)
            df = vae.z_to_smiles(smi_z, decode_attempts=250, noise_norm=noise_list[i])
            smi_gen_dict[smi][0] += df.smiles.values.tolist()
            smi_gen_dict[smi][0] = list(set(smi_gen_dict[smi][0])) #Avoid repeat molecules
        iteration += 1

Currently, the model is working and generating molecules, even when run through Ersilia. Here are an input file and the prediction I got: input.csv eos3ae724_cli_pred.csv

GemmaTuron commented 1 year ago

Fantastic @emmakodes, good job. Is there a limit on the number of molecules that are generated?

emmakodes commented 1 year ago

num_molecules_gen = 20 is defined in main.py. It is used in the while condition, so the model stops generating molecules once the number of generated molecules is equal to or greater than 20. The final number can be greater than 20, because the condition is only checked after a full pass over noise_list.

num_molecules_gen = 20
MAX_ITER = 5 #To avoid infinite loop

for smi in smiles_list:
    iteration = 0
    while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):
        for i in range(len(noise_list)):
            smi_canon = mu.canon_smiles(smi)
            smi_X = vae.smiles_to_hot(smi_canon, canonize_smiles=True)
            smi_z = vae.encode(smi_X)
            df = vae.z_to_smiles(smi_z, decode_attempts=250, noise_norm=noise_list[i])
            smi_gen_dict[smi][0] += df.smiles.values.tolist()
            smi_gen_dict[smi][0] = list(set(smi_gen_dict[smi][0])) #Avoid repeat molecules
        iteration += 1
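
The stopping behaviour can be checked in isolation with a stand-in sampler. The sketch below is only an illustration: generate_bounded and the sample_batch callback are hypothetical names, not part of main.py, and the VAE calls are replaced by a plain function.

```python
import itertools


def generate_bounded(sample_batch, noise_list, target=20, max_iter=5):
    """Collect unique items until `target` is reached or `max_iter` passes are done."""
    found = []
    iteration = 0
    while len(found) < target and iteration < max_iter:
        for noise in noise_list:
            found += sample_batch(noise)
            # De-duplicate while keeping first-seen order
            found = list(dict.fromkeys(found))
        iteration += 1
    return found, iteration


counter = itertools.count()


def fifteen_new(noise):
    # Stand-in sampler: always returns 15 previously unseen items
    return ["mol%d" % next(counter) for _ in range(15)]


def nothing(noise):
    # Stand-in sampler: never produces anything
    return []
```

With fifteen_new and two noise levels, the loop stops after one pass with 30 items, which is why the count can exceed num_molecules_gen. With nothing, it stops after max_iter passes instead of looping forever. The sketch uses dict.fromkeys rather than list(set(...)) so that the order of the results is reproducible, which matters if the list is later truncated.
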

emmakodes commented 1 year ago

Hello @GemmaTuron, with the current implementation of the code, the number of generated molecules for a single input SMILES can be 20 or more. If we want a limit, I will have to count the generated molecules and cap them at 20.

GemmaTuron commented 1 year ago

Hi @emmakodes

A limit of 20 sounds good. If there are fewer than 20 for some molecules, fill in the output csv with NaN values. The output shape will then be a List, not a Flexible List, since it will always have 20 values. Thanks!
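
A minimal sketch of the truncate-and-pad step described here, assuming the generated molecules for one input are held in a plain list. fixed_length is a hypothetical helper name, and None is used as the fill value on the assumption that the writer turns it into an empty cell.

```python
def fixed_length(molecules, size=20, fill=None):
    """Return exactly `size` entries: truncate long lists, pad short ones with `fill`."""
    trimmed = list(molecules)[:size]
    return trimmed + [fill] * (size - len(trimmed))
```

For example, fixed_length(["CCO", "CCN"]) returns the two molecules followed by 18 fill values, and a list of 25 molecules is cut to its first 20. Python's csv.writer writes None as an empty string, which pandas reads back as NaN by default.
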

emmakodes commented 1 year ago

Okay, thanks @GemmaTuron, let me work on this.

emmakodes commented 1 year ago

Hello @GemmaTuron, the limit for each input molecule is now 20. I used smi_gen_dict[smi][0] = smi_gen_dict[smi][0][:20] in main.py to select the first 20 generated molecules.

I changed the Output Shape in metadata.json to List, since the model will always return 20 values.
When I make predictions via bash run.sh . test.csv output.csv, each generated molecule is placed in its own column:
output.csv

But when I make predictions via Ersilia with ersilia -v api run -i eml_canonical_simple.csv -o eos3ae7_pred.csv, the generated molecules are all placed in a single column instead of each in its own column. You can check the csv files to confirm what I mean. eos3ae7_pred.csv

I am not sure why this happens. You can also check my fork of the model to see the current working version.

GemmaTuron commented 1 year ago

@emmakodes

The code looks fine; perhaps the metadata is not being updated properly. Open a PR and we'll test once the changes are merged, to see whether it now picks up the List output.

emmakodes commented 1 year ago

Okay @GemmaTuron, I have opened the PR.

ersilia-os / eos3ae7

Clean UP & Dockerization eos3ae7 #1