Closed GemmaTuron closed 1 year ago
@emmakodes
Could you point me to the code that you removed? Perhaps these lines were necessary to run the model properly.
Hello @GemmaTuron
In main.py I removed line 38 (num_noise_dict[smi] = []), line 39 (num_molecules_dict[smi] = []), and the "and (iteration < MAX_ITER)" condition in line 40.
In summary, I edited the following code (from line 37 to 40) in main.py:
    for smi in smiles_list:
        num_noise_dict[smi] = []
        num_molecules_dict[smi] = []
        while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):
to:
    for smi in smiles_list:
        while(len(smi_gen_dict[smi][0]) < num_molecules_gen):
num_noise_dict, num_molecules_dict and iteration were used but were never defined.
Hi @emmakodes
We will need to define a stopper to avoid entering an infinite loop, so we need to add the iteration parameter rather than remove it. Could you work on that? For the dicts, I think it is fine to remove them for the moment.
Could you also add print statements so that, when we run it with Ersilia, we can see whether molecules are being generated but simply not being passed to the output?
Okay, @GemmaTuron thanks. Let me work on your suggestions.
Hello @GemmaTuron
service.py was not selecting the column that holds the generated molecules; it was selecting the index column instead. Also, service.py was saving each generated molecule as a float instead of a string. So, I refactored line 91 in service.py from
    R += [{"outcome": [Float(x) for x in r]}]
to
    R += [{"generated_molecules": [str(r[2])]}]
I also added the iteration counter with MAX_ITER as the stopper. Basically, the code is as follows:
    for smi in smiles_list:
        iteration = 0
        while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):
            for i in range(len(noise_list)):
                smi_canon = mu.canon_smiles(smi)
                smi_X = vae.smiles_to_hot(smi_canon, canonize_smiles=True)
                smi_z = vae.encode(smi_X)
                df = vae.z_to_smiles(smi_z, decode_attempts=250, noise_norm=noise_list[i])
                smi_gen_dict[smi][0] += df.smiles.values.tolist()
            smi_gen_dict[smi][0] = list(set(smi_gen_dict[smi][0])) #Avoid repeat molecules
            iteration += 1
Currently, the model is working and generating molecules even using ersilia. Here is an input file and prediction I got:
input.csv
eos3ae724_cli_pred.csv
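The service.py change described above can be illustrated with a small sketch. The three-column layout (row index, input SMILES, generated SMILES) and the sample rows are assumptions made for illustration; only the use of str(r[2]) and the generated_molecules key come from the change itself.

```python
import csv
import io

# Hypothetical output file: index, input SMILES, generated SMILES.
raw = "index,input,smiles\n0,CCO,CCN\n1,CCO,CCC\n"

reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row

R = []
for r in reader:
    # r[0] is the index column and r[1] the input molecule (assumed layout).
    # r[2] holds the generated molecule, which must stay a string.
    R += [{"generated_molecules": [str(r[2])]}]

# A SMILES string cannot be converted to a float at all.
try:
    float("CCN")
    smiles_is_float = True
except ValueError:
    smiles_is_float = False
```

In this sketch each row contributes one generated molecule; how the real service groups rows per input is not shown here.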
fantastic @emmakodes good job. Is there a limit on the number of molecules that are generated?
num_molecules_gen = 20 is defined in main.py. It is used in the while statement so that the model stops generating molecules once the number of generated molecules is equal to or greater than 20. Because the check only runs between passes, the number of generated molecules can be greater than 20:
    num_molecules_gen = 20
    MAX_ITER = 5 #To avoid infinite loop
    for smi in smiles_list:
        iteration = 0
        while(len(smi_gen_dict[smi][0]) < num_molecules_gen) and (iteration < MAX_ITER):
            for i in range(len(noise_list)):
                smi_canon = mu.canon_smiles(smi)
                smi_X = vae.smiles_to_hot(smi_canon, canonize_smiles=True)
                smi_z = vae.encode(smi_X)
                df = vae.z_to_smiles(smi_z, decode_attempts=250, noise_norm=noise_list[i])
                smi_gen_dict[smi][0] += df.smiles.values.tolist()
            smi_gen_dict[smi][0] = list(set(smi_gen_dict[smi][0])) #Avoid repeat molecules
            iteration += 1
Hello @GemmaTuron, with the current implementation of the code, the number of generated molecules for a SMILES can be 20 or more. If we want a limit, I will have to count the generated molecules and cap them at 20.
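To make the behavior above concrete, here is a minimal, self-contained sketch of the bounded loop. make_fake_decode is a hypothetical stand-in for the VAE decoding step (vae.z_to_smiles) and the noise values are made up; neither is part of the model code. The sketch also uses dict.fromkeys rather than list(set(...)) so that the order is deterministic, which is a choice of this example only.

```python
import itertools

NUM_MOLECULES_GEN = 20
MAX_ITER = 5  # stopper to avoid an infinite loop


def generate(smi, decode, noise_list, target=NUM_MOLECULES_GEN, max_iter=MAX_ITER):
    """Collect unique decoded strings until `target` is reached or `max_iter` passes are done."""
    generated = []
    iteration = 0
    while len(generated) < target and iteration < max_iter:
        for noise in noise_list:
            generated += decode(smi, noise)
        # dict.fromkeys removes repeats while keeping first-seen order
        generated = list(dict.fromkeys(generated))
        iteration += 1
        print(f"{smi}: pass {iteration}, {len(generated)} unique molecules")
    return generated, iteration


def make_fake_decode(batch_size):
    """Hypothetical decoder: returns `batch_size` new placeholder strings per call."""
    counter = itertools.count()

    def fake_decode(smi, noise):
        return [f"MOL{next(counter)}" for _ in range(batch_size)]

    return fake_decode


# A productive decoder overshoots the target: 3 noise levels x 8 per call = 24 on pass 1.
mols, passes = generate("CCO", make_fake_decode(8), noise_list=[0.1, 0.5, 1.0])

# A decoder that never yields anything stops after MAX_ITER passes instead of looping forever.
empty, empty_passes = generate("CCO", lambda smi, noise: [], noise_list=[0.1, 0.5, 1.0])
```

The printed progress lines are the kind of output that would show whether molecules are being generated at all before they reach the output file.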
Hi @emmakodes
A limit of 20 sounds good. If there are fewer than 20 for some molecules, fill in the output csv with NaN values. The output shape will then be a List, not a Flexible List, since it will always have 20 values. Thanks!
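As a rough sketch of that request, the helper below truncates a list to 20 entries and pads shorter lists with NaN. The name fix_length and its parameters are hypothetical and not taken from the model code.

```python
import math

N_OUTPUTS = 20


def fix_length(molecules, n=N_OUTPUTS, fill=float("nan")):
    """Return exactly `n` values: truncate long lists, pad short ones with `fill`."""
    trimmed = list(molecules)[:n]
    return trimmed + [fill] * (n - len(trimmed))


short = fix_length(["CCO", "CCN"])                 # 2 molecules + 18 NaN
long = fix_length([f"MOL{i}" for i in range(30)])  # first 20 of 30
```

Padding in one place keeps every output row the same width, which is what a fixed List output shape expects.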
Okay thanks, @GemmaTuron let me work on this
Hello @GemmaTuron, the limit for each molecule is now 20. I used smi_gen_dict[smi][0] = smi_gen_dict[smi][0][:20] in main.py to select the first 20 generated molecules.
I changed the metadata.json Output Shape to a List since the model will always have 20 values.
However, each generated molecule is placed in its own column when I make predictions via bash run.sh . test.csv output.csv
output.csv
But when I make predictions via ersilia (ersilia -v api run -i eml_canonical_simple.csv -o eos3ae7_pred.csv), the generated molecules are all placed in a single column instead of one column each. You can check the csv files to confirm what I mean.
eos3ae7_pred.csv
I am not sure why it is so. You can also check my fork of the model to see the current working version I have so far.
@emmakodes
The code looks fine, perhaps the metadata is not being updated properly. Open a PR and we'll test once the changes are merged, to see if it is now catching the List output
Hello @GemmaTuron, I started with this model. I first changed
    conda install -c conda-forge rdkit=2021.03.4
to
    pip install rdkit==2023.3.1
to avoid any dependency conflict. Then, the model threw a ModuleNotFoundError for the following packages: pandas, yaml, keras, tensorflow and matplotlib.
I discovered that the model needs a specific version of keras to work properly. Currently, I am testing with keras==2.0.6. I removed the matplotlib import statement from the code and didn't install it, since ersilia doesn't need to plot graphs.
Subsequently, the model threw the following error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I had to replace exp.json and zinc.json with the ones present in the model's original repo. It seems exp.json and zinc.json are corrupt.
Another issue arose with an:
AttributeError: 'str' object has no attribute 'decode'
I had to install h5py==2.10.0 to resolve the error.
The model then threw the following errors:
NameError: name 'num_noise_dict' is not defined
NameError: name 'num_molecules_dict' is not defined
NameError: name 'iteration' is not defined
I discovered that num_noise_dict, num_molecules_dict and iteration were used in the code but were never defined. So, I had to remove the lines where they were used.
Then, the model fetched successfully. When I serve the model and make a prediction using the following command
ersilia -v api predict -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
the model is producing the following output:
Notice that the outcome is 0.0 instead of generated molecules. When I run bash run_predict.sh . test.csv output.csv I get generated molecules.
output.csv
I guess somehow ersilia is not getting the generated_molecules. I am working on finding a fix.
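Two of the errors mentioned above can be reproduced in isolation, which may help when checking a fresh environment. This is only a sketch: is_valid_json and as_text are hypothetical helper names, not part of the model or of keras. For background, h5py 3.x returns string attributes as str rather than bytes, which is a commonly reported cause of the 'decode' error with older keras versions; that this is exactly what happened here is an assumption.

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def as_text(value, encoding="utf8"):
    """Return `value` as str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# An empty file body is one way to get "Expecting value: line 1 column 1 (char 0)".
try:
    json.loads("")
    empty_error = None
except json.JSONDecodeError as err:
    empty_error = str(err)

# Calling .decode on a str raises the AttributeError seen above.
try:
    "keras_version".decode("utf8")
    decode_error = None
except AttributeError as err:
    decode_error = str(err)
```

Running is_valid_json on the contents of exp.json and zinc.json would be a quick way to confirm whether those files parse before fetching the model.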