Closed GemmaTuron closed 1 year ago
Hi @ZakiaYahya
For the first error log, I just see a suspicious path:
File "/home/zakia/eos9be7/model/framework//code/main.py", line 14, in <module>
data = json.load(f)
Could it be this double //
is a problem?
The model needs two lists as input, we kept it as JSON format to make sure we could opass several pairs of pairs. Let's focus first in making the model run using a .json input and running from run.sh and we will take it from there.
Hi @ZakiaYahya
For the first error log, I just see a suspicious path:
File "/home/zakia/eos9be7/model/framework//code/main.py", line 14, in <module> data = json.load(f)
Could it be this double
//
is a problem?
No @GemmaTuron, The data.json is loaded successfully, i have checked it.
Member Author
Right @GemmaTuron Thanks, I'm working on it but still no progress yet. Can you just guide me a little on the input structure, because i've converted csv to json and the structure is like this
[{"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"},{"smiles":"C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"},{"smiles":"CC(=O)Nc1sc(nn1)[S](N)(=O)=O"},{"smiles":"CC(O)=O"},{"smiles":"CC(=O)N[C@@H](CS)C(O)=O"},{"smiles":"CC(=O)Oc1ccccc1C(O)=O"},{"smiles":"NC1=NC(=O)c2ncn(COCCO)c2N1"},{"smiles":"OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"},{"smiles":"CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1"}]
But i think the problem is with input format here, the json file is loaded fine as i'm using these checks in the main.py to check that
try:
with open(input_file, "r") as f:
data = json.load(f)
print("JSON data is loaded successfully.")
# You can now work with the 'loaded_data' variable, which contains the parsed JSON.
except json.JSONDecodeError as e:
print("Failed to load JSON data:", str(e))
# Handle the exception or show an error message.
except FileNotFoundError:
print("File not found. Make sure the 'data.json' file exists in the current directory.")
and it gives me "JSON data is loaded successfully"
, here's the log file
json_log.txt
for this single smile input
{"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"}
and after that it keep running and running, showing me nothing on the terminal. I'm working on it, by going through the code. Thanks.
Hello @GemmaTuron @DhanshreeA So going into the detail and understanding of code, in my understanding the json input file shouldn't be in that format i.e.
[{"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"},{"smiles":"C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"},{"smiles":"CC(=O)Nc1sc(nn1)[S](N)(=O)=O"},{"smiles":"CC(O)=O"},{"smiles":"CC(=O)N[C@@H](CS)C(O)=O"},{"smiles":"CC(=O)Oc1ccccc1C(O)=O"},{"smiles":"NC1=NC(=O)c2ncn(COCCO)c2N1"},{"smiles":"OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"},{"smiles":"CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1"}]
As the code is actually not parsing that format, from main.py
i understand that basically it requires 2 smiles at a time i.e. d[0] and d[1], that means the main.py
require a list that contains two smiles in each iteration to calculate fcd scores
. The format should be like this
[
["Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1", "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)
[C@H]3CC=C4c5cccnc5"],
["CC(=O)Nc1sc(nn1)[S](N)(=O)=O", "CC(O)=O"],
["CC(=O)N[C@@H](CS)C(O)=O", "CC(=O)Oc1ccccc1C(O)=O"],
["NC1=NC(=O)c2ncn(COCCO)c2N1", "OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"],
["CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1", "CC(O)=O"]
]
I input this format as json file to main.py
and it is working and giving mefcd score
against each list of smiles. here's the output
eos9be7_ output.csv
Now my question is that, the code i've write to convert csv
to json
is like that it is taking 1st and 2nd smile to pair 1 list and then 3rd and 4rth smile to pair 2nd list and so on i.e.
[
["SMILES1", "SMILES2"],
["SMILES3", "SMILES4"],
...
]
so basically it requires even number of input smiles fromcsv
to complete the pairs. Should it is like that or should it be like,
[
["SMILES1", "SMILES2"],
["SMILES2", "SMILES3"],
...
]
Latest Update
In my opinion, the second approach is more appropriate as in case of odd number
of smiles in a csv
file results in pairing of all smiles even with the last one as well but in case of first approach if the number of smiles in a csv file
is odd, then the last one will not pair up with anyone and the list contains only one smile.
I've tested that scenario as well in which csv
file contain odd number of smiles and the last one is not pair up with any smile and the model ends up at this error, in which d[0] is available but d[1] is not
d0 CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1
Traceback (most recent call last):
File "/home/zakia/eos9be7/model/framework//code/main.py", line 29, in <module>
print("d1",d[1])
IndexError: list index out of range
Thanks
Hello @GemmaTuron I've initially tested only 9 smiles to test the model with run.sh and it is working but when i passed whole eml_canonical.csv, the model fails at some smiles giving an error
d0 C[C@H](CCCC(C)(C)O)[C@H]1CC[C@H]2\C(CCC[C@]12C)=C\C=C3\C[C@@H](O)C[C@H](O)C3=C
d1 [CaH2]
can1 ['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'C', 'C', 'O', 'C', 'C']
can2 ['C']
Traceback (most recent call last):
File "/home/zakia/eos9be7/model/framework//code/main.py", line 66, in <module>
fcd_scores += [fcd.get_fcd(can_smiles1, can_smiles2)]
File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/fcd/FCD.py", line 229, in get_fcd
sigma2=sigma2)
File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/fcd/FCD.py", line 71, in calculate_frechet_distance
covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/scipy/linalg/_matfuncs_sqrtm.py", line 158, in sqrtm
A = _asarray_validated(A, check_finite=True, as_inexact=True)
File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/scipy/_lib/_util.py", line 293, in _asarray_validated
a = toarray(a)
File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/numpy/lib/function_base.py", line 489, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
So, to resolve this error i put try and except block of code to replace NAN/inf fcd_scores with zeros in that specific case i.e.
try:
fcd_score = fcd.get_fcd(can_smiles1, can_smiles2)
except (ValueError, ZeroDivisionError):
# Handle cases where fcd_score is inf or NAN by setting it to zero
fcd_score = 0.0
fcd_scores.append(fcd_score)
Now with this the the model is handling the NAN/inf values and it is working for all pairs of smiles. I'm attaching the output output_eml.csv
Hello @GemmaTuron @DhanshreeA
So, basically i've put code blocks in main.py
so that we could provide it normal csv smiles file
as an input to the model, model take the input smiles and transform them into smile pairs and store them into thejson file
in the working directory. Here's the piece of code to do that
# Read CSV and store data in a list
data = []
with open(input_file, 'r') as csv_file:
csv_reader = csv.reader(csv_file)
header = next(csv_reader) # Skip the header row
for row in csv_reader:
data.append(row[0]) # Assuming single column, extract the first element (SMILES)
# Pair the SMILES strings into a list of lists
pairs = [[data[i], data[i+1]] for i in range(len(data) - 1)]
# Convert the data to the desired JSON format
json_data = json.dumps(pairs)
ROOT = os.path.dirname(os.path.abspath(__file__))
output_json_file = os.path.abspath(os.path.join(ROOT, "..", "input_data.json"))
After that i put the code snippet in main.py
that reads the json file
generated above and handles the exception as well if encounter
# Check whether the json file loaded successfully or not
try:
with open(output_json_file, "r") as f:
data = json.load(f)
print("JSON data is loaded successfully.")
# You can now work with the 'loaded_data' variable, which contains the parsed JSON.
except json.JSONDecodeError as e:
print("Failed to load JSON data:", str(e))
# Handle the exception or show an error message.
except FileNotFoundError:
print("File not found. Make sure the 'data.json' file exists in the current directory.")
# Handle the case when the file is not found.
After reading smiles pairs from json file
, thefcd scores
are calculated. To handle inf/NAN scores
and putting zero instead, i've added following try and except block
try:
fcd_score = fcd.get_fcd(can_smiles1, can_smiles2)
except (ValueError, ZeroDivisionError):
# Handle cases where fcd_score is inf or NAN by setting it to zero
fcd_score = 0.0
fcd_scores.append(fcd_score)
In the end, after writing the fcd scores
in output file
, if user want to delete the generatedjson file
, one can use this piece of code otherwise you can comment it out.
if os.path.isfile(output_json_file):
os.remove(output_json_file)
Here's the modified main.py
file
main-py.txt
The following code is working, performing all the things correctly and giving me the output as well withrun.sh
. Now i'm trying to fetch the same code with --repo-path
.
Hi @ZakiaYahya
Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine
Hi @ZakiaYahya
Apologies, I've been thinking about this model and the input format. Your suggestion creates "random" pairs, not allowing the user to really pass in the groups of smiles to compare (for example, we might want to compare a group of 3 molecules with a group of 4) The input for these cases is defined in: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/inputs#pair-of-lists-of-molecules You can have a look and reformat the code thinking that the csv format people will pass is the one specified in the ersilia book for pairs of lists or multiple pairs of lists? This model input is complex so ask any questions you have
@GemmaTuron a small question, why are we interested in pairs of pairs as inputs? Taking for example, the 3x4 scenario would imply total 12 smiles pairs? (Each of the 3 smiles being compared to the other 4 smiles?) Or is it a 1:1 correspondence, ie each smile at row1, col1 would only be compared with the smile at row1, col2, thereby leading to no comparison for row 4 with an empty output for row 4?
Hi @ZakiaYahya
Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine
Yes @GemmaTuron I'm figuring it out, As suggested by @DhanshreeA, checking those smiles at which fcd throws that error is it due basically the RDkit that fails at those smiles or what
Alright @GemmaTuron I have a discussion on it with @DhanshreeA as well, need to bit go in detail to understand the things, the 1:1 correspondence is quite understandle if you want to just compare two smiles from a list but taking it from to compare a group of 3 molecules with a group of 4, requires a strategy.
Other updates from our 1:1/debugging session for this model:
Hi @ZakiaYahya
Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine
canonical_smiles
methods for the same smiles over and over again, then it's just an RDKit error as opposed to a multiprocessing error). Posting the FCD code snippet here for reference:
def canonical(smi):
try:
return Chem.MolToSmiles(Chem.MolFromSmiles(smi))
except:
return None
def canonical_smiles(smiles, njobs=32):
with Pool(njobs) as pool:
return pool.map(canonical, smiles)
@ZakiaYahya is looking into which smiles are not properly being parsed by FCD and update here. I would also recommend using fcd.canonical(...)
in the main.py instead of fcd.canonical_smiles(...)
Thansk @ZakiaYahya and @DhanshreeA
This model calculates the distance between groups of molecules, hence you can pass a list of 3 molecules to be calculated against a list of 4 molecules (Single Input, because it is only ONE comparison), or even a list of 2 molecules against 4 molecules (comparison 1) and a list of 3 molecules against one molecule (comparison 2) (Multiple Input, because it is several comparisons).
In the first case, we had defined the .csv input as two columns, each containing one group for the comparison, I hope this is clear
Ofc, it does not serve for the multiple comparisons, which resorts to the following strategy: Each row is one comparison, with column1 containing the first group for comparison, SMILES separated by a dot (.), and column2 containing the second group for comparison, also SMILES separated by a dot (.)
It maybe would be easier to use only the multiple comparison format with only one row for single input.
does this clarify?
To deal with Nan inputs, I would not put a 0 still, can we return None or empty distance if one molecule cannot be calculated?
Hello @GemmaTuron and @DhanshreeA Yeh after meeting, i got the idea more clearer. I'm making changes in main.py accordingly, will try to take it from example of multiple inputs mentioned in Ersilia gitbook. Once i make it work with run.sh, i'll try to incorporate None or empty distance in it. Thanks
Hello @GemmaTuron @DhanshreeA
I have passed multiple_inputs
as mentioned here and the model is working fine with JSON input
with run.sh
Here's the output
eos9be7_output.csv
for this input
[
[
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN"
],
[
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
]
],
[
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
],
[
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
]
]
Firstly, i've replaced NAN/inf
fcd_scores with None
instead of zeros
as you have seen in the output, the first entry throws NAN/inf
which is replaced by NONE
.
Secondly, I've checked those smiles for which it is throwing NAN/inf as in the case of this input shared above, it is failing for first pair of smiles but it seems like the problem doesn't lies in canonalization of smiles as i've checked all 3 smiles in that list using the python code below and these smiles are converted to mols and back to smiles successfully.
rdkit_py.txt
Third, The suggestion given by @DhanshreeA to use def fcd_canonical(smi)
instead of def fcd_canonical_smiles(smiles, njobs=32):
as it uses multiprocessing it is trying to create 32 jobs, and maybe computer doesn't have that many cpu cores to spare. But when i tried to use fcd_canonical
instead of fcd_canonical_smiles
it start giving me an error i.e.
input file /home/zakia/eos9be7/model/framework//multiple_inputs.json
JSON data is loaded successfully.
d0 ['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN']
d1 ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']
d [['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN'], ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']]
count 1
Traceback (most recent call last):
File "/home/zakia/eos9be7/model/framework//code/main.py", line 44, in <module>
can_smiles1 = [smi for smi in fcd.canonical(d[0]) if ((smi is not None))]
TypeError: 'NoneType' object is not iterable
So, i couldn't figure it out yet. Thanks
Hi @ZakiaYahya Thanks, could we print the d[0]? it might be we are not getting it right?
Hello @GemmaTuron
Yes i printed the d, d[0] and d[1] as well as shown below, The input file is reading successfully. The code is working fine with def fcd_canonical_smiles
but it is not working with def fcd_canonical
and throwing this error. We can then used the one that is already mentioned in the code and it is working i.e. def fcd_canonical_smiles
input file /home/zakia/eos9be7/model/framework//multiple_inputs.json
JSON data is loaded successfully.
d0 ['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN']
d1 ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']
d [['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN'], ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']]
count 1
Traceback (most recent call last):
File "/home/zakia/eos9be7/model/framework//code/main.py", line 44, in <module>
can_smiles1 = [smi for smi in fcd.canonical(d[0]) if ((smi is not None))]
TypeError: 'NoneType' object is not iterable
mmm this function should work... Could you push your changes to your fork so I can see it better? thanks!
@GemmaTuron I've push the latest changes to my fork, check it here Thanks
thanks @ZakiaYahya I tried to push changes to your repo but I don't have permits. Here is the edit I did, it was simply a list comprehension syntax issue:
can_smiles1 = [fcd.canonical(smi) for smi in d[0] if smi is not None]
print("Second line", can_smiles1)
can_smiles2 = [fcd.canonical(smi) for smi in d[1] if smi is not None]
Also, I changed the dockerfile (I am running on py 3.10):
FROM bentoml/model-server:0.11.0-py310
MAINTAINER ersilia
RUN pip install rdkit-pypi==2022.3.1b1
RUN pip install fcd==1.1
WORKDIR /repo
COPY . /repo
hope this helps!
@GemmaTuron and @DhanshreeA
Few updates:
(1) The purpose of using def fcd_canonical
instead of def fcd_canonical_smiles
is to got rid of inf/NAN fcd scores but i got
NAN/inf score using both of them on same pair of smiles
(2) Handle NAN/inf
fcd scores with NONE value in output csv file
(3) The model is working with multiple_inputs
in json format withrun.sh
So, i think i should now run it with --repo_path
or we need to do something more in the code atrun.sh
level??
Thanks
Hi @ZakiaYahya
Can you explain a bit more about point 1? When do we get NaN scores?
Hello @GemmaTuron
Actually @DhanshreeA raised a point here we are getting NAN/inf may be because of using fcd_canonical_smiles
because that function is using multiprocessing and it is trying to create 32 jobs, and maybe computer doesn't have that many cpu cores to spare. So, instead of using fcd_canonical_smiles
, we could try using fcd_canonical
to avoid NAN/inf
. But using both functions either fcd_canonical_smiles
or fcd_canonical
i'm getting NAN/inf error at calculating fcd_scores
with same pair of smiles, so in my opinion, their is nothing wrong with canonicalization of smiles or RDkit problem.
What do you think @GemmaTuron ?
You mean the same pair of smiles fails in either the fcd.canonical_smiles
function or the fcd.canonica
l function? I think it is fine, probably the distances are inf in that case which is why it returns a Nan?
How many instances of this we have?
In any case, I'd leave the simpler function (fcd.canonical)
@GemmaTuron Yes both functions gives NAN on same smiles. Last time, when i created a JSON file with a pair of two smiles from eml_canonical, then from 442 smiles, the pairs are 441 and it is giving me NAN/inf at 14 instances that means basically it fails at 7 smiles, because i got NAN/inf at two adjacent pairs because that two adjacent pairs share the same smile.
So the ratio is bit lower. I think we can just put NONE at those fcd_scores. But now i'm just using the multiple_inputs from input gitbook Ersilia and that has just two pairs and it is failing at 1 pair.
I think it is fine @ZakiaYahya as is, just output None if the distance cannot be calculated
Hello @GemmaTuron I'm still just not able to create a JSON or CSV file from eml_canonical that has multiple_input format like in the example below, The one pair has 2 smiles against 1 smile and the other pair has 2 smiles against 2 smiles. So i'm confused how to create a eloborated pair of smiles as an input to further test the model or i just use random smiles and few pairs to test it, Kindly guide me on this
[
[
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN"
],
[
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
]
],
[
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
],
[
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
]
]
Hi @ZakiaYahya We can totally use a random way of pairing the smiles to test the model, perhaps just a few groups is sufficient, simply take the smiles in eml_canonical and prepare a cvs file that has the following groups: 3, 4 2.1 4,5 2,2 7,8 3,9
Okay @GemmaTuron Just let me create a input file having these pairs and then i'll let you know. Thanks
Hello @GemmaTuron I have created pairs of smile as you suggested above. here's th input file input.csv The corresponding output is here, output.csv
The above inputs didn't contain any smile string that has dot .
operand, but when i include any smile in CSV file that has dot .
operand than CSV treated it as two separates smiles e.g i included '[K+].[I-]'
in a CSV file and csv reads it as['[K+]', '[I-]']
instead of ['[K+].[I-]']
. So, in my opinion we need to stick to JSON file as an input rather than CSV as JSON treated it as single smile rather than two smile strings.
What do you say @GemmaTuron?
Hello @GemmaTuron @DhanshreeA
I've tested the model both withrun.sh
and with --repo-path
using this JSON input file
JSON_input.txt
It is also working with smiles having dot.
operand in it.
Here's the output files for run.sh
and--repo_path
output_repo_path.csv
output_run_sh.csv
Can i start refactoring the model now??
Hello @GemmaTuron I have created pairs of smile as you suggested above. here's th input file input.csv The corresponding output is here, output.csv
The above inputs didn't contain any smile string that has dot
.
operand, but when i include any smile in CSV file that has dot.
operand than CSV treated it as two separates smiles e.g i included'[K+].[I-]'
in a CSV file and csv reads it as['[K+]', '[I-]']
instead of['[K+].[I-]']
. So, in my opinion we need to stick to JSON file as an input rather than CSV as JSON treated it as single smile rather than two smile strings. What do you say @GemmaTuron?
@miquelduranfrigola @GemmaTuron not sure this is a sustainable solution to use JSON where the CSV parsing is incorrect. We should consider changing this behavior at the Ersilia CLI level, what do you think?
Hello @GemmaTuron @DhanshreeA I've tested the model both with
run.sh
and with--repo-path
using this JSON input file JSON_input.txtIt is also working with smiles having dot
.
operand in it. Here's the output files forrun.sh
and--repo_path
output_repo_path.csv output_run_sh.csvCan i start refactoring the model now??
Sounds good @ZakiaYahya please go ahead!
mmm this is a challenging one @DhanshreeA A dot implies disconnection, so two atoms separated by a DOT in a SMILES string are in principle not bonded, not connected. But, some molecules that might have ions bound (like sodium (Na), potassium (K) or iodide (I)) might have a dot -- @miquelduranfrigola what do you think?
mmm this is a challenging one @DhanshreeA A dot implies disconnection, so two atoms separated by a DOT in a SMILES string are in principle not bonded, not connected. But, some molecules that might have ions bound (like sodium (Na), potassium (K) or iodide (I)) might have a dot -- @miquelduranfrigola what do you think?
@GemmaTuron that's very interesting to learn! I always thought ionic bonds are stronger than covalent bonds but I think I understand why they're not considered "bonded" (They're just two charged ions held together by a strong electrical force)
@DhanshreeA and @ZakiaYahya
We will stay with the . to split groups, since this is the most accepted way of doing it, and we will just have to live with the edge cases where there is a . in the MSILES, the ion will be separated from the rest of the molecule. If anything, we could try to check for one atom molecules that appear after the transformation of the .csv, to discard them, but that would be an extra mile
Thanks for the clarification @GemmaTuron
What you're suggesting is achievable. For example, after transformation of the CSV, one can convert all smiles into molecules and check for molecules that have single atoms (as you said) and discard the corresponding smile + its consecutive smile (because at that point it'll not be a molecule, just an ion).
A short snippet for that here @ZakiaYahya:
from rdkit import Chem
smiles = [...] # Contains the DOT separated smiles read from a single column
smiles_to_remove = [] # Stores indices
mols = [Chem.MolFromSmiles(smi) for smi in smiles] # Convert to Molecule objects
i = 0
while i<len(mols):
num_atoms = len(m.GetAtoms()) # Get Atoms in a given Molecule
if num_atoms == 1:
smiles_to_remove.extend([i, i+1])
i+=2
else:
i+=1
modified_smiles = []
for idx, smi in enumerate(smiles):
if idx not in smiles_to_remove:
modified_smiles.append(smi)
But like Gemma said, it's okay if we don't deal with this for now.
Hello @Dhansharee
Let me check, it that works, otherwise i'll simply changes the code to take CSV as an input, test it withrun.sh
and --repo-path
and open PR on it.
Thanks
Good one @DhanshreeA !
Let me know if it works @ZakiaYahya
Hello @GemmaTuron @DhanshreeA Today's update: I've modified main.py according to take CSV file as an input, but as service.py is dealing with JSON file so it still not working with CSV file, i need to modify service.py also so that it process multiple_inputs from CSV as well to run on Ersilia CLI. eos9be7_log.log
thanks @ZakiaYahya for the update, let me know if you get stuck and need help.
Hello @GemmaTuron
Thanks for asking. I have tried different ways so that service.py
process multiple inputs but i' stuck here, as nothing works and i got errors atsmile_1
and smile_2
Here's the original service.py
file that deals with JSON
service_JSON.txt
and here's the modified that service.py
that reads multiple inputs from CSV file
service_CSV.txt
The error log is same i.e. eos9be7_log.log
@GemmaTuron is there any service.py
file in Ersilia model Hub that processes multiple inputs so i get an idea from it. Or can you go through modified service.py
and give me a clue where i'm doing it wrong.
Thanks
@ZakiaYahya I need to look at the main.py for this. Can you push the code to your fork in its current state? I'll clone your fork and try to debug it locally. Right now looking at the files you have provided, I am unable to tell much.
Hello @DhanshreeA Just let me push the changes. Thanks
Hello @DhanshreeA
I've pushed the changes to my fork, you can check it here https://github.com/ZakiaYahya/eos9be7
FYI: the code is working fine with run.sh
, it is failing with --repo-path
at Ersilia CLI due to service.py
because it is not processing CSV as an input. Can you please check my modified service.py
as main.py
is working perfectly with CSV now
Thanks
Hi @ZakiaYahya and @DhanshreeA If I understand it correctly, it works with csv but Ersilia is not recognizing that?
Hi @GemmaTuron
Exactly, it is working with CSV file locally but not at Ersilia level. Because from serve.log
it is basically giving error atinference.py
and service.py
, although in th eroor it throws it shows smile_1 error from main.py
but at the backend in my understanding it is because of service.py
. So, it just seem like ersilia is not processing the CSV in a format like smile_1 and smile_2, because i have done different variations in service.py
so it start taking multiple inputs but somehow it keep throwing me error at smile_1
and smile_2
. Let me run it again, so i can share the serve.log
with you for your review. I'm just stuck at it now, couldn't able to resolve it yet.
Thanks
Hello @GemmaTuron and @DhanshreeA I have started working on this model and realized that in main.py it is not taking
csv file
as an input but ajson file
. As the category to which this model belongs is "SpecialInputs" and that's why i thought of not touching it and i tried running it withrun.sh
and with--repo-path
and the model is not working throwing me an errors which is most probably due to Json input.with run.sh it is giving me log_run_sh.txt and with --repo-path log_repo_path.txt
I even tried change the format of reading the input as
csv
rather than json and it seems like it stucks somewhere in the code , keep running and not showing anything on a terminal so i have to CTRL-C it. So, to figure it out where basically it is doing so, i put print statements everywhere and it stucks atfcd_scores += [fcd.get_fcd(can_smiles1, can_smiles2)]
The log is here log.txt
I even tried to convert the
input.csv
todata.json
and then pass it to model but still it shows another error to me , here's the error log. As i couldn't able to attach the json file here but i've copied the stuff from json to txt to attact it here error_log.txt data_json.txtQuestion So, i just want to know that should i change the input reading format from
json
tocsv
in main.py or just keep the code intact as the name implies "SpecialInputs", So i can further proceed it.Thanks