Clean UP & Dockerization eos9be7

ZakiaYahya commented 1 year ago

Hello @GemmaTuron and @DhanshreeA I have started working on this model and realized that in main.py it is not taking csv fileas an input but a json file. As the category to which this model belongs is "SpecialInputs" and that's why i thought of not touching it and i tried running it with run.sh and with --repo-pathand the model is not working throwing me an errors which is most probably due to Json input.

with run.sh it is giving me log_run_sh.txt and with --repo-path log_repo_path.txt

I even tried change the format of reading the input as csv rather than json and it seems like it stucks somewhere in the code , keep running and not showing anything on a terminal so i have to CTRL-C it. So, to figure it out where basically it is doing so, i put print statements everywhere and it stucks at

fcd_scores += [fcd.get_fcd(can_smiles1, can_smiles2)]

The log is here log.txt

I even tried to convert theinput.csv to data.json and then pass it to model but still it shows another error to me , here's the error log. As i couldn't able to attach the json file here but i've copied the stuff from json to txt to attact it here error_log.txt data_json.txt

Question So, i just want to know that should i change the input reading format from json tocsv in main.py or just keep the code intact as the name implies "SpecialInputs", So i can further proceed it.

Thanks

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

For the first error log, I just see a suspicious path:

  File "/home/zakia/eos9be7/model/framework//code/main.py", line 14, in <module>
    data = json.load(f)

Could it be this double // is a problem?

GemmaTuron commented 1 year ago

The model needs two lists as input, we kept it as JSON format to make sure we could opass several pairs of pairs. Let's focus first in making the model run using a .json input and running from run.sh and we will take it from there.

ZakiaYahya commented 1 year ago

Hi @ZakiaYahya

For the first error log, I just see a suspicious path:
  File "/home/zakia/eos9be7/model/framework//code/main.py", line 14, in <module>
    data = json.load(f)
Could it be this double // is a problem?

No @GemmaTuron, The data.json is loaded successfully, i have checked it.

ZakiaYahya commented 1 year ago

Member Author

Right @GemmaTuron Thanks, I'm working on it but still no progress yet. Can you just guide me a little on the input structure, because i've converted csv to json and the structure is like this

[{"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"},{"smiles":"C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"},{"smiles":"CC(=O)Nc1sc(nn1)[S](N)(=O)=O"},{"smiles":"CC(O)=O"},{"smiles":"CC(=O)N[C@@H](CS)C(O)=O"},{"smiles":"CC(=O)Oc1ccccc1C(O)=O"},{"smiles":"NC1=NC(=O)c2ncn(COCCO)c2N1"},{"smiles":"OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"},{"smiles":"CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1"}]

But i think the problem is with input format here, the json file is loaded fine as i'm using these checks in the main.py to check that

try:
    with open(input_file, "r") as f:
        data = json.load(f)
    print("JSON data is loaded successfully.")
    # You can now work with the 'loaded_data' variable, which contains the parsed JSON.
except json.JSONDecodeError as e:
    print("Failed to load JSON data:", str(e))
    # Handle the exception or show an error message.
except FileNotFoundError:
    print("File not found. Make sure the 'data.json' file exists in the current directory.")

and it gives me "JSON data is loaded successfully", here's the log file json_log.txt

for this single smile input {"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"}

and after that it keep running and running, showing me nothing on the terminal. I'm working on it, by going through the code. Thanks.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA So going into the detail and understanding of code, in my understanding the json input file shouldn't be in that format i.e.

[{"smiles":"Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"},{"smiles":"C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"},{"smiles":"CC(=O)Nc1sc(nn1)[S](N)(=O)=O"},{"smiles":"CC(O)=O"},{"smiles":"CC(=O)N[C@@H](CS)C(O)=O"},{"smiles":"CC(=O)Oc1ccccc1C(O)=O"},{"smiles":"NC1=NC(=O)c2ncn(COCCO)c2N1"},{"smiles":"OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"},{"smiles":"CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1"}]

As the code is actually not parsing that format, from main.pyi understand that basically it requires 2 smiles at a time i.e. d[0] and d[1], that means the main.py require a list that contains two smiles in each iteration to calculate fcd scores. The format should be like this

[
   ["Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1", "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C) 
   [C@H]3CC=C4c5cccnc5"],
   ["CC(=O)Nc1sc(nn1)[S](N)(=O)=O", "CC(O)=O"], 
   ["CC(=O)N[C@@H](CS)C(O)=O", "CC(=O)Oc1ccccc1C(O)=O"],
   ["NC1=NC(=O)c2ncn(COCCO)c2N1", "OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1"],
   ["CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1", "CC(O)=O"]
]

I input this format as json file to main.py and it is working and giving mefcd scoreagainst each list of smiles. here's the output eos9be7_ output.csv

Now my question is that, the code i've write to convert csv to json is like that it is taking 1st and 2nd smile to pair 1 list and then 3rd and 4rth smile to pair 2nd list and so on i.e.

[
  ["SMILES1", "SMILES2"],
  ["SMILES3", "SMILES4"],
  ...
]

so basically it requires even number of input smiles fromcsvto complete the pairs. Should it is like that or should it be like,

[
  ["SMILES1", "SMILES2"],
  ["SMILES2", "SMILES3"],
  ...
]

Latest Update In my opinion, the second approach is more appropriate as in case of odd number of smiles in a csv file results in pairing of all smiles even with the last one as well but in case of first approach if the number of smiles in a csv file is odd, then the last one will not pair up with anyone and the list contains only one smile.

I've tested that scenario as well in which csv file contain odd number of smiles and the last one is not pair up with any smile and the model ends up at this error, in which d[0] is available but d[1] is not

d0 CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1
Traceback (most recent call last):
  File "/home/zakia/eos9be7/model/framework//code/main.py", line 29, in <module>
    print("d1",d[1])
IndexError: list index out of range

Thanks

ZakiaYahya commented 1 year ago

Hello @GemmaTuron I've initially tested only 9 smiles to test the model with run.sh and it is working but when i passed whole eml_canonical.csv, the model fails at some smiles giving an error

d0 C[C@H](CCCC(C)(C)O)[C@H]1CC[C@H]2\C(CCC[C@]12C)=C\C=C3\C[C@@H](O)C[C@H](O)C3=C
d1 [CaH2]
can1 ['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'C', 'C', 'O', 'C', 'C']
can2 ['C']
Traceback (most recent call last):
  File "/home/zakia/eos9be7/model/framework//code/main.py", line 66, in <module>
    fcd_scores += [fcd.get_fcd(can_smiles1, can_smiles2)]
  File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/fcd/FCD.py", line 229, in get_fcd
    sigma2=sigma2)
  File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/fcd/FCD.py", line 71, in calculate_frechet_distance
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
  File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/scipy/linalg/_matfuncs_sqrtm.py", line 158, in sqrtm
    A = _asarray_validated(A, check_finite=True, as_inexact=True)
  File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/scipy/_lib/_util.py", line 293, in _asarray_validated
    a = toarray(a)
  File "/home/zakia/miniconda3/envs/model-eos9be7/lib/python3.7/site-packages/numpy/lib/function_base.py", line 489, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

So, to resolve this error i put try and except block of code to replace NAN/inf fcd_scores with zeros in that specific case i.e.

    try:
        fcd_score = fcd.get_fcd(can_smiles1, can_smiles2)
    except (ValueError, ZeroDivisionError):
        # Handle cases where fcd_score is inf or NAN by setting it to zero
        fcd_score = 0.0 
    fcd_scores.append(fcd_score)

Now with this the the model is handling the NAN/inf values and it is working for all pairs of smiles. I'm attaching the output output_eml.csv

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA So, basically i've put code blocks in main.py so that we could provide it normal csv smiles file as an input to the model, model take the input smiles and transform them into smile pairs and store them into thejson file in the working directory. Here's the piece of code to do that

# Read CSV and store data in a list
data = []
with open(input_file, 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)  # Skip the header row
    for row in csv_reader:
        data.append(row[0])  # Assuming single column, extract the first element (SMILES)

# Pair the SMILES strings into a list of lists
pairs = [[data[i], data[i+1]] for i in range(len(data) - 1)]

# Convert the data to the desired JSON format
json_data = json.dumps(pairs)
ROOT = os.path.dirname(os.path.abspath(__file__))
output_json_file = os.path.abspath(os.path.join(ROOT, "..", "input_data.json"))

After that i put the code snippet in main.py that reads the json file generated above and handles the exception as well if encounter


# Check whether the json file loaded successfully or not    
try:
    with open(output_json_file, "r") as f:
        data = json.load(f)
    print("JSON data is loaded successfully.")
    # You can now work with the 'loaded_data' variable, which contains the parsed JSON.
except json.JSONDecodeError as e:
    print("Failed to load JSON data:", str(e))
    # Handle the exception or show an error message.
except FileNotFoundError:
    print("File not found. Make sure the 'data.json' file exists in the current directory.")
    # Handle the case when the file is not found.

After reading smiles pairs from json file, thefcd scores are calculated. To handle inf/NAN scores and putting zero instead, i've added following try and except block

    try:
        fcd_score = fcd.get_fcd(can_smiles1, can_smiles2)

    except (ValueError, ZeroDivisionError):
        # Handle cases where fcd_score is inf or NAN by setting it to zero
        fcd_score = 0.0

    fcd_scores.append(fcd_score)

In the end, after writing the fcd scores in output file, if user want to delete the generatedjson file, one can use this piece of code otherwise you can comment it out.

if os.path.isfile(output_json_file):
     os.remove(output_json_file)

Here's the modified main.pyfile main-py.txt

The following code is working, performing all the things correctly and giving me the output as well withrun.sh. Now i'm trying to fetch the same code with --repo-path.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

Apologies, I've been thinking about this model and the input format. Your suggestion creates "random" pairs, not allowing the user to really pass in the groups of smiles to compare (for example, we might want to compare a group of 3 molecules with a group of 4) The input for these cases is defined in: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/inputs#pair-of-lists-of-molecules You can have a look and reformat the code thinking that the csv format people will pass is the one specified in the ersilia book for pairs of lists or multiple pairs of lists? This model input is complex so ask any questions you have

DhanshreeA commented 1 year ago

@GemmaTuron a small question, why are we interested in pairs of pairs as inputs? Taking for example, the 3x4 scenario would imply total 12 smiles pairs? (Each of the 3 smiles being compared to the other 4 smiles?) Or is it a 1:1 correspondence, ie each smile at row1, col1 would only be compared with the smile at row1, col2, thereby leading to no comparison for row 4 with an empty output for row 4?

ZakiaYahya commented 1 year ago

Hi @ZakiaYahya

Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine

Yes @GemmaTuron I'm figuring it out, As suggested by @DhanshreeA, checking those smiles at which fcd throws that error is it due basically the RDkit that fails at those smiles or what

ZakiaYahya commented 1 year ago

Alright @GemmaTuron I have a discussion on it with @DhanshreeA as well, need to bit go in detail to understand the things, the 1:1 correspondence is quite understandle if you want to just compare two smiles from a list but taking it from to compare a group of 3 molecules with a group of 4, requires a strategy.

DhanshreeA commented 1 year ago

Other updates from our 1:1/debugging session for this model:

Hi @ZakiaYahya

Excellent work, thanks. Since this model is calculating distances, it could be misleading to give a 0 instead of inf or Nan. Would you be able to check the range of distances so that we can give a value outside this range (higher than any other) @miquelduranfrigola what do you think? for the rest the code looks fine

@GemmaTuron the NaN/Inf aren't exactly coming from the distances being too large, they're the result of a parsing error. The FCD code is using RDKit internally to try to "canonicalize" smiles. Particularly, it converts a mol from an input smile and tries to convert it back. On top of that, it's also using multiprocessing (which technically isn't an issue in itself) but it could be that some of these jobs are not getting executed properly, or might not be getting resource allocation returning None for some smiles (like I said, this is rare, and if that is happening, it should be happening randomly - that is to say, if we get None from canonical_smiles methods for the same smiles over and over again, then it's just an RDKit error as opposed to a multiprocessing error). Posting the FCD code snippet here for reference:


def canonical(smi):
    try:
        return Chem.MolToSmiles(Chem.MolFromSmiles(smi))
    except:
        return None

def canonical_smiles(smiles, njobs=32):
    with Pool(njobs) as pool:
        return pool.map(canonical, smiles)

@ZakiaYahya is looking into which smiles are not properly being parsed by FCD and update here. I would also recommend using fcd.canonical(...) in the main.py instead of fcd.canonical_smiles(...)

The other issue is with an output.csv not being read by service.py correctly (or not existing in the first place) which might be trivial to solve once we get the model to run with run.sh and correct csv format.

GemmaTuron commented 1 year ago

Thansk @ZakiaYahya and @DhanshreeA

This model calculates the distance between groups of molecules, hence you can pass a list of 3 molecules to be calculated against a list of 4 molecules (Single Input, because it is only ONE comparison), or even a list of 2 molecules against 4 molecules (comparison 1) and a list of 3 molecules against one molecule (comparison 2) (Multiple Input, because it is several comparisons).

In the first case, we had defined the .csv input as two columns, each containing one group for the comparison, I hope this is clear

Ofc, it does not serve for the multiple comparisons, which resorts to the following strategy: Each row is one comparison, with column1 containing the first group for comparison, SMILES separated by a dot (.), and column2 containing the second group for comparison, also SMILES separated by a dot (.)

It maybe would be easier to use only the multiple comparison format with only one row for single input.

does this clarify?

GemmaTuron commented 1 year ago

To deal with Nan inputs, I would not put a 0 still, can we return None or empty distance if one molecule cannot be calculated?

ZakiaYahya commented 1 year ago

Hello @GemmaTuron and @DhanshreeA Yeh after meeting, i got the idea more clearer. I'm making changes in main.py accordingly, will try to take it from example of multiple inputs mentioned in Ersilia gitbook. Once i make it work with run.sh, i'll try to incorporate None or empty distance in it. Thanks

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA I have passed multiple_inputs as mentioned here and the model is working fine with JSON input with run.sh Here's the output eos9be7_output.csv

for this input

[
    [
        [
            "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
            "C1=CN=CC=C1C(=O)NN"
        ],
        [
            "CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
        ]
    ],
    [
        [
            "CC(=O)OC1=CC=CC=C1C(=O)O",
            "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
        ],
        [
            "CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
            "COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
        ]
    ]
]

Firstly, i've replaced NAN/inf fcd_scores with None instead of zeros as you have seen in the output, the first entry throws NAN/inf which is replaced by NONE. Secondly, I've checked those smiles for which it is throwing NAN/inf as in the case of this input shared above, it is failing for first pair of smiles but it seems like the problem doesn't lies in canonalization of smiles as i've checked all 3 smiles in that list using the python code below and these smiles are converted to mols and back to smiles successfully.
rdkit_py.txt

Third, The suggestion given by @DhanshreeA to use def fcd_canonical(smi) instead of def fcd_canonical_smiles(smiles, njobs=32): as it uses multiprocessing it is trying to create 32 jobs, and maybe computer doesn't have that many cpu cores to spare. But when i tried to use fcd_canonical instead of fcd_canonical_smiles it start giving me an error i.e.

input file /home/zakia/eos9be7/model/framework//multiple_inputs.json
JSON data is loaded successfully.
d0 ['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN']
d1 ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']
d [['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN'], ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']]
count 1
Traceback (most recent call last):
  File "/home/zakia/eos9be7/model/framework//code/main.py", line 44, in <module>
    can_smiles1 = [smi for smi in fcd.canonical(d[0]) if ((smi is not None))]
TypeError: 'NoneType' object is not iterable

So, i couldn't figure it out yet. Thanks

GemmaTuron commented 1 year ago

Hi @ZakiaYahya Thanks, could we print the d[0]? it might be we are not getting it right?

ZakiaYahya commented 1 year ago

Hello @GemmaTuron Yes i printed the d, d[0] and d[1] as well as shown below, The input file is reading successfully. The code is working fine with def fcd_canonical_smiles but it is not working with def fcd_canonicaland throwing this error. We can then used the one that is already mentioned in the code and it is working i.e. def fcd_canonical_smiles

input file /home/zakia/eos9be7/model/framework//multiple_inputs.json
JSON data is loaded successfully.
d0 ['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN']
d1 ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']
d [['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN'], ['CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O']]
count 1
Traceback (most recent call last):
  File "/home/zakia/eos9be7/model/framework//code/main.py", line 44, in <module>
    can_smiles1 = [smi for smi in fcd.canonical(d[0]) if ((smi is not None))]
TypeError: 'NoneType' object is not iterable

GemmaTuron commented 1 year ago

mmm this function should work... Could you push your changes to your fork so I can see it better? thanks!

ZakiaYahya commented 1 year ago

@GemmaTuron I've push the latest changes to my fork, check it here Thanks

GemmaTuron commented 1 year ago

thanks @ZakiaYahya I tried to push changes to your repo but I don't have permits. Here is the edit I did, it was simply a list comprehension syntax issue:

    can_smiles1 = [fcd.canonical(smi) for smi in d[0] if smi is not None]
    print("Second line", can_smiles1)
    can_smiles2 = [fcd.canonical(smi) for smi in d[1] if smi is not None]

Also, I changed the dockerfile (I am running on py 3.10):

FROM bentoml/model-server:0.11.0-py310
MAINTAINER ersilia

RUN pip install rdkit-pypi==2022.3.1b1
RUN pip install fcd==1.1

WORKDIR /repo
COPY . /repo

hope this helps!

ZakiaYahya commented 1 year ago

@GemmaTuron and @DhanshreeA Few updates: (1) The purpose of using def fcd_canonical instead of def fcd_canonical_smiles is to got rid of inf/NAN fcd scores but i got NAN/inf score using both of them on same pair of smiles (2) Handle NAN/inf fcd scores with NONE value in output csv file (3) The model is working with multiple_inputs in json format withrun.sh

So, i think i should now run it with --repo_path or we need to do something more in the code atrun.sh level?? Thanks

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

Can you explain a bit more about point 1? When do we get NaN scores?

ZakiaYahya commented 1 year ago

Hello @GemmaTuron Actually @DhanshreeA raised a point here we are getting NAN/inf may be because of using fcd_canonical_smiles because that function is using multiprocessing and it is trying to create 32 jobs, and maybe computer doesn't have that many cpu cores to spare. So, instead of using fcd_canonical_smiles, we could try using fcd_canonical to avoid NAN/inf. But using both functions either fcd_canonical_smiles or fcd_canonical i'm getting NAN/inf error at calculating fcd_scores with same pair of smiles, so in my opinion, their is nothing wrong with canonicalization of smiles or RDkit problem. What do you think @GemmaTuron ?

GemmaTuron commented 1 year ago

You mean the same pair of smiles fails in either the fcd.canonical_smiles function or the fcd.canonical function? I think it is fine, probably the distances are inf in that case which is why it returns a Nan? How many instances of this we have? In any case, I'd leave the simpler function (fcd.canonical)

ZakiaYahya commented 1 year ago

@GemmaTuron Yes both functions gives NAN on same smiles. Last time, when i created a JSON file with a pair of two smiles from eml_canonical, then from 442 smiles, the pairs are 441 and it is giving me NAN/inf at 14 instances that means basically it fails at 7 smiles, because i got NAN/inf at two adjacent pairs because that two adjacent pairs share the same smile.

So the ratio is bit lower. I think we can just put NONE at those fcd_scores. But now i'm just using the multiple_inputs from input gitbook Ersilia and that has just two pairs and it is failing at 1 pair.

GemmaTuron commented 1 year ago

I think it is fine @ZakiaYahya as is, just output None if the distance cannot be calculated

ZakiaYahya commented 1 year ago

Hello @GemmaTuron I'm still just not able to create a JSON or CSV file from eml_canonical that has multiple_input format like in the example below, The one pair has 2 smiles against 1 smile and the other pair has 2 smiles against 2 smiles. So i'm confused how to create a eloborated pair of smiles as an input to further test the model or i just use random smiles and few pairs to test it, Kindly guide me on this

[
    [
        [
            "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
            "C1=CN=CC=C1C(=O)NN"
        ],
        [
            "CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
        ]
    ],
    [
        [
            "CC(=O)OC1=CC=CC=C1C(=O)O",
            "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
        ],
        [
            "CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
            "COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
        ]
    ]
]

GemmaTuron commented 1 year ago

Hi @ZakiaYahya We can totally use a random way of pairing the smiles to test the model, perhaps just a few groups is sufficient, simply take the smiles in eml_canonical and prepare a cvs file that has the following groups: 3, 4 2.1 4,5 2,2 7,8 3,9

ZakiaYahya commented 1 year ago

Okay @GemmaTuron Just let me create a input file having these pairs and then i'll let you know. Thanks

ZakiaYahya commented 1 year ago

Hello @GemmaTuron I have created pairs of smile as you suggested above. here's th input file input.csv The corresponding output is here, output.csv

The above inputs didn't contain any smile string that has dot . operand, but when i include any smile in CSV file that has dot . operand than CSV treated it as two separates smiles e.g i included '[K+].[I-]' in a CSV file and csv reads it as['[K+]', '[I-]'] instead of ['[K+].[I-]']. So, in my opinion we need to stick to JSON file as an input rather than CSV as JSON treated it as single smile rather than two smile strings. What do you say @GemmaTuron?

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA I've tested the model both withrun.sh and with --repo-path using this JSON input file JSON_input.txt

It is also working with smiles having dot.operand in it. Here's the output files for run.sh and--repo_path output_repo_path.csv output_run_sh.csv

Can i start refactoring the model now??

DhanshreeA commented 1 year ago

Hello @GemmaTuron I have created pairs of smile as you suggested above. here's th input file input.csv The corresponding output is here, output.csv

The above inputs didn't contain any smile string that has dot . operand, but when i include any smile in CSV file that has dot . operand than CSV treated it as two separates smiles e.g i included '[K+].[I-]' in a CSV file and csv reads it as['[K+]', '[I-]'] instead of ['[K+].[I-]']. So, in my opinion we need to stick to JSON file as an input rather than CSV as JSON treated it as single smile rather than two smile strings. What do you say @GemmaTuron?

@miquelduranfrigola @GemmaTuron not sure this is a sustainable solution to use JSON where the CSV parsing is incorrect. We should consider changing this behavior at the Ersilia CLI level, what do you think?

DhanshreeA commented 1 year ago

Hello @GemmaTuron @DhanshreeA I've tested the model both withrun.sh and with --repo-path using this JSON input file JSON_input.txt

It is also working with smiles having dot.operand in it. Here's the output files for run.sh and--repo_path output_repo_path.csv output_run_sh.csv

Can i start refactoring the model now??

Sounds good @ZakiaYahya please go ahead!

GemmaTuron commented 1 year ago

mmm this is a challenging one @DhanshreeA A dot implies disconnection, so two atoms separated by a DOT in a SMILES string are in principle not bonded, not connected. But, some molecules that might have ions bound (like sodium (Na), potassium (K) or iodide (I)) might have a dot -- @miquelduranfrigola what do you think?

DhanshreeA commented 1 year ago

mmm this is a challenging one @DhanshreeA A dot implies disconnection, so two atoms separated by a DOT in a SMILES string are in principle not bonded, not connected. But, some molecules that might have ions bound (like sodium (Na), potassium (K) or iodide (I)) might have a dot -- @miquelduranfrigola what do you think?

@GemmaTuron that's very interesting to learn! I always thought ionic bonds are stronger than covalent bonds but I think I understand why they're not considered "bonded" (They're just two charged ions held together by a strong electrical force)

GemmaTuron commented 1 year ago

@DhanshreeA and @ZakiaYahya

We will stay with the . to split groups, since this is the most accepted way of doing it, and we will just have to live with the edge cases where there is a . in the MSILES, the ion will be separated from the rest of the molecule. If anything, we could try to check for one atom molecules that appear after the transformation of the .csv, to discard them, but that would be an extra mile

DhanshreeA commented 1 year ago

Thanks for the clarification @GemmaTuron

What you're suggesting is achievable. For example, after transformation of the CSV, one can convert all smiles into molecules and check for molecules that have single atoms (as you said) and discard the corresponding smile + its consecutive smile (because at that point it'll not be a molecule, just an ion).

A short snippet for that here @ZakiaYahya:

from rdkit import Chem
smiles = [...] # Contains the DOT separated smiles read from a single column
smiles_to_remove = [] # Stores indices
mols = [Chem.MolFromSmiles(smi) for smi in smiles] # Convert to Molecule objects
i = 0
while i<len(mols):
  num_atoms = len(m.GetAtoms())  # Get Atoms in a given Molecule
  if num_atoms == 1:
    smiles_to_remove.extend([i, i+1])
    i+=2
  else:
    i+=1

modified_smiles = []

for idx, smi in enumerate(smiles):
  if idx not in smiles_to_remove:
    modified_smiles.append(smi)

But like Gemma said, it's okay if we don't deal with this for now.

ZakiaYahya commented 1 year ago

Hello @Dhansharee Let me check, it that works, otherwise i'll simply changes the code to take CSV as an input, test it withrun.sh and --repo-pathand open PR on it. Thanks

GemmaTuron commented 1 year ago

Good one @DhanshreeA !

Let me know if it works @ZakiaYahya

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA Today's update: I've modified main.py according to take CSV file as an input, but as service.py is dealing with JSON file so it still not working with CSV file, i need to modify service.py also so that it process multiple_inputs from CSV as well to run on Ersilia CLI. eos9be7_log.log

GemmaTuron commented 1 year ago

thanks @ZakiaYahya for the update, let me know if you get stuck and need help.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron Thanks for asking. I have tried different ways so that service.py process multiple inputs but i' stuck here, as nothing works and i got errors atsmile_1and smile_2 Here's the original service.py file that deals with JSON service_JSON.txt

and here's the modified that service.py that reads multiple inputs from CSV file service_CSV.txt

The error log is same i.e. eos9be7_log.log

@GemmaTuron is there any service.py file in Ersilia model Hub that processes multiple inputs so i get an idea from it. Or can you go through modified service.py and give me a clue where i'm doing it wrong.

Thanks

DhanshreeA commented 1 year ago

@ZakiaYahya I need to look at the main.py for this. Can you push the code to your fork in its current state? I'll clone your fork and try to debug it locally. Right now looking at the files you have provided, I am unable to tell much.

ZakiaYahya commented 1 year ago

Hello @DhanshreeA Just let me push the changes. Thanks

ZakiaYahya commented 1 year ago

Hello @DhanshreeA I've pushed the changes to my fork, you can check it here https://github.com/ZakiaYahya/eos9be7 FYI: the code is working fine with run.sh, it is failing with --repo-path at Ersilia CLI due to service.py because it is not processing CSV as an input. Can you please check my modified service.py as main.py is working perfectly with CSV now Thanks

GemmaTuron commented 1 year ago

Hi @ZakiaYahya and @DhanshreeA If I understand it correctly, it works with csv but Ersilia is not recognizing that?

ZakiaYahya commented 1 year ago

Hi @GemmaTuron Exactly, it is working with CSV file locally but not at Ersilia level. Because from serve.log it is basically giving error atinference.py and service.py, although in th eroor it throws it shows smile_1 error from main.py but at the backend in my understanding it is because of service.py. So, it just seem like ersilia is not processing the CSV in a format like smile_1 and smile_2, because i have done different variations in service.py so it start taking multiple inputs but somehow it keep throwing me error at smile_1 and smile_2. Let me run it again, so i can share the serve.log with you for your review. I'm just stuck at it now, couldn't able to resolve it yet. Thanks

ersilia-os / eos9be7

Clean UP & Dockerization eos9be7 #1