Clean UP & Dockerization eos4b8j

febielin commented 1 year ago

This model appears to pass all actions. It fetches and serves well. However, there are issues when running the eml_canonical file.

On CLI, there are missing outputs: eos4b8j_output.csv
On Docker, there are also missing outputs (not necessarily the same missing outputs as CLI): eos4b8j_docker_output.csv
On Colab, I get the following error (link to notebook): ValueError: Columns must be same length as keys

I also see that this is one of the models that post to the online server. I wonder if there are issues with web scraping.

GemmaTuron commented 1 year ago

Hi @febielin It is not exactly web scraping: we simply do not want to download the whole database, so we search it online instead. It is expected that some molecules will not have similar ones. Could you let me know roughly what proportion is missing? I think the error on Colab lies in trying to convert the output to pandas. Have you tried without specifying the output?

febielin commented 1 year ago

@GemmaTuron,

Thank you for your comments.

For the Docker output, around 1/6th is missing. For the CLI output, it is closer to 1/5th. As mentioned above, while there is some overlap between the two output files in the molecules that come out empty, I am also finding a significant number of molecules that have output in one file but are empty in the other. The model is not very consistent in this sense.
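A comparison like the one described above can be scripted rather than done by eye. The following is a minimal sketch, assuming each output file is a CSV with one key column and one result column; the column names `input` and `output`, the helper `empty_keys`, and the inline sample data are all hypothetical and are not taken from the actual eos4b8j output files.

```python
import csv
import io


def empty_keys(csv_text, key_col="input", value_col="output"):
    """Return (set of keys whose result is empty, total number of rows).

    key_col and value_col are hypothetical column names for illustration.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    empty = {r[key_col] for r in rows if not (r[value_col] or "").strip()}
    return empty, len(rows)


# Tiny made-up stand-ins for the CLI and Docker output files.
cli_csv = "input,output\nA,x\nB,\nC,\nD,y\n"
docker_csv = "input,output\nA,\nB,\nC,z\nD,y\n"

cli_empty, n_cli = empty_keys(cli_csv)
docker_empty, n_docker = empty_keys(docker_csv)

both = cli_empty & docker_empty          # empty in both runs
only_cli = cli_empty - docker_empty      # empty on CLI only
only_docker = docker_empty - cli_empty   # empty on Docker only

print(f"CLI empty: {len(cli_empty)}/{n_cli}")
print(f"Docker empty: {len(docker_empty)}/{n_docker}")
print(f"both: {sorted(both)}, CLI only: {sorted(only_cli)}, Docker only: {sorted(only_docker)}")
```

Molecules that are empty in only one of the two runs are the ones that point to inconsistent behaviour, as opposed to molecules that simply have no similar compounds in the database.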

As for the Colab notebook, I don't believe it has to do with output conversion. The step at which the model fails every time is during predictions, so it never quite reaches the output line.

GemmaTuron commented 1 year ago

Hi @febielin

Just check one of the molecules that does not have output on their web server, and then we will know whether the problem is on their side or ours.

febielin commented 1 year ago

Hi @GemmaTuron,

I do not believe the issue is on the web server side.

As I mentioned, both the CLI and Docker outputs had null fields. I posted molecules that produced null outputs on both CLI and Docker, such as the following:

Cn1cnc(c1Sc2ncnc3nc[nH]c23)N+=O
COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OCCCCCOC(=O)CC[N+]4(C)CCc5cc(OC)c(OC)cc5C4Cc6ccc(OC)c(OC)c6)cc1OC
[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)NC@@HC([O-])=O)N2C=O)N1
[Ca++].OCC@@H C@@H C@H C@@HC([O-])=O.OCC@@H C@@H C@H C@@HC([O-])=O

All of these produced outputs on the server. I ran each molecule 3-5 times to rule out the possibility that the molecules sometimes work and sometimes do not. The server seems to produce results pretty reliably. I need to double check with Riley to see if he has the same findings for the GDBChEMBL model. If so, we may need to collaborate to identify why the output is not getting returned to the Ersilia side.

GemmaTuron commented 1 year ago

So we get an answer from the server but not from Ersilia? That is indeed surprising. Let's talk about this in the meeting today.

GemmaTuron commented 1 year ago

@febielin I am unable to run predictions on the server for the molecules you shared. Could you double check, and share the predictions you got so I can compare? Thanks!

febielin commented 1 year ago

Cn1cnc(c1Sc2ncnc3nc[nH]c23)[N+]([O-])=O - link
COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OCCCCCOC(=O)CC[N+]4(C)CCc5cc(OC)c(OC)cc5C4Cc6ccc(OC)c(OC)c6)cc1OC - link
[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O)N2C=O)N1 - link
[Ca++].OC[C@@H](O)[C@@H](O)[C@H](O)[C@@H](O)C([O-])=O.OC[C@@H](O)[C@@H](O)[C@H](O)[C@@H](O)C([O-])=O - link

@GemmaTuron I realize that GitHub is turning parts of the SMILES into hyperlinks, so portions of each molecule are missing, which results in invalid inputs. I've now listed the same molecules as code, so hopefully they are complete.

GemmaTuron commented 1 year ago

Hi @febielin Precisely, the issue is that we are not passing those SMILES to the URL in the correct format. We need to URL-encode them first, as in the following example:

import requests
import urllib.parse

Lines = [
    "[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O)N2C=O)N1"
]

fp     = 'ECfp4'
db     = 'GDBChEMBL'
nnc    = '100'

data = []
for input_smiles in Lines:
    input_smiles = input_smiles.strip()
    # Percent-encode characters such as [, ], +, @ and = so they survive in the query string
    url_encoded_smiles = urllib.parse.quote(input_smiles)

    url = 'https://gdb-chembl-simsearch.gdb.tools/search?smi=' + url_encoded_smiles + '&fp=' + fp + '&db=' + db + '&nnc=' + nnc
    print(url)
    #r = requests.get(url)
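To see why the encoding step matters, here is a minimal sketch that builds the query URL with and without percent-encoding and then decodes the query string with Python's standard `parse_qs`. The helper names `build_url` and `smiles_seen_by_server` are hypothetical, and it is an assumption that the search server decodes query strings in the standard form-encoded way (where a bare `+` means a space). The sketch uses `safe=""` so that `/` is encoded as well, which differs slightly from the default used above. No request is sent.

```python
from urllib.parse import parse_qs, quote, urlsplit

BASE_URL = "https://gdb-chembl-simsearch.gdb.tools/search"  # endpoint from the thread


def build_url(smiles, encode=True, fp="ECfp4", db="GDBChEMBL", nnc="100"):
    """Build the search URL, optionally percent-encoding the SMILES."""
    smi = quote(smiles, safe="") if encode else smiles
    return f"{BASE_URL}?smi={smi}&fp={fp}&db={db}&nnc={nnc}"


def smiles_seen_by_server(url):
    """Decode the 'smi' parameter as standard form decoding would (assumption)."""
    return parse_qs(urlsplit(url).query)["smi"][0]


smiles = "Cn1cnc(c1Sc2ncnc3nc[nH]c23)[N+]([O-])=O"

raw = smiles_seen_by_server(build_url(smiles, encode=False))
enc = smiles_seen_by_server(build_url(smiles, encode=True))

print("original:        ", smiles)
print("sent unencoded:  ", raw)   # the '+' in [N+] is decoded as a space
print("sent encoded:    ", enc)   # round-trips unchanged
```

Under that assumption, any SMILES containing a charge such as `[N+]` or `[Ca++]` would reach the server corrupted when sent unencoded, which would be consistent with charged molecules coming back empty.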

ersilia-os / eos4b8j

Clean UP & Dockerization eos4b8j #1