Closed GemmaTuron closed 1 year ago
Hi @febielin It is not exactly web scraping; we simply do not want to download the whole database, so we search it online. It is expected that some molecules will not have similar ones; could you let me know roughly what proportion that is? I think the error in Colab lies in trying to convert the output to pandas. Have you tried running it without specifying the output?
@GemmaTuron,
Thank you for your comments.
For the Docker output, around 1/6th is missing. For the CLI output, closer to 1/5th. As mentioned above, while there is some overlap in the molecules that come out empty between the two output files, I am also finding a significant number of molecules that are present in one output file but empty in the other. The model is not very consistent in this sense.
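A minimal sketch of how the overlap between the two output files could be quantified. It assumes each run has already been loaded into a dict mapping the input SMILES to its list of output values; the function names and the toy data are hypothetical and are not results from this thread.

```python
def empty_inputs(results):
    """Return the inputs whose output is missing or made up only of nulls.

    `results` is assumed to map each input SMILES to a list of output values.
    """
    return {
        smiles
        for smiles, values in results.items()
        if not values or all(v is None or v == "" for v in values)
    }


def compare_runs(cli, docker):
    """Summarise which inputs come back empty in each run (hypothetical helper)."""
    cli_empty = empty_inputs(cli)
    docker_empty = empty_inputs(docker)
    return {
        "cli_empty_fraction": len(cli_empty) / len(cli) if cli else 0.0,
        "docker_empty_fraction": len(docker_empty) / len(docker) if docker else 0.0,
        "empty_in_both": cli_empty & docker_empty,
        "only_cli_empty": cli_empty - docker_empty,
        "only_docker_empty": docker_empty - cli_empty,
    }


# Toy data for illustration only.
toy_cli = {"CCO": ["CCN", "CCC"], "c1ccccc1": [None, None], "CC(=O)O": []}
toy_docker = {"CCO": [None], "c1ccccc1": [None, None], "CC(=O)O": ["CC(=O)N"]}
summary = compare_runs(toy_cli, toy_docker)
```

Molecules listed under `only_cli_empty` or `only_docker_empty` are the inconsistent ones; molecules under `empty_in_both` are the better candidates for checking against the web server.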
As for the Colab notebook, I don't believe it has to do with output conversion. The model fails every time during the prediction step, so it never reaches the output line.
Hi @febielin
Just check one of the molecules that does not have output on their web server, and then we will know whether it's on their side or ours.
Hi @GemmaTuron,
I do not believe the issue is on the web server side.
As I mentioned, both the CLI and Docker outputs had null fields. I posted molecules that produced null outputs on both CLI and Docker, such as the following:
All of these produced outputs on the server. I ran each molecule 3-5 times to rule out the possibility that the molecules sometimes work and sometimes do not. The server seems to produce results pretty reliably. I need to double-check with Riley to see if he also has the same findings for the GDBChEMBL model. If so, we may need to collaborate to identify why the output is not getting returned to the Ersilia side.
So we get an answer from the server but not from Ersilia? That is indeed surprising; let's talk about this in the meeting today.
@febielin I am unable to run predictions on the server for the molecules you shared. Could you double-check and share the predictions you got so I can compare? Thanks!
- `Cn1cnc(c1Sc2ncnc3nc[nH]c23)[N+]([O-])=O`
- `COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OCCCCCOC(=O)CC[N+]4(C)CCc5cc(OC)c(OC)cc5C4Cc6ccc(OC)c(OC)c6)cc1OC`
- `[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O)N2C=O)N1`
- `[Ca++].OC[C@@H](O)[C@@H](O)[C@H](O)[C@@H](O)C([O-])=O.OC[C@@H](O)[C@@H](O)[C@H](O)[C@@H](O)C([O-])=O`

@GemmaTuron I realize that what's happening is that Github is turning part of the smiles into hyperlinks so portions of the molecule are missing - resulting in invalid inputs. I've listed the same molecules as code now, so hopefully it is complete.
Hi @febielin Precisely, the issue is that we are not passing those SMILES in the correct format to the URL. We need to URL-encode them with the following code (see example):

    import requests
    import urllib.parse  # `import urllib` alone does not guarantee urllib.parse is loaded

    Lines = [
        "[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O)N2C=O)N1"
    ]
    fp = 'ECfp4'
    db = 'GDBChEMBL'
    nnc = '100'
    data = []
    for input_smiles in Lines:
        input_smiles = input_smiles.strip()
        # Percent-encode characters such as [, ], +, = and @ so the URL stays valid
        url_encoded_smiles = urllib.parse.quote(input_smiles)
        url = 'https://gdb-chembl-simsearch.gdb.tools/search?smi=' + url_encoded_smiles + '&fp=' + fp + '&db=' + db + '&nnc=' + nnc
        print(url)
        #r = requests.get(url)
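As a possible alternative, here is a sketch that lets `urllib.parse.urlencode` assemble the whole query string, so every parameter is encoded in one place. The endpoint and the parameter names are taken from the snippet above; `build_search_url` and `decode_smiles` are hypothetical helper names, and passing `safe=""` (so that `/` in a SMILES is also percent-encoded) is an assumption about what the server accepts.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Endpoint as used earlier in this thread.
BASE_URL = "https://gdb-chembl-simsearch.gdb.tools/search"


def build_search_url(smiles, fp="ECfp4", db="GDBChEMBL", nnc=100):
    """Hypothetical helper: build the search URL with every parameter encoded."""
    params = {"smi": smiles.strip(), "fp": fp, "db": db, "nnc": str(nnc)}
    # quote_via=quote with safe="" percent-encodes all reserved characters,
    # including "/", "+", "[", "]", "=" and "@".
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="")


def decode_smiles(url):
    """Recover the SMILES from a URL, to check that nothing was lost in encoding."""
    return parse_qs(urlsplit(url).query)["smi"][0]


if __name__ == "__main__":
    smiles = "[Ca++].NC1=NC(=O)C2=C(NCC(CNc3ccc(cc3)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O)N2C=O)N1"
    url = build_search_url(smiles)
    print(url)
    print(decode_smiles(url) == smiles)
```

Decoding the URL back to the original string is a cheap way to confirm that the SMILES survives the round trip before any request is sent to the server.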
This model appears to pass all actions. It fetches and serves well. However, there are issues when running the eml_canonical file.
I also see that this is one of the models that post to the online server. I wonder if there are issues with web scraping.