RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Deploy FastNGD on arax.rtx.ai #715

Closed edeutsch closed 4 years ago

edeutsch commented 4 years ago

I think we established at today's call that FastNGD is not deployed on arax.rtx.ai, although it seems that some thought it was, or in any case it seems like it is ready to be deployed, although maybe still room for enhancements.

I'm happy to help deploy it, but I have no knowledge about how the pickledb gets created or where it might be copied from and where it should live, etc.

amykglen commented 4 years ago

Did some digging here:

But I believe we concluded on Friday that there don't seem to be any .db files on arax.rtx.ai... so I guess what needs to be done is get the two pickle db files (curie_to_mesh.db (1MB) and mesh_to_pmid.db (735MB)) from /home/ubuntu on pubmed.rtx.ai and put them on arax.rtx.ai... as for where on arax.rtx.ai, I'm not exactly sure what @saramsey intended?

edeutsch commented 4 years ago

My suggestion is to put the .db files on arax.rtx.ai in:

outside container: /data/orangeboard/PubMed/
inside container: /mnt/data/orangeboard/PubMed/

Then let me know when they are there, and I will create symlinks in the various RTX/code/reasoningtool/kg-construction/ locations for production, beta, etc.

Then, whenever you have updated versions of these databases, you just update that one location above and they will be live everywhere.

Sound good?
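
The single-shared-location layout proposed above can be sketched roughly as follows. This is only an illustration under assumptions: it uses a temporary directory as a stand-in for /data/orangeboard/PubMed/, and the deployment names and directory layout are hypothetical, not the actual layout on arax.rtx.ai.

```python
import tempfile
from pathlib import Path

# Stand-in for the real server: a temporary root instead of /data/orangeboard.
root = Path(tempfile.mkdtemp())
shared = root / "PubMed"
shared.mkdir()

# Placeholder files standing in for the two pickledb files named in the thread.
db_names = ["curie_to_mesh.db", "mesh_to_pmid.db"]
for name in db_names:
    (shared / name).write_text("placeholder")

# Hypothetical deployment checkouts, each with its own kg-construction directory.
for deployment in ["production", "beta"]:
    target_dir = root / deployment / "kg-construction"
    target_dir.mkdir(parents=True)
    for name in db_names:
        # Each deployment sees the shared file through a symlink.
        (target_dir / name).symlink_to(shared / name)

# Updating the one shared copy is visible through every link.
(shared / "curie_to_mesh.db").write_text("updated")
beta_view = (root / "beta" / "kg-construction" / "curie_to_mesh.db").read_text()
print(beta_view)
```

The point of the design is that a refreshed database only has to be copied to one place; every deployment picks it up through its link.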

amykglen commented 4 years ago

Ah, nice, that sounds good to me. So when I ssh into arax.rtx.ai, I see that the two .db files actually ARE in there already, at /home/ubuntu.

I'm not familiar with docker and this is the first time I've logged into arax.rtx.ai, so apologies for my ignorance here - would getting them where you want simply be a matter of copying/moving them? like:

cp curie_to_mesh.db /data/orangeboard/PubMed/
cp mesh_to_pmid.db /data/orangeboard/PubMed/

edeutsch commented 4 years ago

that's right!

amykglen commented 4 years ago

Ok cool - the files are in there now!

edeutsch commented 4 years ago

Great, thanks! I actually moved it to a subdirectory called FastNGD to keep things a little tidier.

I have put in sym links and tested with:

add_qnode(name=lovastatin, id=n00)
add_qnode(id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00)
overlay(action=compute_ngd, virtual_edge_type=N1, source_qnode_id=n00, target_qnode_id=n01)
resultify(ignore_edge_direction=true)

But unfortunately got this:

An uncaught error occurred: signal only works in main thread:

Traceback (most recent call last):
  File "/mnt/data/orangeboard/devED/RTX/code/UI/OpenAPI/python-flask-server/swagger_server/../../../../ARAX/ARAXQuery/ARAX_query.py", line 408, in executeProcessingPlan
    result = overlay.apply(message, action['parameters'])
  File "/mnt/data/orangeboard/devED/RTX/code/UI/OpenAPI/python-flask-server/swagger_server/../../../../ARAX/ARAXQuery/ARAX_overlay.py", line 135, in apply
    getattr(self, '' + self.__class__.__name__ + '' + parameters['action'])() # thank you https://stackoverflow.com/questions/11649848/call-methods-by-string
  File "/mnt/data/orangeboard/devED/RTX/code/UI/OpenAPI/python-flask-server/swagger_server/../../../../ARAX/ARAXQuery/ARAX_overlay.py", line 209, in compute_ngd
    NGD = ComputeNGD(self.response, self.message, parameters)
  File "/mnt/data/orangeboard/devED/RTX/code/UI/OpenAPI/python-flask-server/swagger_server/../../../../ARAX/ARAXQuery/Overlay/compute_ngd.py", line 27, in __init__
    self.NGD = NormGoogleDistance.NormGoogleDistance() # should I be importing here, or before the class? Feel like Eric said avoid global vars...
  File "/mnt/data/orangeboard/devED/RTX/code/reasoningtool/kg-construction/NormGoogleDistance.py", line 37, in __init__
    auto_dump=False)
  File "/mnt/data/python/Python-3.7.3/lib/python3.7/site-packages/pickledb.py", line 43, in load
    return PickleDB(location, auto_dump, sig)
  File "/mnt/data/python/Python-3.7.3/lib/python3.7/site-packages/pickledb.py", line 57, in __init__
    self.set_sigterm_handler()
  File "/mnt/data/python/Python-3.7.3/lib/python3.7/site-packages/pickledb.py", line 77, in set_sigterm_handler
    signal.signal(signal.SIGTERM, sigterm_handler)
  File "/mnt/data/python/Python-3.7.3/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

edeutsch commented 4 years ago

hmm, maybe this is the solution: https://github.com/patx/pickledb/issues/50
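
For context on the traceback above: Python only allows signal.signal() to be called from the main thread, so any library that registers a SIGTERM handler while being loaded inside a worker thread (as a threaded web server does) will raise this ValueError. The sketch below reproduces that behaviour with the standard library only. The load_store function and its register_signals flag are hypothetical stand-ins, loosely modelled on the sig argument visible in the traceback (PickleDB(location, auto_dump, sig)); they are not pickledb's actual code.

```python
import signal
import threading


def try_install_handler(outcome):
    """Attempt to register a SIGTERM handler and record what happened."""
    try:
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        outcome.append("installed")
    except ValueError as exc:
        outcome.append(f"ValueError: {exc}")


# Run the registration from a worker thread, as a threaded server would.
outcome = []
worker = threading.Thread(target=try_install_handler, args=(outcome,))
worker.start()
worker.join()
print(outcome[0])


def load_store(register_signals=True):
    """Hypothetical loader: only touches signal handlers when asked to."""
    if register_signals:
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
    return {}


# With signal registration switched off, loading from a worker thread succeeds.
loaded = []
safe_worker = threading.Thread(
    target=lambda: loaded.append(load_store(register_signals=False))
)
safe_worker.start()
safe_worker.join()
print(loaded)
```

The trade-off of skipping the handler is that the store no longer gets a chance to react to SIGTERM itself, which presumably matters little for a database that is only read.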

edeutsch commented 4 years ago

I added the sig=False parameter, and now the pickledb code no longer crashes, so that's good. But it's not obvious that FastNGD is otherwise working: it still seems to be invoking eutils and taking a long time. Logging says:

2020-04-22 17:56:46.001369 INFO: Processing action 'overlay' with parameters {'action': 'compute_ngd', 'virtual_edge_type': 'N1', 'source_qnode_id': 'n00', 'target_qnode_id': 'n01'}
2020-04-22 17:56:52.097455 DEBUG: Computing NGD
2020-04-22 17:56:52.097484 INFO: Computing the normalized Google distance: weighting edges based on source/target node co-occurrence frequency in PubMed abstracts
2020-04-22 17:56:52.097491 INFO: Converting CURIE identifiers to human readable names
2020-04-22 17:56:52.097547 WARNING: Utilizing API calls to NCBI eUtils, so this may take a while...
2020-04-22 17:57:38.863209 DEBUG: Applying Overlay to Message with parameters {'action': 'compute_ngd', 'virtual_edge_type': 'N1', 'source_qnode_id': 'n00', 'target_qnode_id': 'n01', 'default_value': inf}
2020-04-22 17:57:38.863803 DEBUG: Query graph is {'edges': [{'id': 'e00', 'negated': None, 'relation': None, 'source_id': 'n00', 'target_id': 'n01', 'type': None}, {'id': 'N1', 'negated': None, 'relation': 'ngd', 'source_id': 'n00', 'target_id': 'n01', 'type': 'N1'}], 'nodes': [{'curie': 'CHEMBL.COMPOUND:CHEMBL503', 'id': 'n00', 'is_set': None, 'type': 'chemical_substance'}, {'curie': None, 'id': 'n01', 'is_set': None, 'type': None}]}
2020-04-22 17:57:38.863820 DEBUG: Number of nodes in KG is 62
2020-04-22 17:57:38.863896 DEBUG: Number of nodes in KG by type is Counter({'protein': 29, 'phenotypic_feature': 21, 'disease': 11, 'chemical_substance': 1})
2020-04-22 17:57:38.863905 DEBUG: Number of edges in KG is 151
2020-04-22 17:57:38.863998 DEBUG: Number of edges in KG by type is Counter({'N1': 61, 'physically_interacts_with': 58, 'contraindicated_for': 26, 'indicated_for': 6})
2020-04-22 17:57:38.864046 DEBUG: Number of edges in KG with attributes is 61
2020-04-22 17:57:38.864147 DEBUG: Number of edges in KG by attribute Counter({'ngd': 61})

Do we know of a test query that is known to use 100% FastNGD?
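
For reference, the quantity the log calls the normalized Google distance can be sketched from its standard definition: given the number of documents mentioning each term, the number mentioning both, and the corpus size, NGD = (max(log fx, log fy) - log fxy) / (log N - min(log fx, log fy)). The function below is a minimal illustration of that formula, not the code in NormGoogleDistance.py; the counts and the corpus size are made-up numbers, and returning infinity when the terms never co-occur is an assumption that mirrors the 'default_value': inf seen in the log.

```python
import math


def normalized_google_distance(count_x, count_y, count_xy, total_docs):
    """Standard NGD formula over document counts.

    count_x, count_y: documents mentioning each term
    count_xy: documents mentioning both terms
    total_docs: size of the corpus (an assumed value in this sketch)
    """
    if count_x == 0 or count_y == 0 or count_xy == 0:
        # No evidence of co-occurrence: treat the terms as infinitely distant.
        return math.inf
    log_x, log_y = math.log(count_x), math.log(count_y)
    log_xy = math.log(count_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total_docs) - min(log_x, log_y))


# Made-up counts, purely for illustration.
same = normalized_google_distance(1000, 1000, 1000, 1_000_000)
partial = normalized_google_distance(1000, 500, 100, 1_000_000)
never = normalized_google_distance(1000, 500, 0, 1_000_000)
print(same, partial, never)
```

Terms that always appear together score 0, and the score grows as co-occurrence becomes rarer relative to how often each term appears on its own.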

edeutsch commented 4 years ago

I just pushed this fix to NormGoogleDistance (for multithreading), but the underlying FastNGD miss is not solved.

amykglen commented 4 years ago

I don't think Chembl has any coverage currently (though the next enhancement will change that I think) - I think DOID and OMIM are some of the better options...

You can see what David used for testing in this comment

edeutsch commented 4 years ago

okay, thanks, trying this:

add_qnode(curie=DOID:10223, id=n00)
add_qnode(type=phenotypic_feature, is_set=True, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00, type=has_phenotype)
expand(edge_id=e00)
overlay(action=compute_ngd)

But, I guess we don't expect that to be fast either. I wonder if we know of a query where we get 100% FastNGD hits so it's fast and we know it's really working?

Also, I suppose I assumed we were essentially reimplementing what they do at NCBI so our coverage would be identical to theirs? I'm coming to the realization that that is not really the case?
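
One way to reason about the coverage question is to measure how many CURIEs in a query can be resolved locally at all. The sketch below assumes, purely for illustration, that the two database files behave like mappings from CURIE to a list of MeSH terms and from MeSH term to a list of PMIDs; that structure is inferred from the file names curie_to_mesh.db and mesh_to_pmid.db, and the MeSH identifier and PMIDs are invented placeholders. The function names are hypothetical too.

```python
def pmids_for_curie(curie, curie_to_mesh, mesh_to_pmid):
    """Return the set of PMIDs for a CURIE, or None when the local maps miss."""
    mesh_terms = curie_to_mesh.get(curie)
    if not mesh_terms:
        return None
    pmids = set()
    for term in mesh_terms:
        pmids.update(mesh_to_pmid.get(term, []))
    return pmids or None


def local_hit_rate(curies, curie_to_mesh, mesh_to_pmid):
    """Fraction of CURIEs resolvable locally, plus the ones that would fall back."""
    misses = [
        curie for curie in curies
        if pmids_for_curie(curie, curie_to_mesh, mesh_to_pmid) is None
    ]
    rate = (len(curies) - len(misses)) / len(curies) if curies else 0.0
    return rate, misses


# Toy stand-ins for the two databases; the identifiers on the right are made up.
curie_to_mesh = {"DOID:10223": ["MESH:EXAMPLE1"]}
mesh_to_pmid = {"MESH:EXAMPLE1": [111, 222, 333]}

rate, misses = local_hit_rate(
    ["DOID:10223", "CHEMBL.COMPOUND:CHEMBL503"], curie_to_mesh, mesh_to_pmid
)
print(rate, misses)
```

A query whose nodes all resolve locally (rate of 1.0 and no misses) would be a candidate for the "100% FastNGD" test asked about here; any miss implies a slower fallback path.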

edeutsch commented 4 years ago

That query also comes back with something after a long while, so not really clear what's going on.

I notice that all the edge attributes look like this:

"edge_attributes": [
  {
    "name": "ngd",
    "type": "float",
    "url": "https://arax.rtx.ai/api/rtx/v1/ui/#/PubmedMeshNgd",
    "value": "0.4408345706373374"
  }
],

I wonder if it would be a good idea if NGDs that come from FastNGD look a little different than those coming from eutils? Maybe they are and I'm not seeing it? But maybe a different URL would be nice. That URL isn't really correct anyway, and I doubt anyone is using it, so maybe in the interim while we're testing we can have two different URLs for FastNGD and eutils NGD so we can see how it's working?
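
The idea of tagging each NGD value with where it came from could look something like the sketch below. Everything here is an assumption made for illustration: the function names, the source labels, and the two example.org URLs are placeholders, not endpoints that exist in ARAX. The attribute shape follows the edge attribute shown above, including the string-typed value.

```python
from collections import Counter

# Placeholder URLs: hypothetical, chosen only so the two sources differ.
NGD_SOURCE_URLS = {
    "fastngd": "https://example.org/ngd/fastngd",
    "eutils": "https://example.org/ngd/eutils",
}


def make_ngd_attribute(value, source):
    """Build an 'ngd' edge attribute whose URL records which path produced it."""
    if source not in NGD_SOURCE_URLS:
        raise ValueError(f"unknown NGD source: {source!r}")
    return {
        "name": "ngd",
        "type": "float",
        "url": NGD_SOURCE_URLS[source],
        "value": str(value),
    }


def count_by_source(attributes):
    """Tally attributes per source by looking at their URLs."""
    url_to_source = {url: name for name, url in NGD_SOURCE_URLS.items()}
    return Counter(url_to_source[attr["url"]] for attr in attributes)


attributes = [
    make_ngd_attribute(0.4408345706373374, "fastngd"),
    make_ngd_attribute(0.52, "eutils"),
    make_ngd_attribute(0.31, "fastngd"),
]
tally = count_by_source(attributes)
print(tally)
```

With something like this in place during testing, a response could be checked at a glance for how many edges were served locally versus through eutils.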