Get NGD working with ARAX_overlay

dkoslicki commented 4 years ago

[x] start with a stub
[x] get it working with existing NGD code from Finn
[x] add option to introduce virtual edge similar to overlay_clinical_info
[ ] get it working with Dan's code

edeutsch commented 4 years ago

Note that this is probably pretty easy:

Loop over all message.knowledge_graph.nodes[] to build a lookup dict by curie
After that, loop over all message.knowledge_graph.edges[], get the source_id, target_id curies and names from the above dict, and then basically do what the NGD controller does for each pair: https://github.com/RTXteam/RTX/blob/master/code/UI/OpenAPI/python-flask-server/swagger_server/controllers/pubmed_mesh_ngd_controller.py and then insert that information into each message.knowledge_graph.edges[] edge as an EdgeAttribute That's basically it. There will be some details. As you can see in the pubmed_mesh_ngd_controller.py, I had to do a chdir to get it to work right. I'm guessing the it didn't work right without that. Or maybe requestscache always put the cache dir in the cwd, so if the cwd changes depending on how the whole system starts, that might affect things (remember, when the whole API interface starts, its cwd is not "kg-construction".)

edeutsch commented 4 years ago

It's probably just 30 lines of code. I'd just do it all in ARAX_overlay.py for now so you don't have the overhead of another class at this point..

edeutsch commented 4 years ago

i.e. just modify/expand a little lines 76-93 in situ before trying anything fancy with other classes.

dkoslicki commented 4 years ago

@edeutsch Yup, your plan is exactly what I have in mind. However, I would like to do it in a separate class since:

I want to keep ARAX_overlay.py as thin as possible given the number of different overlay components I have in mind.
I want the practice of calling a separate class for each overlay component as that's the paradigm I/we had in mind per the last hackathon
I'm going to test integrating Dan Lin's NGD code and it may be a bit complicated (i.e. not quite just 30 lines and a call out).

edeutsch commented 4 years ago

okay, sounds good.

dkoslicki commented 4 years ago

Stub complete in e4161f84 (note to self, in commit messages, issues don't get referenced if there's a space between the # and the number).

dkoslicki commented 4 years ago

It's working, but is super slow, as expected. Blocking on a setup and demo script from @power10dan as after I have that, I'll switch out the old NGD for the shiny new and improved one!

dkoslicki commented 4 years ago

@amykglen If you’ve made any attempts or run into problems trying to integrate Dan’s code for ngd, please keep me posted here. Depending on the difficulties you run into, I may give it a go myself, and then we can decide if we scrap it until post-demo or not.

amykglen commented 4 years ago

Will do! I haven't tried yet (was traveling the last few days), but plan to dive in tomorrow. Will keep you updated!

amykglen commented 4 years ago

An update (or lack thereof): I've been working on trying to integrate Dan's code this afternoon, but have not made much (or any) progress. I've been trying to get sample_workflow.py to run, but haven't gotten there yet.. (been working through errors - for example, the function in the file refers to 'self', but it is not part of a class..)

Generally pretty confused trying to figure out what's going on in the code, and how to get set up... for example, config_file.py indicates that you need to run 'your pyspark server' somewhere (currently set to localhost:6379), but I haven't figured out what the 'pyspark server' is and how to run it.

amykglen commented 4 years ago

So I bailed on working with sample_workflow.py and instead am trying to start by running XMLExtractorTests.py... have been making slightly more progress on this front, but the latest error I'm getting is:

ImportError: cannot import name 'spark_redis_jar' from 'config_file'

There seems to be no variable in config_file.py called spark_redis_jar, but SparkObjectBuilder.py tries to import such a variable...

(It's seeming that Redis is a key piece of the system(?), but I don't see a note on that anywhere ... apparently port 6379 is the default Redis port...)

amykglen commented 4 years ago

Huh, apparently spark_redis_jar was deleted in this commit: https://github.com/RTXteam/RTX/commit/76048ab39ebe816a5ab793a40e51970500243c78

amykglen commented 4 years ago

(I went ahead and deleted the spark_redis_jar import as it actually doesn't seem to be used)

So I am now able to run XMLExtractorTests.py (just using a small dataset from PubMed for now, to see if I can get going before worrying about all of PubMed): the function test_read_update_df in that file seems to run ok, but then the rest of the functions in there error out like so:

Traceback (most recent call last):
  File "XMLExtractorTests.py", line 45, in test_read_entire_schema
    df_entire = self.xml_extractor.get_entire_df()
  File "/Users/aglen/translator/RTX/RTX/code/reasoningtool/QuestionAnswering/XMLExtractor.py", line 263, in get_entire_df
    self.df_sample)
AttributeError: 'Spark_XML_ETL' object has no attribute 'df_sample'

dkoslicki commented 4 years ago

Note to self: would be really useful to create classes like add_kg_edge and add_kg_edge_attribute and call them in all the overlay functions. But, you know, time constraints...

edeutsch commented 4 years ago

actions really should not be classes themselves. Rather classes should things that can perform actions, which are the methods of the class. I already have the (somewhat fancifully named) ARAXMessenger class, which has methods like add_qedge(). It would natural to put add_kg_edge() or even just add_edge() there.

In theory add_edge() should be a method of the KnowledgeGraph class, but this is a little tricky because that class code is autogenerated by the Swagger framework, so we shouldn't manually edit knowledge_graph.py and similar files. There are probably some elegant ways to solve this problem. But for now I just had made the ARAXMessenger class.

dkoslicki commented 4 years ago

@edeutsch True. I was mostly referring to the classes/modules in the Overlay subfolder that could contain a module/class that handles the addition of QG edges and KG edges depending on:

The action that was called
The sub-action that the action facilitates (eg. COHD can do many different things)
Custom handling of minor discrepancies between overlay actions.

Basically, I am noticing that the *.py scripts in the ARAXQuery/Overlay folder have a lot of duplicate code that only slightly varies and figured there might be an elegant way to get around this. Currently overlay_clinical_info.py is the "cleanest" one so far given that the individual COHD sub-actions paired_concept_freq, observed_expected_ratio and chi_square were almost trivial to implement after I got OverlayClinicalInfo().make_edge_attribute_from_curies(), OverlayClinicalInfo().add_virtual_edge() and OverlayClinicalInfo().add_all_edges() working

dkoslicki commented 4 years ago

Closing as now the work is being done in #654 for NGD

RTXteam / RTX

Get NGD working with ARAX_overlay #605