Move ARAX to KG2.5.2 - Githubissues

amykglen commented 3 years ago

documentation about what things will need updating is at: https://github.com/RTXteam/RTX/wiki/Deployment-info#things-that-need-updating-when-rolling-out-a-new-kg2-version

rebuild/edit these files and put them in /translator/data/orangeboard/databases/KG2.5.2 on arax.ncats.io:

[x] configv2.json
[x] NodeSynonymizer (node_synonymizer.sqlite)
[x] NGD database (curie_to_pmids.sqlite)
[x] COHD database (COHDdatabase_vX.X.db) @chunyuma
[x] DTD model (LogModel.pkl, GRAPH.sqlite) @chunyuma
[ ] DTD database (DTD_probability_database.db) @chunyuma

Note: As databases are rebuilt, the new copy of configv2.json will need to be updated to point to their new paths.

other to do's:

[x] build new version of KG2c (will be put up at http://kg2c-5-2.rtx.ai:7474/browser/)
[x] make any necessary changes to the code base (put these in the kg2.5integration branch)
[x] rebuild the allowed node/edge type CSVs and push to the kg2.5integration branch @finnagin

finally:

[x] once all of the above is done, test everything together (new code, new databases, etc.) in a single dev environment
[x] then finally roll all this out to production

amykglen commented 3 years ago

ok, @chunyuma - a new KG2c built from KG2.5.1 is up at: http://kg2canonicalized2.rtx.ai:7474/browser/

note that these node properties have been renamed like so:

preferred_type -> category
types -> all_categories

chunyuma commented 3 years ago

Thanks @amykglen!

saramsey commented 3 years ago

Tagging @kvarforl to also read this issue, as a part of cross-training on the ARAX system and its operations/deployment. In particular, it would be good for @kvarforl to get up-to-speed on how to build KG2c (and verify that the build was successful) and the NodeSynonymizer database.

finnagin commented 3 years ago

Generated the csvs for KG2C looks like it has ~33,000 rows in the allowed predicate triples.

amykglen commented 3 years ago

rebuild of kg2c.5.2 with the tweaked NodeSynonymizer is done, and things seem to look good! looks like the category_manager changes @edeutsch made for conflation of ancestors and such were successful, maybe with one very minor exception - just a handful of nodes remain that are "ChemicalSubstance"s but are missing "Drug" in their expanded_categories:

match (n:`biolink:ChemicalSubstance`) where not "biolink:Drug" in n.expanded_categories return n.id, n.name, n.expanded_categories

n.id	n.name	n.expanded_categories
"ttd.target:CAR-T Cells targeting Mesothelin"	"CAR-T Cells targeting Mesothelin"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T cells targeting Mesothelin"	"CAR-T cells targeting Mesothelin"	["biolink:ChemicalSubstance"]
"ttd.target:CD19-targeting CAR T cells"	"CD19-targeting CAR T cells"	["biolink:ChemicalSubstance"]
"ttd.target:CD19 CAR-T cells"	"CD19 CAR-T cells"	["biolink:ChemicalSubstance"]
"ttd.target:CD19 CAR T cells"	"CD19 CAR T cells"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T Cells targeting EpCAM"	"CAR-T Cells targeting EpCAM"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T cells targeting CEA"	"CAR-T cells targeting CEA"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T Cells targeting CD19"	"CAR-T Cells targeting CD19"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T Cells targeting PSCA"	"CAR-T Cells targeting PSCA"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T cells targeting Muc1"	"CAR-T cells targeting Muc1"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-GPC3 T Cells"	"CAR-GPC3 T Cells"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T cells targeting HER2"	"CAR-T cells targeting HER2"	["biolink:ChemicalSubstance"]
"ttd.target:CAR-T cells targeting MucI"	"CAR-T cells targeting MucI"	["biolink:ChemicalSubstance"]

I think this definitely doesn't need to stop us from continuing with this roll out (such a small number of affected nodes), but reporting these here for @edeutsch.

I'm getting close to having the pytest suite passing again in kg2.5integration - will likely reach out to a couple people shortly about some lingering issues.

edeutsch commented 3 years ago

awesome news, @amykglen ! Looking into this a little, it looks to me that NodeSynonymizer get_equivalent_nodes(curies=curies,return_all_categories=True) appears to be doing the right thing:

{
  "ttd.target:CAR-T Cells targeting Mesothelin": {
    "all_categories": {
      "biolink:ChemicalSubstance": 1
    },
    "expanded_categories": {
      "biolink:BiologicalEntity": true,
      "biolink:ChemicalSubstance": true,
      "biolink:Drug": true,
      "biolink:MolecularEntity": true,
      "biolink:NamedThing": true
    },
    "ttd.target:CAR-T Cells targeting Mesothelin": "biolink:ChemicalSubstance"
  }
}

So I'm puzzled why KG2C isn't showing that..

But one slightly alarming thing I notice is that all these CURIEs all have spaces in them! Is that allowed? I vaguely recall writing code in the past assuming that CURIEs don't have space in them.

I wonder if this space is somehow related?

amykglen commented 3 years ago

huh, very weird! I'll have to look more into why those aren't making it into KG2c... not seeing an obvious reason.

amykglen commented 3 years ago

hmm, ok, think I've narrowed it down here. so when I send only the curie 'ttd.target:CAR-T Cells targeting Mesothelin' to get_canonical_curies() like you did, I get the same response as you. but when I send that curie PLUS one of its equivalent curies (the same but with lowercase 'cells'), it then claims it doesn't recognize 'ttd.target:CAR-T Cells targeting Mesothelin'. i.e., doing:

curies = ['ttd.target:CAR-T Cells targeting Mesothelin', 'ttd.target:CAR-T cells targeting mesothelin']
print(synonymizer.get_canonical_curies(curies,  return_all_categories=True))

prints out:

{
  "ttd.target:CAR-T Cells targeting Mesothelin":"None",
  "ttd.target:CAR-T cells targeting mesothelin":{
    "preferred_curie":"ttd.target:CAR-T Cells targeting Mesothelin",
    "preferred_name":"CAR-T Cells targeting Mesothelin",
    "preferred_category":"biolink:ChemicalSubstance",
    "expanded_categories":{
      "biolink:ChemicalSubstance":true,
      "biolink:Drug":true,
      "biolink:MolecularEntity":true,
      "biolink:BiologicalEntity":true,
      "biolink:NamedThing":true
    },
    "all_categories":{
      "biolink:ChemicalSubstance":1
    }
  }
}

so because 'canonical info' isn't returned for 'ttd.target:CAR-T Cells targeting Mesothelin', KG2c defaults to saying that all_categories and expanded_categories must just contain the node's category from KG2. (the reasoning behind having this logic in place was mainly because some nodes in KG2 don't have names, and thus wouldn't be recognized by the synonymizer, but nonetheless should appear in KG2c).

perhaps related, these curies also seem to be affected by an issue I think I've reported before (#1074):

when I do

curies = ['ttd.target:CAR-T Cells targeting Mesothelin', 'ttd.target:CAR-T cells targeting mesothelin']
print(synonymizer.get_equivalent_nodes(curies))

this is what's printed

{
  "ttd.target:CAR-T Cells targeting Mesothelin":"None",
  "ttd.target:CAR-T cells targeting mesothelin":{
    "ttd.target:CAR-T Cells targeting Mesothelin":"biolink:ChemicalSubstance"
  }
}

so you can see the curie with the uppercase 'Cells' is listed as an equivalent curie for the curie with the lowercase 'cells', but not vice versa.

amykglen commented 3 years ago

ok, code changes are largely finished in kg2.5integration, but a few more tests are still failing (think these belong to @chunyuma and @finnagin ?):

FAILED test_ARAX_expand.py::test_COHD_expand_chi_square - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_expand.py::test_DTD_expand_1 - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_filter_kg.py::test_error - AssertionError: assert 'OK' == 'ERROR'
FAILED test_ARAX_overlay.py::test_FET_ex1 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_example_3 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_FET_ranking - AssertionError: assert 'ERROR' == 'OK'

maybe some of these should be failing, since we decided to move forward without waiting for DTD? but would be great if @chunyuma and @finnagin could take a look at any tests of theirs in this list and push fixes as appropriate to kg2.5integration.

(note that in order to reproduce these, you will likely need to be using the configv2.json and node_synonymizer.sqlite that are in /translator/data/orangeboard/databases/KG2.5.2 - that's what I'm using)

finnagin commented 3 years ago

Looks like test_ARAX_filter_kg.py::test_error is failing because the edge predicate that is being returned in message.knowledge_graph is contraindicated_for rather than biolink:contraindicated_for so the filter does not remove the edge it is supposed to. Is this something that is expected? I thought we were supposed to be only working with biolink:contraindicated_for now that we are switched to KG2.5.2.

For reference the query is:

add_qnode(name=DOID:1227, key=n00)
add_qnode(category=biolink:ChemicalSubstance, key=n01)
add_qedge(subject=n00, object=n01, key=e00)
expand(edge_key=e00, kp=ARAX/KG1)
resultify(ignore_edge_direction=true)
return(message=true, store=false)

finnagin commented 3 years ago

It looks like it is correct in kg2canonicalized2 not sure why this is returning predicated without the biolink: prefix. Maybe something wrong with my setup? I'm using the configv2.json in /translator/data/orangeboard/databases/KG2.5.2. Could you try running that query and see if you also get contraindicated_for edges without the biolink prefix back?

finnagin commented 3 years ago

Lol nevermind just realized it's expanding on KG1 which is why this is returning the non biolink prefix edges. Though this brings up an interesting question: Do we want to have a patch that converts the results of KG1 expand calls or are we in the process of phasing out KG1 anyways so it's not worth it?

edeutsch commented 3 years ago

I think we just discontinue use of KG1 now. So I suggest changing all tests away from KG1 and we turn it off eventually.

finnagin commented 3 years ago

Yeah, that makes sense. From a quick grep it looks like ARAX/KG1 is use serveral times throughout test_ARAX_expand, test_ARAX_resultify, test_ARAX_workplows, test_ARAX_filter_kg, test_ARAX_filter_results, and test_ARAX_overlay. This may take a bit of refactoring. Do we want to make a separate issue for converting all of these?

dkoslicki commented 3 years ago

@finnagin I could go either way with making a new ticket for the KG1 in the tests: it blocks this issue, but I imagine a simple sed -i or manually replace of KG1->KG2 should work (and a slight adjustment of the asserts. Do you think more refactoring than that will be required?

amykglen commented 3 years ago

I would vote for creating a separate issue for that; doesn't need to hold up this issue. I think it would likely break a lot of tests (since it changes their output), and it's also going to slow the test suite waayy down. may be smart to wait until we have a faster system for KG2c in place.

finnagin commented 3 years ago

@dkoslicki I do think more refactoring will likely be required. As @amykglen said we will likely need to adjust a few tests to speed them up. For example I was just changing test_warning in test_ARAX_filter_kg.py and it is now incredibly slow as it uses overlay(action=add_node_pmids) which apparently hits NCBIeutils and thus takes a long time with the number of results returned.

finnagin commented 3 years ago

@amykglen I was able to fix test_ARAX_workflows.py::test_example_3 by changing it to use KG2 instead of KG1 however now it takes ~218s to complete so I have added a pytest slow marker to it. Does this seem reasonable for now while we wait for a faster KG2c system?

edeutsch commented 3 years ago

Regarding this from @amykglen

this is what's printed
{
"ttd.target:CAR-T Cells targeting Mesothelin":"None",
"ttd.target:CAR-T cells targeting mesothelin":{
"ttd.target:CAR-T Cells targeting Mesothelin":"biolink:ChemicalSubstance"
}
}
so you can see the curie with the uppercase 'Cells' is listed as an equivalent curie for the curie with the lowercase 'cells', but not vice versa.

I understand why this is happening. There is an underlying assumption in NodeSynonymizer that CURIEs can be treated as case insensitive. Here we have two CURIEs that differ only in case. @saramsey already flagged this as a potential bug? I'm thinking that NodeSynonymizer code doesn't need to change, and this will no longer be a problem if we can resolve these weird CURIEs.

thoughts?

amykglen commented 3 years ago

@finnagin - that makes good sense to me!

chunyuma commented 3 years ago

ok, code changes are largely finished in kg2.5integration, but a few more tests are still failing (think these belong to @chunyuma and @finnagin ?):
FAILED test_ARAX_expand.py::test_COHD_expand_chi_square - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_expand.py::test_DTD_expand_1 - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_filter_kg.py::test_error - AssertionError: assert 'OK' == 'ERROR'
FAILED test_ARAX_overlay.py::test_FET_ex1 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_example_3 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_FET_ranking - AssertionError: assert 'ERROR' == 'OK'
maybe some of these should be failing, since we decided to move forward without waiting for DTD? but would be great if @chunyuma and @finnagin could take a look at any tests of theirs in this list and push fixes as appropriate to kg2.5integration.

(note that in order to reproduce these, you will likely need to be using the configv2.json and node_synonymizer.sqlite that are in /translator/data/orangeboard/databases/KG2.5.2 - that's what I'm using)

Hi @amykglen, I fixed most of the failing tests you posted. Only one can't be fixed right now until the new DTD model and database are re-built is test_ARAX_expand.py::test_DTD_expand_1.

      "add_qnode(name=acetaminophen, key=n0)",
      "add_qnode(name=Sotos syndrome, key=n1)",
      "add_qedge(subject=n0, object=n1, key=e0)",
      "expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
      "return(message=true, store=false)"

In this case, it expands a virtual edge from acetaminophen to Sotos syndrome based on DTD probability > 0. However, the preferred curie of Acetaminophen is RXNORM:161 based on Nodesynonymizer v2.5.2. However, in v2.3.4, the preferred curie of ACETAMINOPHEN is CHEMBL.COMPOUND:CHEMBL112 rather than RXNORM:161 and RXNORM:161 is just one of its equivalent_curies. So the old DTD model doesn't contain this curie id and thus it can't calculate a probability. But this should be fixed after the new DTD model is re-built.

@amykglen, Please help me double check if they are really fixed in your side.

edeutsch commented 3 years ago

Note that RXNORM:161 was not intended as the leader of the Acetaminophen group. This was due to a bug, which I believe I have just fixed. A new NodeSynonymizer database is currently being rebuilt and should be done shortly. I will NOT destroy the previous one yet in databases/KG2.5.2.

A relevant question is what to do with it. Depending on where you all are in your process, you can choose to use the new one or not. There are known problems with it that will take a while to fix and so it is useful to consider how to apply those fixes.

One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?

amykglen commented 3 years ago

One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?

I think this is a great idea! another idea I had was that perhaps DTD could sort of have its own 'preferred' curies - e.g., maybe it chooses to always use the curie that comes first when you alphabetize the list of equivalent curies; then it wouldn't really matter to DTD what the synonymizer says the preferred curie is, it could just grab the equivalent curies for the input node, find the one that comes first alphabetically, and look that up in the database?

chunyuma commented 3 years ago

One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?

I think this is a good idea to make the v2.3.4 DTD database temporally compatible with kg2.5.2 before the new DTD database is re-built which needs around 2 weeks. Once the revised NodeSynonymizer database is built, I can do this.

amykglen commented 3 years ago

great! and you could also run the same script on the 2.5.2 DTD database after it finishes building, right, @chunyuma? (since this latest synonymizer will have some differences in preferred curies vs. the one DTD is currently being built off of)

chunyuma commented 3 years ago

But I think this might still have some cases that some preferred curies in kg2.5.2c are missing in v2.3.4 DTD database because currently the curie ids stored in v2.3.4 DTD database are preferred curies based on the v2.3.4 NodeSynonymizer. So if we use the idea proposed by @edeutsch, it might still miss some curies whose preferred curies based on v2.5.2 NodeSynonymizer are not the preferred curies based on v2.3.4 NodeSynonymizer.

edeutsch commented 3 years ago

I'm thinking that you loop through all entities in the DTD and ask the latest NodeSynonymizer "what is the preferred curie today for the CURIE in DTD?" and if that answer is different than in DTD, then update it. It shouldn't matter too much if it was preferred or not before?

The problem comes when there is splitting or coalescing of clusters. that will happen. rarely but will happen. I propose we just do the best we can.

edeutsch commented 3 years ago

So as of now, there is in /mnt/data/orangeboard/databases/KG2.5.2/

-rw-r--r-- 1 giovanmi bauerga 16602951680 Mar 13 05:07 node_synonymizer.sqlite
-rw-r--r-- 1 giovanmi bauerga 16607158272 Mar 11 18:58 node_synonymizer.sqlite-obsolete
-rw-r--r-- 1 rt       rt      16667919360 Mar 20 04:49 node_synonymizer_v1.1.sqlite

I propose we delete the "obsolete" file. I don't know what it is now any more.

The one I just created and propose we use going forward if possible is v1.1.

DTD building can continue with the previous version if you can write a "refreshing" system to refresh when you're done. By the time you're done, there may be even another update.

chunyuma commented 3 years ago

great! and you could also run the same script on the 2.5.2 DTD database after it finishes building, right, @chunyuma? (since this latest synonymizer will have some differences in preferred curies vs. the one DTD is currently being built off of)

But the problem I just mentioned might still exist. Since currently I'm building the DTD model based on the current version of synonymizer which only considered the preferred curies, if there is a curie which is not the preferred curie based on the current version of synonymizer become a preferred curie after @edeutsch revise the synonymizer. Even though I run the same script on the 2.5.2 DTD database, this curie will be still missing in DTD database.

chunyuma commented 3 years ago

I'm thinking that you loop through all entities in the DTD and ask the latest NodeSynonymizer "what is the preferred curie today for the CURIE in DTD?" and if that answer is different than in DTD, then update it. It shouldn't matter too much if it was preferred or not before?

The problem comes when there is splitting or coalescing of clusters. that will happen. rarely but will happen. I propose we just do the best we can.

Yes, @edeutsch I think your idea is pretty good idea and it is the best way that we can do. But what I just said is that if one curie is not a preferred curie based on v2.3.4 NodeSynonymizer but becomes a preferred curie in v2.5.2. This curie will still miss although I loop through all entities by using the latest version of NodeSynonymizer.

chunyuma commented 3 years ago

So as of now, there is in /mnt/data/orangeboard/databases/KG2.5.2/
-rw-r--r-- 1 giovanmi bauerga 16602951680 Mar 13 05:07 node_synonymizer.sqlite
-rw-r--r-- 1 giovanmi bauerga 16607158272 Mar 11 18:58 node_synonymizer.sqlite-obsolete
-rw-r--r-- 1 rt       rt      16667919360 Mar 20 04:49 node_synonymizer_v1.1.sqlite
I propose we delete the "obsolete" file. I don't know what it is now any more.

The one I just created and propose we use going forward if possible is v1.1.

DTD building can continue with the previous version if you can write a "refreshing" system to refresh when you're done. By the time you're done, there may be even another update.

No problem! Once the new DTD is done, I can use the latest version of nodesynonymizer to refresh it. If the difference of nodesynonymizer is minor, it should not affect too much.

edeutsch commented 3 years ago

Yes, @edeutsch I think your idea is pretty good idea and it is the best way that we can do. But what I just said is that if one curie is not a preferred curie based on v2.3.4 NodeSynonymizer but becomes a preferred curie in v2.5.2. This curie will still miss although I loop through all entities by using the latest version of NodeSynonymizer.

Hmm, I think I don't understand your concern. For each and every curie in any version of DTD, today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. The only problem comes when there is merging or splitting or radical changes in category. which will happen, but hopefully rarely.

chunyuma commented 3 years ago

Hmm, I think I don't understand your concern. For each and every curie in any version of DTD, today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. The only problem comes when there is merging or splitting or radical changes in category. which will happen, but hopefully rarely.

Right, I know that today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. But when I build DTD model and DTD database, I only used the curies ids of kg2.5.2c (which are the preferred curies) but not kg2.5.2. This means that the DTD model and database only have the preferred curies based on today's NodeSynonymizer. Assume for the future version of NodeSynonymizer, it tells me that one curie in today's kg2.5.2 which is not considered as a preferred curie based on the today's NodeSynonymizer (this means that this curie is not in DTD model and database) but becomes a preferred curie. Does this make sense?

chunyuma commented 3 years ago

Here is an example:

'DOID:5849' is not a curie id of today's kg2.5.2c but is in the equivalent_curies of 'MONDO:000367' in kg2.5.2c. Assume in the revised NodeSynonymizer, it becomes a preferred curie. In this case, although I loop through all entities in the DTD by using the revised NodeSynonymizer, it will still not exist in DTD because currently this curie is not in DTD.

edeutsch commented 3 years ago

Ah, maybe the confusion is my assumption that NodeSynonymzer and KG2C are always in sync.

so this is a good example. Let's say that your already-built DTD has MONDO:000367 as a disease. Tomorrow I create a NodeSynonymizer where DOID:5849 is the preferred CURIE. And the KG2C is rebuilt with this new preferred curie. NodeSynonymizer and KG2C are in sync. but DTD is out of sync.

So the you run on DTD:

for current_curie in DTD.curies:
    potential_new_curie = synonymizer.get_canonical_curie(current_curie)
    if current_cure != potential_new_curie:
         DTD.execute(f"UPDATE curies SET curie = '{potential_new_curie}' WHERE curie = '{current_curie}'")

by doing this, your MONDO:000367 is updated to DOID:5849 and DTD is once again in sync with the latest NodeSynonymizer and KG2C.

chunyuma commented 3 years ago

Hi @edeutsch, I see your point. But the situation that I concerned is like this although I don't know if it will happen:

Let's say we have three curie ids: MONODO:1234, DTD:1234, DTD:5678.

Currently, MONODO:1234 is the synonym of DTD:1234 and MONODO:1234 is a preferred curie but DTD:1234 is not. My already-built DTD has MONODO:1234 as a disease. DTD:5678 is also a preferred curie but currently is not considered as a synonym of MONODO:1234 or DTD:1234. Tomorrow you create a NodeSynonymizer where MONODO:1234 becomes a synonym of DTD:5678 while DTD:5678 is still a preferred curie but MONODO:1234 is not anymore. But DTD:1234 becomes a preferred curie.

So although I update the curies by using your method, it has only DTD:5678 in the updated DTD and the updated DTD might be be in sync with the latest NodeSynonymizer and KG2C.

Will this be possible? I guess it might be a rare case. If we don't need to worry about this, then I think the method you proposed should be prefect, even for updating v2.3.4 DTD for kg2.5.2.

edeutsch commented 3 years ago

yes, crazy situations like this will happen but should be quite rare, and we will just have to live with them. Overlay will just need to accept and handle the case where DTD cannot provide an answer for a drug-disease pair because it wasn't present in that form when the DTD was built.

amykglen commented 3 years ago

alright, since it sounds like the 'refresh' approach will work for DTD, I've started a rebuild of KG2c.5.2 using node_synonymizer_v1.1.sqlite.

question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite? (in /translator/data/orangeboard/databases/KG2.5.2)

chunyuma commented 3 years ago

OK, since we decide to use node_synonymizer_v1.1.sqlite. I will use it to rebuild COHD database although I have already rebuilt it based on node_synonymizer.sqlite. Rebuilding COHD database is fast so it is not a big deal.

question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite?

I would suggest to name the current node_synonymizer.sqlite as node_synonymizer_bk.sqlite (backup) while name node_synonymizer_v1.1.sqlite as node_synonymizer.sqlite.

amykglen commented 3 years ago

alright, revised KG2c.5.2 made with the node_synonymizer_v1.1.sqlite is up at http://kg2c-5-2.rtx.ai:7474/browser/

confirmed acetaminophen now has a preferred curie of CHEMBL.COMPOUND:CHEMBL112!

edeutsch commented 3 years ago

question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite? (in /translator/data/orangeboard/databases/KG2.5.2)

@finnagin made some changes to NodeSynonmizer that I don't quite understand. There is a parameter on line 30 live="Production". I don't quite understand what a "live" is.

And yes, I just noticed that I commented out line 51 and put in a bypass of line 50, hard-coding it to node_synonymizer.sqlite. I did that mostly in haste trying to get it working in my dev environment, because as it was it was not working for me and I did not understand how to fix it. I did not intend to check it in.

I am currently running in my kg2.5integration area with the line 50 bypass and a sym link: lrwxrwxrwx 1 rt rt 68 Mar 20 04:58 node_synonymizer.sqlite -> /mnt/data/orangeboard/databases/KG2.5.2/node_synonymizer_v1.1.sqlite

So, I'm thinking that as part of the branch kg2.5integration merge and rollout, I need to understand how to use the "live" thing and get this tidied up.

It does seems like there is supposed to be some vX.X_KG2.Y.Y naming convention in databases/KG2.5.2/ but I guess I need help understanding it. It seems like we are only partially using and I don't understand it.

At one point I was under the impression that while master was using rtxconfigv2, the lg2.5integration branch was not yet using rtxconfigv2 yet or something.

Feel free to rename/tidy things up and explain. I assumed this would happen during kg2,5integration -> master.

amykglen commented 3 years ago

my understanding of the idea is that the synonymizer code should look at configv2.json to figure out the name of the synonymizer file it should be using. so, for example, if we:

change the 'node_synonymizer' slots in our copy of configv2.json (in the KG2.5.2 directory) to point to /translator/data/orangeboard/databases/KG2.5.2/node_synonymizer_v1.1.sqlite
make the synonymizer code look at configv2.json to determine that it should be using a file named node_synonymizer_v1.1.sqlite (I don't 100% understand @finnagin's method for achieving this, so perhaps he can help here)

I think that would get us into a working state.

or alternatively we could just leave the code as you have it for now (with the bypass) and just rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite. think that would also result in a working system.

(and to clarify - the kg2.5integration branch does use configv2.json vs. config.json now.)

in other news: the pytest suite is passing for me in kg2.5integration now! (note that to see this you have to locally force mode="RTXKG2" in Expand, since otherwise the KG2 API will be used, which doesn't yet have the code changes in kg2.5integration.)

so I believe all is done from my end... will we wait for the COHD database/a 'refreshed' DTD? or roll out without those?

finnagin commented 3 years ago

@amykglen @edeutsch I can explain the live thing. When Deqing made the RTXConfiguration class he made it so you can change the context by setting a string property called live. i.e.

RTXConfig = RTXConfiguration()
RTXConfig.live = "KG2"

will set all of the values like neo4j_bolt, etc... to match that which is listed under the KG2 section of the configv2.json file.

In this case I've set it to Production so it will take all the values from that section of configv2.json. The reason for this is that if you look at lines 59-60 you'll see that I've commented out the hard-coded database name and replaced it with the filename stored in configv2.json. That way whenever we change database versions in production all we need to do is update the configv2.json and it will automatically know where to look for the new file rather than having to update the hard-coded name every time.

Does that make sense?

amykglen commented 3 years ago

that makes sense to me. thanks! so when we update the path for a new database (like the latest curie_to_pmids_v1.1_KG2.5.2.sqlite), we need to do so in the KG2, KG2C, and Production sections? and maybe local as well? (I'm not sure when we would want to specify different database paths for these different sections?)

chunyuma commented 3 years ago

Hi @finnagin, could you please help me remove rel_max and map_txt from the automatic download list because they are not required for predict_drug_treats_disease to successfully run? Since I don't quite understand how the automatic download works, I'm afraid that it will affect something if I directly remove the corresponding lines in configv2.json file.

But notice that please don't delete map.txt and rel_max.emb.gz stored in arax.ncats.io because they are two required files for generating the slim version of DTD database. So just don't let them automatically be downloaded. Thank you!

chunyuma commented 3 years ago

OK, the v2.5.2 COHD database and the refreshed DTD model and database are ready and uploaded to arax.ncats.io:

The configv2.json file is also updated to point to these databases.

finnagin commented 3 years ago

@amykglen Yeah that would be good. At least in KG2, KG2C, and Production. I'm not actually sure where local gets used.

@chunyuma Yeah I'll maker an issue and remove those from the download manager and rtx configuration in master. I thought those were needed for something in ARAX to work.

amykglen commented 3 years ago

awesome! so I guess aside from the issue with the synonymizer's name, we're ready to roll out whenever? I think it should just be a matter of merging kg2.5integration into master, rolling master out to production (ARAX and KG2 API), and putting the new configv2.json into place. then the rest (database downloads) should take care of itself, I think.

finnagin commented 3 years ago

Yeah, I think so. The only thing I'd add is that it's probably a good idea to double check that all the slow tests pass if you haven't already.

amykglen commented 3 years ago

yeah, probably a good idea. the only weird thing about that is that I believe some of the tests can't pass until kg2.5integration is rolled out to the KG2 API. (since ARAX uses the KG2 API.) I had been locally forcing it to directly query neo4j instead of the KG2 API, but that isn't perfect due to intricacies of some of the slow tests...

RTXteam / RTX

Move ARAX to KG2.5.2 #1233