Closed amykglen closed 3 years ago
ok, @chunyuma - a new KG2c built from KG2.5.1 is up at: http://kg2canonicalized2.rtx.ai:7474/browser/
note that these node properties have been renamed like so:
preferred_type
-> category
types
-> all_categories
Thanks @amykglen!
Tagging @kvarforl to also read this issue, as a part of cross-training on the ARAX system and its operations/deployment. In particular, it would be good for @kvarforl to get up-to-speed on how to build KG2c (and verify that the build was successful) and the NodeSynonymizer database.
Generated the csvs for KG2C looks like it has ~33,000 rows in the allowed predicate triples.
rebuild of kg2c.5.2 with the tweaked NodeSynonymizer is done, and things seem to look good! looks like the category_manager changes @edeutsch made for conflation of ancestors and such were successful, maybe with one very minor exception - just a handful of nodes remain that are "ChemicalSubstance"s but are missing "Drug" in their expanded_categories:
match (n:`biolink:ChemicalSubstance`) where not "biolink:Drug" in n.expanded_categories return n.id, n.name, n.expanded_categories
n.id | n.name | n.expanded_categories |
---|---|---|
"ttd.target:CAR-T Cells targeting Mesothelin" | "CAR-T Cells targeting Mesothelin" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T cells targeting Mesothelin" | "CAR-T cells targeting Mesothelin" | ["biolink:ChemicalSubstance"] |
"ttd.target:CD19-targeting CAR T cells" | "CD19-targeting CAR T cells" | ["biolink:ChemicalSubstance"] |
"ttd.target:CD19 CAR-T cells" | "CD19 CAR-T cells" | ["biolink:ChemicalSubstance"] |
"ttd.target:CD19 CAR T cells" | "CD19 CAR T cells" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T Cells targeting EpCAM" | "CAR-T Cells targeting EpCAM" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T cells targeting CEA" | "CAR-T cells targeting CEA" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T Cells targeting CD19" | "CAR-T Cells targeting CD19" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T Cells targeting PSCA" | "CAR-T Cells targeting PSCA" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T cells targeting Muc1" | "CAR-T cells targeting Muc1" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-GPC3 T Cells" | "CAR-GPC3 T Cells" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T cells targeting HER2" | "CAR-T cells targeting HER2" | ["biolink:ChemicalSubstance"] |
"ttd.target:CAR-T cells targeting MucI" | "CAR-T cells targeting MucI" | ["biolink:ChemicalSubstance"] |
I think this definitely doesn't need to stop us from continuing with this roll out (such a small number of affected nodes), but reporting these here for @edeutsch.
I'm getting close to having the pytest suite passing again in kg2.5integration
- will likely reach out to a couple people shortly about some lingering issues.
awesome news, @amykglen ! Looking into this a little, it looks to me that NodeSynonymizer get_equivalent_nodes(curies=curies,return_all_categories=True) appears to be doing the right thing:
{
"ttd.target:CAR-T Cells targeting Mesothelin": {
"all_categories": {
"biolink:ChemicalSubstance": 1
},
"expanded_categories": {
"biolink:BiologicalEntity": true,
"biolink:ChemicalSubstance": true,
"biolink:Drug": true,
"biolink:MolecularEntity": true,
"biolink:NamedThing": true
},
"ttd.target:CAR-T Cells targeting Mesothelin": "biolink:ChemicalSubstance"
}
}
So I'm puzzled why KG2C isn't showing that..
But one slightly alarming thing I notice is that all these CURIEs all have spaces in them! Is that allowed? I vaguely recall writing code in the past assuming that CURIEs don't have space in them.
I wonder if this space is somehow related?
huh, very weird! I'll have to look more into why those aren't making it into KG2c... not seeing an obvious reason.
hmm, ok, think I've narrowed it down here. so when I send only the curie 'ttd.target:CAR-T Cells targeting Mesothelin' to get_canonical_curies() like you did, I get the same response as you. but when I send that curie PLUS one of its equivalent curies (the same but with lowercase 'cells'), it then claims it doesn't recognize 'ttd.target:CAR-T Cells targeting Mesothelin'. i.e., doing:
curies = ['ttd.target:CAR-T Cells targeting Mesothelin', 'ttd.target:CAR-T cells targeting mesothelin']
print(synonymizer.get_canonical_curies(curies, return_all_categories=True))
prints out:
{
"ttd.target:CAR-T Cells targeting Mesothelin":"None",
"ttd.target:CAR-T cells targeting mesothelin":{
"preferred_curie":"ttd.target:CAR-T Cells targeting Mesothelin",
"preferred_name":"CAR-T Cells targeting Mesothelin",
"preferred_category":"biolink:ChemicalSubstance",
"expanded_categories":{
"biolink:ChemicalSubstance":true,
"biolink:Drug":true,
"biolink:MolecularEntity":true,
"biolink:BiologicalEntity":true,
"biolink:NamedThing":true
},
"all_categories":{
"biolink:ChemicalSubstance":1
}
}
}
so because 'canonical info' isn't returned for 'ttd.target:CAR-T Cells targeting Mesothelin', KG2c defaults to saying that all_categories and expanded_categories must just contain the node's category from KG2. (the reasoning behind having this logic in place was mainly because some nodes in KG2 don't have names, and thus wouldn't be recognized by the synonymizer, but nonetheless should appear in KG2c).
perhaps related, these curies also seem to be affected by an issue I think I've reported before (#1074):
when I do
curies = ['ttd.target:CAR-T Cells targeting Mesothelin', 'ttd.target:CAR-T cells targeting mesothelin']
print(synonymizer.get_equivalent_nodes(curies))
this is what's printed
{
"ttd.target:CAR-T Cells targeting Mesothelin":"None",
"ttd.target:CAR-T cells targeting mesothelin":{
"ttd.target:CAR-T Cells targeting Mesothelin":"biolink:ChemicalSubstance"
}
}
so you can see the curie with the uppercase 'Cells' is listed as an equivalent curie for the curie with the lowercase 'cells', but not vice versa.
ok, code changes are largely finished in kg2.5integration
, but a few more tests are still failing (think these belong to @chunyuma and @finnagin ?):
FAILED test_ARAX_expand.py::test_COHD_expand_chi_square - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_expand.py::test_DTD_expand_1 - AssertionError: assert ('ERROR' == 'OK'
FAILED test_ARAX_filter_kg.py::test_error - AssertionError: assert 'OK' == 'ERROR'
FAILED test_ARAX_overlay.py::test_FET_ex1 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_example_3 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_FET_ranking - AssertionError: assert 'ERROR' == 'OK'
maybe some of these should be failing, since we decided to move forward without waiting for DTD? but would be great if @chunyuma and @finnagin could take a look at any tests of theirs in this list and push fixes as appropriate to kg2.5integration
.
(note that in order to reproduce these, you will likely need to be using the configv2.json
and node_synonymizer.sqlite
that are in /translator/data/orangeboard/databases/KG2.5.2
- that's what I'm using)
Looks like test_ARAX_filter_kg.py::test_error
is failing because the edge predicate that is being returned in message.knowledge_graph
is contraindicated_for
rather than biolink:contraindicated_for
so the filter does not remove the edge it is supposed to. Is this something that is expected? I thought we were supposed to be only working with biolink:contraindicated_for
now that we are switched to KG2.5.2.
For reference the query is:
add_qnode(name=DOID:1227, key=n00)
add_qnode(category=biolink:ChemicalSubstance, key=n01)
add_qedge(subject=n00, object=n01, key=e00)
expand(edge_key=e00, kp=ARAX/KG1)
resultify(ignore_edge_direction=true)
return(message=true, store=false)
It looks like it is correct in kg2canonicalized2
not sure why this is returning predicated without the biolink: prefix. Maybe something wrong with my setup? I'm using the configv2.json in /translator/data/orangeboard/databases/KG2.5.2. Could you try running that query and see if you also get contraindicated_for
edges without the biolink prefix back?
Lol nevermind just realized it's expanding on KG1 which is why this is returning the non biolink prefix edges. Though this brings up an interesting question: Do we want to have a patch that converts the results of KG1 expand calls or are we in the process of phasing out KG1 anyways so it's not worth it?
I think we just discontinue use of KG1 now. So I suggest changing all tests away from KG1 and we turn it off eventually.
Yeah, that makes sense. From a quick grep it looks like ARAX/KG1 is use serveral times throughout test_ARAX_expand, test_ARAX_resultify, test_ARAX_workplows, test_ARAX_filter_kg, test_ARAX_filter_results, and test_ARAX_overlay. This may take a bit of refactoring. Do we want to make a separate issue for converting all of these?
@finnagin I could go either way with making a new ticket for the KG1 in the tests: it blocks this issue, but I imagine a simple sed -i
or manually replace of KG1->KG2 should work (and a slight adjustment of the assert
s. Do you think more refactoring than that will be required?
I would vote for creating a separate issue for that; doesn't need to hold up this issue. I think it would likely break a lot of tests (since it changes their output), and it's also going to slow the test suite waayy down. may be smart to wait until we have a faster system for KG2c in place.
@dkoslicki I do think more refactoring will likely be required. As @amykglen said we will likely need to adjust a few tests to speed them up. For example I was just changing test_warning in test_ARAX_filter_kg.py and it is now incredibly slow as it uses overlay(action=add_node_pmids)
which apparently hits NCBIeutils and thus takes a long time with the number of results returned.
@amykglen I was able to fix test_ARAX_workflows.py::test_example_3
by changing it to use KG2 instead of KG1 however now it takes ~218s to complete so I have added a pytest slow marker to it. Does this seem reasonable for now while we wait for a faster KG2c system?
Regarding this from @amykglen
this is what's printed
{ "ttd.target:CAR-T Cells targeting Mesothelin":"None", "ttd.target:CAR-T cells targeting mesothelin":{ "ttd.target:CAR-T Cells targeting Mesothelin":"biolink:ChemicalSubstance" } }
so you can see the curie with the uppercase 'Cells' is listed as an equivalent curie for the curie with the lowercase 'cells', but not vice versa.
I understand why this is happening. There is an underlying assumption in NodeSynonymizer that CURIEs can be treated as case insensitive. Here we have two CURIEs that differ only in case. @saramsey already flagged this as a potential bug? I'm thinking that NodeSynonymizer code doesn't need to change, and this will no longer be a problem if we can resolve these weird CURIEs.
thoughts?
@finnagin - that makes good sense to me!
ok, code changes are largely finished in
kg2.5integration
, but a few more tests are still failing (think these belong to @chunyuma and @finnagin ?):FAILED test_ARAX_expand.py::test_COHD_expand_chi_square - AssertionError: assert ('ERROR' == 'OK' FAILED test_ARAX_expand.py::test_DTD_expand_1 - AssertionError: assert ('ERROR' == 'OK' FAILED test_ARAX_filter_kg.py::test_error - AssertionError: assert 'OK' == 'ERROR' FAILED test_ARAX_overlay.py::test_FET_ex1 - AssertionError: assert 'ERROR' == 'OK' FAILED test_ARAX_workflows.py::test_example_3 - AssertionError: assert 'ERROR' == 'OK' FAILED test_ARAX_workflows.py::test_FET_ranking - AssertionError: assert 'ERROR' == 'OK'
maybe some of these should be failing, since we decided to move forward without waiting for DTD? but would be great if @chunyuma and @finnagin could take a look at any tests of theirs in this list and push fixes as appropriate to
kg2.5integration
.(note that in order to reproduce these, you will likely need to be using the
configv2.json
andnode_synonymizer.sqlite
that are in/translator/data/orangeboard/databases/KG2.5.2
- that's what I'm using)
Hi @amykglen, I fixed most of the failing tests you posted. Only one can't be fixed right now until the new DTD model and database are re-built is test_ARAX_expand.py::test_DTD_expand_1
.
"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(name=Sotos syndrome, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"
In this case, it expands a virtual edge from acetaminophen
to Sotos syndrome
based on DTD probability > 0. However, the preferred curie of Acetaminophen
is RXNORM:161
based on Nodesynonymizer v2.5.2. However, in v2.3.4, the preferred curie of ACETAMINOPHEN
is CHEMBL.COMPOUND:CHEMBL112
rather than RXNORM:161
and RXNORM:161
is just one of its equivalent_curies. So the old DTD model doesn't contain this curie id and thus it can't calculate a probability. But this should be fixed after the new DTD model is re-built.
@amykglen, Please help me double check if they are really fixed in your side.
Note that RXNORM:161 was not intended as the leader of the Acetaminophen group. This was due to a bug, which I believe I have just fixed. A new NodeSynonymizer database is currently being rebuilt and should be done shortly. I will NOT destroy the previous one yet in databases/KG2.5.2.
A relevant question is what to do with it. Depending on where you all are in your process, you can choose to use the new one or not. There are known problems with it that will take a while to fix and so it is useful to consider how to apply those fixes.
One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?
One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?
I think this is a great idea! another idea I had was that perhaps DTD could sort of have its own 'preferred' curies - e.g., maybe it chooses to always use the curie that comes first when you alphabetize the list of equivalent curies; then it wouldn't really matter to DTD what the synonymizer says the preferred curie is, it could just grab the equivalent curies for the input node, find the one that comes first alphabetically, and look that up in the database?
One idea that occurred to me is: if the DTD database is sensitive to the leader of each group (CHEMBL.COMPOUND:CHEMBL112 vs RXNORM:161) perhaps there could be a process that loops through all entities in an existing DTD database, and updates the leader CURIE based on a revised NodeSynonymizer database?
I think this is a good idea to make the v2.3.4 DTD database temporally compatible with kg2.5.2 before the new DTD database is re-built which needs around 2 weeks. Once the revised NodeSynonymizer database is built, I can do this.
great! and you could also run the same script on the 2.5.2 DTD database after it finishes building, right, @chunyuma? (since this latest synonymizer will have some differences in preferred curies vs. the one DTD is currently being built off of)
But I think this might still have some cases that some preferred curies in kg2.5.2c are missing in v2.3.4 DTD database because currently the curie ids stored in v2.3.4 DTD database are preferred curies based on the v2.3.4 NodeSynonymizer. So if we use the idea proposed by @edeutsch, it might still miss some curies whose preferred curies based on v2.5.2 NodeSynonymizer are not the preferred curies based on v2.3.4 NodeSynonymizer.
I'm thinking that you loop through all entities in the DTD and ask the latest NodeSynonymizer "what is the preferred curie today for the CURIE in DTD?" and if that answer is different than in DTD, then update it. It shouldn't matter too much if it was preferred or not before?
The problem comes when there is splitting or coalescing of clusters. that will happen. rarely but will happen. I propose we just do the best we can.
So as of now, there is in /mnt/data/orangeboard/databases/KG2.5.2/
-rw-r--r-- 1 giovanmi bauerga 16602951680 Mar 13 05:07 node_synonymizer.sqlite
-rw-r--r-- 1 giovanmi bauerga 16607158272 Mar 11 18:58 node_synonymizer.sqlite-obsolete
-rw-r--r-- 1 rt rt 16667919360 Mar 20 04:49 node_synonymizer_v1.1.sqlite
I propose we delete the "obsolete" file. I don't know what it is now any more.
The one I just created and propose we use going forward if possible is v1.1.
DTD building can continue with the previous version if you can write a "refreshing" system to refresh when you're done. By the time you're done, there may be even another update.
great! and you could also run the same script on the 2.5.2 DTD database after it finishes building, right, @chunyuma? (since this latest synonymizer will have some differences in preferred curies vs. the one DTD is currently being built off of)
But the problem I just mentioned might still exist. Since currently I'm building the DTD model based on the current version of synonymizer which only considered the preferred curies, if there is a curie which is not the preferred curie based on the current version of synonymizer become a preferred curie after @edeutsch revise the synonymizer. Even though I run the same script on the 2.5.2 DTD database, this curie will be still missing in DTD database.
I'm thinking that you loop through all entities in the DTD and ask the latest NodeSynonymizer "what is the preferred curie today for the CURIE in DTD?" and if that answer is different than in DTD, then update it. It shouldn't matter too much if it was preferred or not before?
The problem comes when there is splitting or coalescing of clusters. that will happen. rarely but will happen. I propose we just do the best we can.
Yes, @edeutsch I think your idea is pretty good idea and it is the best way that we can do. But what I just said is that if one curie is not a preferred curie based on v2.3.4 NodeSynonymizer but becomes a preferred curie in v2.5.2. This curie will still miss although I loop through all entities by using the latest version of NodeSynonymizer.
So as of now, there is in /mnt/data/orangeboard/databases/KG2.5.2/
-rw-r--r-- 1 giovanmi bauerga 16602951680 Mar 13 05:07 node_synonymizer.sqlite -rw-r--r-- 1 giovanmi bauerga 16607158272 Mar 11 18:58 node_synonymizer.sqlite-obsolete -rw-r--r-- 1 rt rt 16667919360 Mar 20 04:49 node_synonymizer_v1.1.sqlite
I propose we delete the "obsolete" file. I don't know what it is now any more.
The one I just created and propose we use going forward if possible is v1.1.
DTD building can continue with the previous version if you can write a "refreshing" system to refresh when you're done. By the time you're done, there may be even another update.
No problem! Once the new DTD is done, I can use the latest version of nodesynonymizer to refresh it. If the difference of nodesynonymizer is minor, it should not affect too much.
Yes, @edeutsch I think your idea is pretty good idea and it is the best way that we can do. But what I just said is that if one curie is not a preferred curie based on v2.3.4 NodeSynonymizer but becomes a preferred curie in v2.5.2. This curie will still miss although I loop through all entities by using the latest version of NodeSynonymizer.
Hmm, I think I don't understand your concern. For each and every curie in any version of DTD, today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. The only problem comes when there is merging or splitting or radical changes in category. which will happen, but hopefully rarely.
Hmm, I think I don't understand your concern. For each and every curie in any version of DTD, today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. The only problem comes when there is merging or splitting or radical changes in category. which will happen, but hopefully rarely.
Right, I know that today's NodeSynonymizer will be able to tell you what today's preferred curie for that curie is. But when I build DTD model and DTD database, I only used the curies ids of kg2.5.2c (which are the preferred curies) but not kg2.5.2. This means that the DTD model and database only have the preferred curies based on today's NodeSynonymizer. Assume for the future version of NodeSynonymizer, it tells me that one curie in today's kg2.5.2 which is not considered as a preferred curie based on the today's NodeSynonymizer (this means that this curie is not in DTD model and database) but becomes a preferred curie. Does this make sense?
Here is an example:
'DOID:5849' is not a curie id of today's kg2.5.2c but is in the equivalent_curies of 'MONDO:000367' in kg2.5.2c. Assume in the revised NodeSynonymizer, it becomes a preferred curie. In this case, although I loop through all entities in the DTD by using the revised NodeSynonymizer, it will still not exist in DTD because currently this curie is not in DTD.
Ah, maybe the confusion is my assumption that NodeSynonymzer and KG2C are always in sync.
so this is a good example. Let's say that your already-built DTD has MONDO:000367 as a disease. Tomorrow I create a NodeSynonymizer where DOID:5849 is the preferred CURIE. And the KG2C is rebuilt with this new preferred curie. NodeSynonymizer and KG2C are in sync. but DTD is out of sync.
So the you run on DTD:
for current_curie in DTD.curies:
potential_new_curie = synonymizer.get_canonical_curie(current_curie)
if current_cure != potential_new_curie:
DTD.execute(f"UPDATE curies SET curie = '{potential_new_curie}' WHERE curie = '{current_curie}'")
by doing this, your MONDO:000367 is updated to DOID:5849 and DTD is once again in sync with the latest NodeSynonymizer and KG2C.
Hi @edeutsch, I see your point. But the situation that I concerned is like this although I don't know if it will happen:
Let's say we have three curie ids: MONODO:1234, DTD:1234, DTD:5678.
Currently, MONODO:1234 is the synonym of DTD:1234 and MONODO:1234 is a preferred curie but DTD:1234 is not. My already-built DTD has MONODO:1234 as a disease. DTD:5678 is also a preferred curie but currently is not considered as a synonym of MONODO:1234 or DTD:1234. Tomorrow you create a NodeSynonymizer where MONODO:1234 becomes a synonym of DTD:5678 while DTD:5678 is still a preferred curie but MONODO:1234 is not anymore. But DTD:1234 becomes a preferred curie.
So although I update the curies by using your method, it has only DTD:5678 in the updated DTD and the updated DTD might be be in sync with the latest NodeSynonymizer and KG2C.
Will this be possible? I guess it might be a rare case. If we don't need to worry about this, then I think the method you proposed should be prefect, even for updating v2.3.4 DTD for kg2.5.2.
yes, crazy situations like this will happen but should be quite rare, and we will just have to live with them. Overlay will just need to accept and handle the case where DTD cannot provide an answer for a drug-disease pair because it wasn't present in that form when the DTD was built.
alright, since it sounds like the 'refresh' approach will work for DTD, I've started a rebuild of KG2c.5.2 using node_synonymizer_v1.1.sqlite
.
question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite
. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite
--> node_synonymizer.sqlite
? (in /translator/data/orangeboard/databases/KG2.5.2
)
OK, since we decide to use node_synonymizer_v1.1.sqlite
. I will use it to rebuild COHD database although I have already rebuilt it based on node_synonymizer.sqlite
. Rebuilding COHD database is fast so it is not a big deal.
question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite?
I would suggest to name the current node_synonymizer.sqlite
as node_synonymizer_bk.sqlite
(backup) while name node_synonymizer_v1.1.sqlite
as node_synonymizer.sqlite
.
alright, revised KG2c.5.2 made with the node_synonymizer_v1.1.sqlite
is up at http://kg2c-5-2.rtx.ai:7474/browser/
confirmed acetaminophen now has a preferred curie of CHEMBL.COMPOUND:CHEMBL112!
question about naming of the node synonymizer file for @edeutsch: it seems that the synonymizer code expects the sqlite file to be called node_synonymizer.sqlite. is that correct? and are you planning to keep it that way? in which case we would need to rename node_synonymizer_v1.1.sqlite --> node_synonymizer.sqlite? (in /translator/data/orangeboard/databases/KG2.5.2)
@finnagin made some changes to NodeSynonmizer that I don't quite understand. There is a parameter on line 30 live="Production". I don't quite understand what a "live" is.
And yes, I just noticed that I commented out line 51 and put in a bypass of line 50, hard-coding it to node_synonymizer.sqlite. I did that mostly in haste trying to get it working in my dev environment, because as it was it was not working for me and I did not understand how to fix it. I did not intend to check it in.
I am currently running in my kg2.5integration area with the line 50 bypass and a sym link: lrwxrwxrwx 1 rt rt 68 Mar 20 04:58 node_synonymizer.sqlite -> /mnt/data/orangeboard/databases/KG2.5.2/node_synonymizer_v1.1.sqlite
So, I'm thinking that as part of the branch kg2.5integration merge and rollout, I need to understand how to use the "live" thing and get this tidied up.
It does seems like there is supposed to be some vX.X_KG2.Y.Y naming convention in databases/KG2.5.2/ but I guess I need help understanding it. It seems like we are only partially using and I don't understand it.
At one point I was under the impression that while master was using rtxconfigv2, the lg2.5integration branch was not yet using rtxconfigv2 yet or something.
Feel free to rename/tidy things up and explain. I assumed this would happen during kg2,5integration -> master.
my understanding of the idea is that the synonymizer code should look at configv2.json
to figure out the name of the synonymizer file it should be using. so, for example, if we:
configv2.json
(in the KG2.5.2 directory) to point to /translator/data/orangeboard/databases/KG2.5.2/node_synonymizer_v1.1.sqlite
configv2.json
to determine that it should be using a file named node_synonymizer_v1.1.sqlite (I don't 100% understand @finnagin's method for achieving this, so perhaps he can help here)I think that would get us into a working state.
or alternatively we could just leave the code as you have it for now (with the bypass) and just rename node_synonymizer_v1.1.sqlite
--> node_synonymizer.sqlite
. think that would also result in a working system.
(and to clarify - the kg2.5integration
branch does use configv2.json
vs. config.json
now.)
in other news: the pytest suite is passing for me in kg2.5integration
now! (note that to see this you have to locally force mode="RTXKG2"
in Expand, since otherwise the KG2 API will be used, which doesn't yet have the code changes in kg2.5integration
.)
so I believe all is done from my end... will we wait for the COHD database/a 'refreshed' DTD? or roll out without those?
@amykglen @edeutsch I can explain the live thing. When Deqing made the RTXConfiguration class he made it so you can change the context by setting a string property called live. i.e.
RTXConfig = RTXConfiguration()
RTXConfig.live = "KG2"
will set all of the values like neo4j_bolt, etc... to match that which is listed under the KG2 section of the configv2.json file.
In this case I've set it to Production so it will take all the values from that section of configv2.json. The reason for this is that if you look at lines 59-60 you'll see that I've commented out the hard-coded database name and replaced it with the filename stored in configv2.json. That way whenever we change database versions in production all we need to do is update the configv2.json and it will automatically know where to look for the new file rather than having to update the hard-coded name every time.
Does that make sense?
that makes sense to me. thanks! so when we update the path for a new database (like the latest curie_to_pmids_v1.1_KG2.5.2.sqlite
), we need to do so in the KG2
, KG2C
, and Production
sections? and maybe local
as well? (I'm not sure when we would want to specify different database paths for these different sections?)
Hi @finnagin, could you please help me remove rel_max
and map_txt
from the automatic download list because they are not required for predict_drug_treats_disease
to successfully run? Since I don't quite understand how the automatic download works, I'm afraid that it will affect something if I directly remove the corresponding lines in configv2.json
file.
But notice that please don't delete map.txt
and rel_max.emb.gz
stored in arax.ncats.io
because they are two required files for generating the slim version of DTD database. So just don't let them automatically be downloaded. Thank you!
OK, the v2.5.2 COHD database and the refreshed DTD model and database are ready and uploaded to arax.ncats.io
:
The configv2.json
file is also updated to point to these databases.
@amykglen Yeah that would be good. At least in KG2
, KG2C
, and Production
. I'm not actually sure where local gets used.
@chunyuma Yeah I'll maker an issue and remove those from the download manager and rtx configuration in master. I thought those were needed for something in ARAX to work.
awesome! so I guess aside from the issue with the synonymizer's name, we're ready to roll out whenever? I think it should just be a matter of merging kg2.5integration
into master
, rolling master
out to production (ARAX and KG2 API), and putting the new configv2.json
into place. then the rest (database downloads) should take care of itself, I think.
Yeah, I think so. The only thing I'd add is that it's probably a good idea to double check that all the slow tests pass if you haven't already.
yeah, probably a good idea. the only weird thing about that is that I believe some of the tests can't pass until kg2.5integration
is rolled out to the KG2 API. (since ARAX uses the KG2 API.) I had been locally forcing it to directly query neo4j instead of the KG2 API, but that isn't perfect due to intricacies of some of the slow tests...
documentation about what things will need updating is at: https://github.com/RTXteam/RTX/wiki/Deployment-info#things-that-need-updating-when-rolling-out-a-new-kg2-version
rebuild/edit these files and put them in
/translator/data/orangeboard/databases/KG2.5.2
on arax.ncats.io:configv2.json
node_synonymizer.sqlite
)curie_to_pmids.sqlite
)COHDdatabase_vX.X.db
) @chunyumaLogModel.pkl
,GRAPH.sqlite
) @chunyumaDTD_probability_database.db
) @chunyumaNote: As databases are rebuilt, the new copy of configv2.json will need to be updated to point to their new paths.
other to do's:
kg2.5integration
branch)kg2.5integration
branch @finnaginfinally: