cbizon closed this issue 3 months ago.
Here's my proposed API:
/get_set_id
curie: multiple CURIEs to be normalized
conflation: multiple conflations to be applied (i.e. GeneProtein, DrugChemical)
Return value:
{
"curies": ["curie1", "curie2", "curie3", "curie4"],
"conflations": ["GeneProtein", "DrugChemical"],
"normalized_curies": ["curie1", "curie2_resolved", "curie3_and_4_resolved"],
"sha256hash": "2c1b0a2a03901aeb2fb84ec870647eaeb53a415e45489b9ff43a7df2e563667c"
}
Note that normalized_curies will be sorted and de-duplicated, so that an input set of identifiers that resolve to the same set of identifiers will also produce the same sha256hash. The user can then produce their own hash with the normalized_curies field or just use our sha256hash (which we will produce by concatenating the normalized CURIEs together, i.e. curie1||curie2_resolved||curie3_and_4_resolved).
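As a rough sketch of how the hash could be computed: this assumes the "||" in the example above is a literal separator and that the joined string is UTF-8 encoded before hashing. The function name set_hash is made up for illustration and is not necessarily what the endpoint uses.

```python
import hashlib


def set_hash(normalized_curies):
    """Return a SHA-256 hex digest that ignores input order and duplicates."""
    # Sort and de-duplicate so equivalent sets produce the same string.
    canonical = sorted(set(normalized_curies))
    # Assumption: "||" is a literal separator and the text is UTF-8 encoded.
    joined = "||".join(canonical)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

A client that wants its own hash can apply the same steps to the normalized_curies field, as long as it uses the same separator and encoding.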
For the GET request, making one query per set is probably fine. For the POST request, we will want to additionally support sending in multiple sets to be normalized at once, which will probably need something like:
{
"curies": {
"set1": ["curie1", "curie2", "curie3"],
"set2": ["curie1", "curie2", "curie4"]
}
}
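A possible shape for the multi-set handling, sketched under assumptions: normalize_sets and its normalize argument are hypothetical stand-ins (the real normalizer would do the CURIE lookup), and the response simply repeats the single-set fields once per set name. The "||" separator is likewise an assumption.

```python
import hashlib


def normalize_sets(payload, normalize=lambda curie: curie):
    """Build one result per named set in a {"curies": {name: [...]}} payload."""
    results = {}
    for set_name, curies in payload["curies"].items():
        # Normalize each CURIE, then sort and de-duplicate the results.
        normalized = sorted({normalize(c) for c in curies})
        # Assumption: "||" separator, UTF-8 encoding.
        digest = hashlib.sha256("||".join(normalized).encode("utf-8")).hexdigest()
        results[set_name] = {
            "curies": curies,
            "normalized_curies": normalized,
            "sha256hash": digest,
        }
    return results
```

Keying the response by the caller's set names lets the client match each hash back to the set it sent, without relying on ordering.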
In the future, we might want to make this reversible. We could do that by storing the generated set ID, but it might be easier to generate a reversible ID. The dumb way to do that would be to concatenate the normalized CURIEs together, but that could get very long. A better idea might be to create that string, gzip it, then dump it into a base64 string. Some experimentation would be necessary, but CURIEs have so many repeated elements that I'm optimistic that we could compress them quite a bit. However, we don't currently have a use case for reversible IDs and might never have one, so we shouldn't think about this unless it's pretty trivial to implement.
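For the experimentation, a minimal sketch of the gzip-then-base64 idea might look like this. The function names are hypothetical, the "||" separator is assumed never to appear inside a CURIE, and mtime=0 (Python 3.8+) is passed so the gzip header does not embed the current time.

```python
import base64
import gzip


def encode_set_id(normalized_curies):
    """Pack a set of CURIEs into a URL-safe, reversible string."""
    joined = "||".join(sorted(set(normalized_curies)))
    # mtime=0 keeps the timestamp out of the gzip header.
    compressed = gzip.compress(joined.encode("utf-8"), mtime=0)
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def decode_set_id(set_id):
    """Recover the sorted, de-duplicated CURIE list from a set ID."""
    compressed = base64.urlsafe_b64decode(set_id.encode("ascii"))
    joined = gzip.decompress(compressed).decode("utf-8")
    # An empty set encodes to an empty string, not a list with one empty item.
    return joined.split("||") if joined else []
```

Whether this ends up shorter than the plain concatenation depends on the set: gzip and base64 both add overhead, so small sets may come out longer.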
It might also be a good idea to stick a version number in the set ID.
Question: if we can't normalize an identifier, should we stick an (unknown) in there or should we just use the un-normalized CURIE?
I think the unnormalized CURIE; otherwise all unknowns will look alike.
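To illustrate the difference, here is a small sketch where a plain dict called lookup stands in for the normalizer (both names are hypothetical). Anything missing from the lookup is passed through unchanged rather than replaced by a shared placeholder.

```python
def normalize_with_fallback(curies, lookup):
    """Normalize CURIEs, keeping the original CURIE when no mapping exists."""
    # lookup.get(c, c) falls back to the un-normalized CURIE itself.
    return sorted({lookup.get(c, c) for c in curies})
```

With this fallback, two sets that differ only in which unknown identifier they contain still produce different normalized lists, and so different hashes. A shared (unknown) placeholder would collapse them into the same set.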
This has now been implemented in PR #274. I'll add another ticket to create a POST endpoint (https://github.com/TranslatorSRI/NodeNormalization/issues/283) and close this one.
@cbizon This is now deployed up to TEST (https://nodenorm.test.transltr.io/docs#/default/get_setid_get_setid_get) and will go to PROD with the rest of Fugu shortly. Please let me know if you or the other MCQ folks run into any issue with this!
The new TRAPI MCQ spec requires set creators to give the set an ID. It is generally desirable that the same set of things is identified in the same way. We can define an endpoint on NN that will take a set of identifiers, normalize them, and return some sort of an order-independent hash of the normalized identifiers.