Implement SetID endpoint

cbizon commented 6 months ago

The new TRAPI MCQ spec requires set creators to give the set an ID. It is generally desirable that the same set of things is identified in the same way. We can define an endpoint on NodeNorm (NN) that will take a set of identifiers, normalize them, and return some sort of order-independent hash of the normalized identifiers.

gaurav commented 4 months ago

Here's my proposed API:

GET /get_set_id
- curie: multiple CURIEs to be normalized
- conflation: multiple conflations to be applied (e.g. GeneProtein, DrugChemical)

Return value:

{
  "curies": ["curie1", "curie2", "curie3", "curie4"],
  "conflations": ["GeneProtein", "DrugChemical"],
  "normalized_curies": ["curie1", "curie2_resolved", "curie3_and_4_resolved"],
  "sha256hash": "2c1b0a2a03901aeb2fb84ec870647eaeb53a415e45489b9ff43a7df2e563667c"
}

Note that normalized_curies will be sorted and de-duplicated, so that input sets that resolve to the same set of identifiers will also produce the same sha256hash. The user can then produce their own hash from the normalized_curies field or just use our sha256hash (which we will produce by concatenating the normalized CURIEs together, e.g. curie1||curie2_resolved||curie3_and_4_resolved).
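
A minimal Python sketch of the hashing step described above, for anyone who wants to reproduce the hash client-side. It assumes the CURIEs have already been normalized and that "||" is a literal separator string (the proposal may only mean it as concatenation notation); set_hash is a hypothetical name, not the function used in NodeNorm.

```python
import hashlib


def set_hash(normalized_curies, separator="||"):
    """Return (sorted unique CURIEs, SHA-256 hex digest of them joined).

    Assumption: the separator is the literal string "||".
    """
    # Sorting and de-duplicating makes the result independent of input
    # order and of repeated identifiers.
    unique_sorted = sorted(set(normalized_curies))
    joined = separator.join(unique_sorted)
    return unique_sorted, hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two inputs that contain the same normalized identifiers in a different
# order, one of them with a duplicate, give the same result.
a = set_hash(["curie2_resolved", "curie1", "curie3_and_4_resolved", "curie3_and_4_resolved"])
b = set_hash(["curie3_and_4_resolved", "curie2_resolved", "curie1"])
assert a == b
```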

For the GET request, making one query per set is probably fine. For the POST request, we will want to additionally support sending in multiple sets to be normalized at once, which will probably need something like:

{
  "curies": {
    "set1": ["curie1", "curie2", "curie3"],
    "set2": ["curie1", "curie2", "curie4"]
  }
}
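
One way the multi-set POST body above could be processed, sketched in Python. Everything here is an assumption for illustration: set_ids_for_request is a hypothetical helper, the per-set response shape simply mirrors the GET return value, "||" is treated as a literal separator, and normalize stands in for the real normalization lookup.

```python
import hashlib


def set_ids_for_request(body, normalize):
    """Compute one result per named set in a POST body shaped like the one above.

    `normalize` is a stand-in callable mapping one CURIE to its
    normalized form.
    """
    results = {}
    for name, curies in body["curies"].items():
        normalized = sorted({normalize(c) for c in curies})
        digest = hashlib.sha256("||".join(normalized).encode("utf-8")).hexdigest()
        results[name] = {
            "curies": curies,
            "normalized_curies": normalized,
            "sha256hash": digest,
        }
    return results


# Toy normalizer: pretend curie3 and curie4 resolve to the same identifier.
toy_table = {"curie3": "curie3_and_4_resolved", "curie4": "curie3_and_4_resolved"}
body = {
    "curies": {
        "set1": ["curie1", "curie2", "curie3"],
        "set2": ["curie1", "curie2", "curie4"],
    }
}
results = set_ids_for_request(body, lambda c: toy_table.get(c, c))
# Under the toy table, both sets normalize to the same identifiers and
# therefore share a hash.
assert results["set1"]["sha256hash"] == results["set2"]["sha256hash"]
```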

gaurav commented 4 months ago

In the future, we might want to make this reversible. We could do that by storing the generated set ID, but it might be easier to generate a reversible ID. The dumb way to do that would be to concatenate the normalized CURIEs together, but that could get very long. A better idea might be to create that string, gzip it, then dump it into a base64 string. Some experimentation would be necessary, but CURIEs have so many repeated elements that I'm optimistic that we could compress them quite a bit. However, we don't currently have a use case for reversible IDs and might never have one, so we shouldn't think about this unless it's pretty trivial to implement.

It might also be a good idea to stick a version number in the set ID.
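
A rough Python sketch of the reversible-ID idea (join, gzip, base64) with a version prefix, to show how small the implementation could be. The names, the "v1" prefix, the ":" delimiter, and the newline separator are all assumptions made for this example; nothing like this exists in NodeNorm as far as this thread says. It also assumes CURIEs never contain a newline.

```python
import base64
import gzip

VERSION = "v1"  # hypothetical version tag


def encode_set_id(normalized_curies):
    """Pack a set of normalized CURIEs into a versioned, reversible string."""
    joined = "\n".join(sorted(set(normalized_curies)))
    # mtime=0 keeps the gzip header free of a timestamp, so the same
    # input gives the same bytes on repeated calls.
    compressed = gzip.compress(joined.encode("utf-8"), mtime=0)
    payload = base64.urlsafe_b64encode(compressed).decode("ascii")
    return f"{VERSION}:{payload}"


def decode_set_id(set_id):
    """Recover the sorted list of CURIEs from an ID made by encode_set_id."""
    version, _, payload = set_id.partition(":")
    if version != VERSION:
        raise ValueError(f"unsupported set ID version: {version!r}")
    text = gzip.decompress(base64.urlsafe_b64decode(payload)).decode("utf-8")
    return text.split("\n") if text else []


set_id = encode_set_id(["curie2_resolved", "curie1", "curie1"])
assert decode_set_id(set_id) == ["curie1", "curie2_resolved"]
```

For very short sets the gzip header overhead can make the ID longer than the plain concatenation, and the compressed bytes may differ between zlib versions, so IDs of this kind are safer to compare after decoding. Both points would be part of the experimentation mentioned above.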

gaurav commented 4 months ago

Question: if we can't normalize an identifier, should we stick an (unknown) in there or should we just use the un-normalized CURIE?

cbizon commented 3 months ago

I think the unnormalized CURIE; otherwise all unknowns will look alike.
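
To illustrate the point about unknowns looking alike, here is a small Python sketch. normalize_or_keep and the lookup table are hypothetical stand-ins for the real normalization step.

```python
def normalize_or_keep(curie, lookup):
    """Return the normalized CURIE, or the input unchanged if it is unknown."""
    preferred = lookup.get(curie)
    return preferred if preferred is not None else curie


lookup = {"curie2": "curie2_resolved"}  # toy normalization table

set_a = sorted({normalize_or_keep(c, lookup) for c in ["curie2", "unknown:1"]})
set_b = sorted({normalize_or_keep(c, lookup) for c in ["curie2", "unknown:2"]})
# Keeping the original CURIE keeps the two sets distinct.
assert set_a != set_b

# With a shared "(unknown)" placeholder, the two sets would collapse
# into one and get the same set ID.
placeholder_a = sorted({lookup.get(c, "(unknown)") for c in ["curie2", "unknown:1"]})
placeholder_b = sorted({lookup.get(c, "(unknown)") for c in ["curie2", "unknown:2"]})
assert placeholder_a == placeholder_b
```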

gaurav commented 3 months ago

This has now been implemented in PR #274. I'll add another ticket to create a POST endpoint (https://github.com/TranslatorSRI/NodeNormalization/issues/283) and close this one.

gaurav commented 3 months ago

@cbizon This is now deployed up to TEST (https://nodenorm.test.transltr.io/docs#/default/get_setid_get_setid_get) and will go to PROD with the rest of Fugu shortly. Please let me know if you or the other MCQ folks run into any issue with this!
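
For anyone trying this out, a small Python sketch that builds a request URL for the TEST instance. The /get_setid path is inferred from the docs anchor in the link above, and the repeated curie and conflation query parameters are taken from the original proposal, so both are assumptions to check against the docs page. build_setid_url is a hypothetical helper and the CURIEs are arbitrary example inputs. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://nodenorm.test.transltr.io"


def build_setid_url(curies, conflations=()):
    """Build a GET URL with one `curie` parameter per identifier.

    Assumption: the endpoint lives at /get_setid and accepts repeated
    `curie` and `conflation` query parameters.
    """
    params = [("curie", c) for c in curies]
    params += [("conflation", c) for c in conflations]
    return f"{BASE_URL}/get_setid?{urlencode(params)}"


url = build_setid_url(["MESH:D014867", "NCIT:C34373"], ["GeneProtein"])
```

The resulting URL can be passed to any HTTP client; the colon in each CURIE is percent-encoded by urlencode.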

TranslatorSRI/NodeNormalization, issue #256