TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

Implement SetID endpoint #256

Open cbizon opened 2 months ago

cbizon commented 2 months ago

The new TRAPI MCQ spec requires set creators to give the set an ID. It is generally desirable that the same set of things is always identified in the same way. We can define an endpoint on NN that will take a set of identifiers, normalize them, and return some sort of order-independent hash of the normalized identifiers.

gaurav commented 4 days ago

Here's my proposed API:

Return value:

{
  "curies": ["curie1", "curie2", "curie3", "curie4"],
  "conflations": ["GeneProtein", "DrugChemical"],
  "normalized_curies": ["curie1", "curie2_resolved", "curie3_and_4_resolved"],
  "sha256hash": "2c1b0a2a03901aeb2fb84ec870647eaeb53a415e45489b9ff43a7df2e563667c"
}

Note that normalized_curies will be sorted and de-duplicated, so any two input sets that resolve to the same set of normalized identifiers will produce the same sha256hash. The user can then compute their own hash from the normalized_curies field or just use our sha256hash, which we will produce by concatenating the normalized CURIEs together (i.e. curie1||curie2_resolved||curie3_and_4_resolved).
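A minimal Python sketch of the hashing step described above. The function name `set_id_hash` is hypothetical, and it assumes the CURIEs have already been normalized. The `||` delimiter is taken literally from the description, and UTF-8 encoding is an assumption:

```python
import hashlib


def set_id_hash(normalized_curies, delimiter="||"):
    """Return (sorted unique CURIEs, SHA-256 hex digest of their concatenation).

    Hypothetical helper: the input is assumed to be already-normalized CURIEs.
    """
    # Sort and de-duplicate so input order and repeats do not affect the hash.
    unique_sorted = sorted(set(normalized_curies))
    # Assumption: join with "||" and encode as UTF-8 before hashing.
    joined = delimiter.join(unique_sorted)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return unique_sorted, digest
```

With this shape, `set_id_hash(["b", "a", "b"])` and `set_id_hash(["a", "b"])` return the same list and the same digest, which is the order-independence the issue asks for. One caveat of plain concatenation is that the delimiter must never occur inside a CURIE, otherwise two different sets could produce the same string.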

For the GET request, making one query per set is probably fine. For the POST request, we will want to additionally support sending in multiple sets to be normalized at once, which will probably need something like:

{
  "curies": {
    "set1": ["curie1", "curie2", "curie3"],
    "set2": ["curie1", "curie2", "curie4"]
  }
}

gaurav commented 4 days ago

In the future, we might want to make this reversible. We could do that by storing the generated set ID, but it might be easier to generate a reversible ID. The dumb way to do that would be to concatenate the normalized CURIEs together, but that could get very long. A better idea might be to create that string, gzip it, and then encode the result as a base64 string. Some experimentation would be necessary, but CURIEs have so many repeated elements that I'm optimistic we could compress them quite a bit. However, we don't currently have a use case for a reversible ID and might never have one, so we shouldn't think about this unless it's pretty trivial to implement.
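For experimentation, the gzip-then-base64 idea could look roughly like the Python sketch below. The function names `encode_set_id` and `decode_set_id` are hypothetical, as are the `||` delimiter and the choice of the URL-safe base64 alphabet (so the ID can sit in a URL path or query string):

```python
import base64
import gzip


def encode_set_id(normalized_curies, delimiter="||"):
    """Pack a set of normalized CURIEs into a reversible, URL-safe string."""
    # Sort and de-duplicate so the same set always yields the same input bytes.
    unique_sorted = sorted(set(normalized_curies))
    raw = delimiter.join(unique_sorted).encode("utf-8")
    # mtime=0 keeps a timestamp out of the gzip header, so repeated calls
    # in the same environment give the same bytes.
    compressed = gzip.compress(raw, mtime=0)
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def decode_set_id(set_id, delimiter="||"):
    """Recover the sorted list of normalized CURIEs from an encoded set ID."""
    raw = gzip.decompress(base64.urlsafe_b64decode(set_id))
    text = raw.decode("utf-8")
    return text.split(delimiter) if text else []
```

Round-tripping through `decode_set_id(encode_set_id(curies))` returns the sorted, de-duplicated list. Compressed bytes are not guaranteed to be identical across zlib versions or compression settings, so an ID like this is probably safer to treat as reversible than as a canonical key for equality checks. For small sets the gzip header overhead may also outweigh any savings, which is part of what the experimentation would need to measure.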

It might also be a good idea to stick a version number in the set ID.