TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

Add a /get_setid GET endpoint #274

Closed gaurav closed 4 months ago

gaurav commented 4 months ago

Adds a /get_setid GET endpoint, which can be used to calculate a set ID for a set of CURIEs (implementing the specification described in https://github.com/TranslatorSRI/NodeNormalization/issues/256#issuecomment-2197396610). The CURIEs are normalized (note that we don't do any validation to ensure that they are valid CURIEs or even that they look like CURIEs), sorted, and then returned as an SHA-256 hash. This PR also modifies /get_allowed_conflations so that it returns DrugChemical in addition to GeneProtein.
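For illustration, the normalize, sort, and hash steps described above might look like the following minimal sketch. It is not the code in this PR: `normalize_curie`, the `equivalents` lookup table, the `||` delimiter, and the de-duplication step are all assumptions made for the example, and the real endpoint normalizes CURIEs through NodeNorm itself.

```python
import hashlib


def normalize_curie(curie, equivalents):
    """Hypothetical stand-in for NodeNorm normalization.

    Maps a CURIE to its preferred CURIE if one is known, otherwise returns
    the input unchanged. No validation is done on the input string.
    """
    return equivalents.get(curie, curie)


def sha256_setid(curies, equivalents=None, delimiter="||"):
    """Return a SHA-256 hex digest for a set of CURIEs.

    The delimiter and the de-duplication are assumptions of this sketch.
    """
    equivalents = equivalents or {}
    # Normalize, de-duplicate and sort so that input order does not matter.
    normalized = sorted({normalize_curie(c, equivalents) for c in curies})
    normalized_string = delimiter.join(normalized)
    return hashlib.sha256(normalized_string.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(sha256_setid(["MESH:D014867", "NCIT:C34373"]))
```

Because the CURIEs are sorted before hashing, the same set submitted in a different order produces the same hash.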

This PR does not include tests for conflations, but I'll add those in a separate PR. It also does not include a POST endpoint that would allow multiple sets to be submitted at once; I'll write that once this PR has been reviewed and merged.

This PR also includes a base64 representation of the normalized string (which is generally larger than the normalized string) and a compressed version of the normalized string in base64 (which isn't that much shorter). Neither of these seems like a useful way to generate a set ID, but I'm going to leave them in for now.
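A rough sketch of why the two base64 forms don't help much, assuming zlib as the compression step (the PR description doesn't say which compressor is used) and a made-up `||`-delimited normalized string:

```python
import base64
import zlib

# Hypothetical normalized string: sorted CURIEs joined with an assumed delimiter.
normalized_string = "||".join(sorted(["MESH:D014867", "NCIT:C34373", "CHEBI:15377"]))
raw = normalized_string.encode("utf-8")

# Plain base64 expands every 3 bytes into 4 characters, so it is always longer.
plain_b64 = base64.b64encode(raw).decode("ascii")

# zlib (an assumption here) adds a header and checksum, so short inputs gain little.
compressed_b64 = base64.b64encode(zlib.compress(raw, 9)).decode("ascii")

print(len(normalized_string), len(plain_b64), len(compressed_b64))
```

Both forms are reversible, unlike a hash, but their length still grows with the number of CURIEs in the set.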

gaurav commented 4 months ago

> I think(?) the sha hash for brevity. But could we go even smaller? sha64hash?

Not really -- there is SHA-1, which is 160 bits, but I believe it's possible to generate multiple messages that hash to the same SHA-1 value -- I don't think that's relevant to our usage, but it seems like a bad idea to standardize on. There is also SHA-224, which is SHA-2 with slightly fewer bits (224 vs 256), so it's a little shorter but not by much. We could try something like taking an SHA-256 hash and only using the first 10 characters or so, but I don't know how that changes the probability of getting duplicates. By way of comparison, a UUID is only 128 bits, so it's possible to generate identifiers of about that length and still be pretty sure about uniqueness. We could try taking the first half of an SHA-256 hash (which would be 128 bits, not 64), but I don't know of a good way to ensure uniqueness.
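As a rough way to reason about the duplicate question, the standard birthday-bound approximation can be sketched as below. This assumes the hash output behaves like a uniformly random value, that "10 characters" means 10 hex characters (40 bits), and uses one million sets as a purely hypothetical workload; `collision_probability` and `truncated_sha256` are names invented for this example.

```python
import hashlib
import math


def collision_probability(n_sets, bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n_sets * (n_sets - 1) / 2 ** (bits + 1))


def truncated_sha256(text, hex_chars):
    """Return the first `hex_chars` hex characters of a SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:hex_chars]


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of distinct sets
    for label, bits in [
        ("first 10 hex chars", 40),
        ("first half of SHA-256", 128),
        ("SHA-224", 224),
    ]:
        print(f"{label:>22}: {collision_probability(n, bits):.3e}")
```

Under these assumptions, truncating to 40 bits gives a substantial chance of at least one duplicate among a million sets, while 128 bits and above make it negligible.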

I think maybe we go with SHA-224 for now? It's only slightly worse than SHA-256, and is still a standard hash without us having to do anything else with it.

> I suspect that the identifier needs to have the form of a curie to pass TRAPI validation.

How about something like SETID:1[identifier]? The 1 could be used as a version string, so if we come up with a better way of generating hashes in the future, we could use them as SETID:2[identifier]... or is that overengineering, and we should just stick with SETID:[identifier]?

cbizon commented 4 months ago

What about UUIDs? "uuid" is already a resolvable namespace I think?

gaurav commented 4 months ago

Hmm, uuid: is a pretty standard prefix, but I don't see it in the Biolink Model prefix map, which is where I think the prefix would need to be in order to be validated? Once we choose a prefix, we can probably get it into the Biolink Model pretty quickly if we need to.

We could use UUID v5 identifiers with a custom namespace -- they're based on an SHA-1 hash of the input text in combination with a namespace identifier (itself a UUID), and are supported by Python's uuid library. UUIDs are only 128 bits long, so they're not quite as well-defended against collisions as SHA-224, but that's probably good enough for our purposes. I'll implement this tomorrow and we can see what it looks like.

gaurav commented 4 months ago

@cbizon I've updated the setid GET endpoint to return UUID v5 with a custom namespace ('14ef168c-14cb-4979-8442-da6aaca55572'). The setid now has a uuid: prefix as well, e.g. uuid:fa89b48e-22d0-53c2-8e1a-e32e1fac6f4c.
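A minimal sketch of generating such an identifier with Python's uuid library, using the namespace quoted above. The `uuid_setid` helper, the `||` delimiter, and the assumption that the CURIEs are already normalized are illustrative only, so this is not expected to reproduce the example set ID above.

```python
import uuid

# Custom namespace quoted in the comment above.
SETID_NAMESPACE = uuid.UUID("14ef168c-14cb-4979-8442-da6aaca55572")


def uuid_setid(normalized_curies, delimiter="||"):
    """Return a `uuid:`-prefixed UUID v5 for a set of already-normalized CURIEs.

    The delimiter and de-duplication are assumptions of this sketch.
    """
    normalized_string = delimiter.join(sorted(set(normalized_curies)))
    # uuid5 hashes the namespace and name with SHA-1 and keeps 128 bits.
    return f"uuid:{uuid.uuid5(SETID_NAMESPACE, normalized_string)}"


if __name__ == "__main__":
    print(uuid_setid(["MESH:D014867", "NCIT:C34373"]))
```

Since UUID v5 is deterministic, any client that knows the namespace and the normalized string format can recompute the same set ID locally.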