facilityregistry / fred-api

Facility Registry API Documentation Website

Using GUIDs for IDs #26

Closed nihilchk closed 11 years ago

nihilchk commented 11 years ago

Currently the spec allows IDs to not be globally unique. The FRED specification says that "the system will not return the same system id twice when creating two facilities", but ids might clash between registries.

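This can cause a problem in the following situations:

  • An index that refers to facilities in multiple countries e.g. Rwanda and Uganda.
  • While merging the list of facilities from more than one registry.
  • When migrating facilities from one registry to another. (This is a problem for our current project in Uganda).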

We suggest that IDs should be 128-bit GUIDs. This will avoid all problems of ID clashes between different systems. Libraries for generating GUIDs are easily available for all platforms and programming languages.

Implementations would be free to choose shorter IDs to include in URLs, because they control the namespace of their own URLs and therefore there is no chance of collision.
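A minimal sketch of this separation in Python, assuming a hypothetical in-memory registry: the facility's id is a random 128-bit UUID from the standard library, while the URL is built from a short registry-local number. The `Registry` class, the `create_facility` method, the example host, and the URL layout are illustrative assumptions, not part of the FRED spec.

```python
import itertools
import uuid


class Registry:
    """Hypothetical in-memory registry; not part of the FRED spec."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._slugs = itertools.count(1)  # short, registry-local identifiers
        self.by_slug = {}

    def create_facility(self, name):
        slug = str(next(self._slugs))
        facility = {
            "id": str(uuid.uuid4()),                      # globally unique identity
            "name": name,
            "url": f"{self.base_url}/facilities/{slug}",  # short, local URL
        }
        self.by_slug[slug] = facility
        return facility


registry = Registry("https://registry.example.org")
clinic = registry.create_facility("Example Clinic")
```

The short number only has to be unique within one registry's URL space, so it can stay small; the `id` field carries the identity that survives a migration.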

We haven't been able to catch up with Morten about this yet. Hopefully he will chime in with his perspective shortly.

-Bharti & Chris

mortenoh commented 11 years ago

Feel free to refer to them as UUIDs, since GUIDs are Microsoft's implementation of RFC 4122.

We (DHIS 2) do not feel this change in the spec is necessary, and so it should be left up to the implementing system, as it is already defined.

Morten

On Tue, Dec 18, 2012 at 3:35 PM, Bharti Nagpal notifications@github.com wrote:

Currently the spec allows IDs to not be globally unique. The FRED specification says that "the system will not return the same system id twice when creating two facilities", but ids might clash between registries.

This can cause a problem in the following situations:

  • An index that refers to facilities in multiple countries e.g. Rwanda and Uganda.
  • While merging the list of facilities from more than one registry.
  • When migrating facilities from one registry to another. (This is a problem for our current project in Uganda).

We suggest that IDs should be 128-bit GUIDs. This will avoid all problems of ID clashes between different systems. Libraries for generating GUIDs are easily available for all platforms and programming languages.

Implementations would be free to choose shorter IDs to include in URLs, because they control the namespace of their own URLs and therefore there is no chance of collision.

We haven't been able to catch up with Morten about this yet. Hopefully he will chime in with his perspective shortly.

-Bharti & Chris


bobjolliffe commented 11 years ago

I think we must be clearer on the semantics of identifiers, and Chris, you are right to point to https://github.com/facilityregistry/fred-api/issues/27.

What this reference points to is that the (unspoken) use of identifiers within the draft spec as it stands implies some system generated id which is suitable for use in a URL. (I would go further and state that it be suitable for use as an ID/IDREF. I know we are all jsonified at the moment, but we will still want to use these ids as IDREFS within xml documents).

Of the existing implementations that we know of, resource mapper uses an integer for this and DHIS uses an 11 char string.

I understand the use case for a uuid, but I think its semantics are sufficiently different that we should not try to mangle the two into the same id. uuids are not suitable as ID/IDREFs, for example, and they make horrible urls.

I would propose as a resolution to this issue, an additional (and optional) core property called uuid which conforms to the OSF spec (provide normative reference and layout preference).

(In DHIS2 we created our 11 char string in an attempt to mangle these semantics. I'm not sure how good an idea that was but it works well for all our metadata. For that reason an additional uuid is certainly possible, but not actually necessary. Hence would prefer that uuid be optional.)

I would also propose improving the language around the current description of id to highlight its relationship with issue 27.

mberg commented 11 years ago

core property id represents the UID.

The intent of the id is it should be universally unique like a UID.

ctford commented 11 years ago

@bobjolliffe To me, the point of an id field is that it supports identity in the face of changing URLs (as in, for example, Atom feeds). Without that concept, the url field would be sufficient to identify the facility. It seems that by supporting an identity separate from the URL we are acknowledging that we may want to migrate or merge lists of facilities, and that therefore we need universally unique ids.

@mberg Is there an advantage to specifying the id must be universally unique without settling on a scheme? Do we envisage people using unique ids that aren't UUIDs?

bobjolliffe commented 11 years ago

On 19 December 2012 20:00, Chris Ford notifications@github.com wrote:

@bobjolliffe To me, the point of an id field is that it supports identity in the face of changing URLs (as in, for example, Atom feeds). Without that concept, the url field would be sufficient to identify the facility. It seems that by supporting an identity separate from the URL we are acknowledging that we may want to migrate or merge lists of facilities, and that therefore we need universally unique ids.

There are many points for an id field. Which lead to many characteristics. Most of which come down to the scope of uniqueness, the length, randomness vs curated etc.

Ids which are useful for patients are not necessarily ideal for driving licences. Ids for uniquely identifying hard discs are not necessarily ideally suited for clinics.

This particular one seems to be targeted at use in the composition of a url using the http: or https: scheme. At least that is how it is referred to in the rest of the document.

The url field (or perhaps better a uri field) would also indeed be sufficient for uniquely identifying a facility. Having a uri field would make a lot of sense and would allow the incorporation of schemes like urn:uuid:... etc.
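As a rough illustration of the urn:uuid: scheme mentioned above, Python's standard `uuid` module can emit and parse that form. This is only a sketch: the variable names are illustrative, and no `uri` field exists in the draft spec.

```python
import uuid

facility_uuid = uuid.uuid4()

# RFC 4122 registers the "urn:uuid:" namespace; the standard library
# exposes it through the .urn attribute (the value itself is random).
facility_uri = facility_uuid.urn

# uuid.UUID() accepts the URN form, so the round trip is lossless.
parsed = uuid.UUID(facility_uri)
```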

I'd be happy enough to see that, but still reluctant to coerce this requirement to the system id field and thus oblige implementations to use it in the composition of urls.

@mberg Is there an advantage to specifying the id must be universally unique without settling on a scheme? Do we envisage people using unique ids that aren't UUIDs?

Good question. DHIS2 for example generates a pseudo random identifier using 11 hexadecimal digits with the restriction that the first character must be a letter. I don't think that Matt has said that the id "must be universally unique". Nobody could measurably claim compliance to that. He says "the intent of the id is it should be universally unique". With a collision space of 10^13 that should be the case for internal identifiers for facilities in DHIS2.
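The size of that identifier space can be checked with a little arithmetic. The sketch below takes the description above at face value (11 hexadecimal digits, first character a letter a-f) and uses the standard birthday-bound approximation for the chance of at least one collision. The figure of 100,000 identifiers is an arbitrary assumption chosen for illustration.

```python
import math

HEX_DIGITS = 16
HEX_LETTERS = 6   # a-f
LENGTH = 11

# 16^11, about 1.8 x 10^13: all 11-digit hexadecimal strings.
unrestricted = HEX_DIGITS ** LENGTH

# 6 * 16^10, about 6.6 x 10^12: first character restricted to a letter.
letter_first = HEX_LETTERS * HEX_DIGITS ** (LENGTH - 1)


def collision_probability(n_ids, space):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


# Assumed workload: 100,000 randomly generated identifiers.
p = collision_probability(100_000, letter_first)
```

Under these assumptions the chance of any collision among 100,000 identifiers is below one in a thousand, which is in line with the order of magnitude quoted above.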

We could easily also generate a 128 bit uuid if that was required. Though we did have these for years and finally trimmed them down last year to a more friendly scheme so it would be sad to have to go back on that just to satisfy a particular fundamentalist perspective on randomness. (the impact would be quite far-reaching as our id scheme applies to every metadata item in DHIS not just facilities).

And if we did then we would probably still be reluctant to use these in urls.

Bob

PS. The XML spec for ID/IDREF has some useful formulation which also takes into account non-ASCII UTF-8 characters.

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

For me it is important that an identifier should also be a valid ID/IDREF.
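To make the ID/IDREF constraint concrete, here is a simplified check in Python. It covers only the ASCII subset of the Name production quoted above (the non-ASCII ranges are left out for brevity, which is an assumption of this sketch), and the sample identifiers are made up for illustration.

```python
import re

# ASCII-only subset of the Name production: NameStartChar followed by
# any number of NameChar. Non-ASCII ranges are deliberately omitted.
XML_NAME_ASCII = re.compile(r"[:A-Za-z_][:A-Za-z_\-.0-9]*")


def is_xml_name(candidate):
    """Return True if the whole string matches the ASCII Name subset."""
    return XML_NAME_ASCII.fullmatch(candidate) is not None


# An 11-character id that starts with a letter is a valid Name.
letter_first_ok = is_xml_name("a1b2c3d4e5f")

# A UUID in its usual text form that starts with a digit is not.
digit_first_ok = is_xml_name("123e4567-e89b-12d3-a456-426614174000")
```

A randomly generated UUID begins with a digit more often than not, so it cannot be relied on as an ID/IDREF unless it is prefixed or otherwise transformed.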


ctford commented 11 years ago

@bobjolliffe:

I'd be happy enough to see that, but still reluctant to coerce this requirement to the system id field and thus oblige implementations to use it in the composition of urls.

I agree that a full-length UUID would be awkward in the URL, but we don't need it to be. If id is for identity, and url for retrieving the resource, then perhaps we should be content with having them independent.

If we're considering URNs then I presume we aren't intending to oblige people to dump the whole id (including scheme identifier!) into the URL anyway.

@bobjolliffe:

There are many points for an id field. Which lead to many characteristics. Most of which come down to the scope of uniqueness, the length, randomness vs curated etc.

The nice thing about OSF UUIDs is that there's a well-understood, decentralised and widely-supported method for generating them, which means that a list of facilities is portable and merge-able.

If we took a bunch of facilities from one system and imported them into DHIS2, then DHIS2 would need to be able to add facilities in a way that respects the ids of the existing ones. Using OSF UUIDs would make that straightforward. A mixture of id schemes would be a pain.
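A small sketch of that import scenario in Python. The `merge` helper and the facility records are hypothetical and exist only to contrast registry-local integer ids with independently generated UUIDs.

```python
import uuid


def merge(*registries):
    """Merge facility lists keyed by id; report ids claimed by different facilities."""
    merged, clashes = {}, []
    for facilities in registries:
        for facility in facilities:
            existing = merged.get(facility["id"])
            if existing is not None and existing != facility:
                clashes.append(facility["id"])
            else:
                merged[facility["id"]] = facility
    return merged, clashes


# Registry-local integer ids: both registries start counting at 1.
rwanda = [{"id": 1, "name": "Facility A"}]
uganda = [{"id": 1, "name": "Facility B"}]
_, local_clashes = merge(rwanda, uganda)

# UUIDs generated independently by each registry.
rwanda_uuid = [{"id": str(uuid.uuid4()), "name": "Facility A"}]
uganda_uuid = [{"id": str(uuid.uuid4()), "name": "Facility B"}]
merged, uuid_clashes = merge(rwanda_uuid, uganda_uuid)
```

With local integers the two unrelated facilities collide on id 1 and one of them would need to be renumbered; with UUIDs both records can be imported without changing either id.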

ctford commented 11 years ago

@bobjolliffe:

We could easily also generate a 128 bit uuid if that was required. Though we did have these for years and finally trimmed them down last year to a more friendly scheme so it would be sad to have to go back on that just to satisfy a particular fundamentalist perspective on randomness. (the impact would be quite far-reaching as our id scheme applies to every metadata item in DHIS not just facilities).

To reiterate, there's no need to change the internal ids used by DHIS2 and no need to expose a large UUID in the URL. The advantage of having ids generated and formatted in an agreed way is that we can decouple them from the internal design of the systems hosting them e.g. Resource Map or DHIS2. If we want our data to live beyond the systems that host it, that's important.

In DHIS2, for example, an id could be generated and attached to a facility in the same way as any other field like latitude or name.