CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

CoL Identifier style #491

Closed mdoering closed 3 years ago

mdoering commented 5 years ago

For the primary objects (taxa, names, references) we should assign stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.

Integers offer a much smaller memory footprint and are useful for keeping all data in memory, e.g. when assembling. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If a context mandates that they be globally unique, they should either be prefixed with a col namespace, e.g. in a written publication, or be appended to a col base URI/resolver like http://catalogueoflife.org/name/123456. The URI as such should not be used as the id in the strict sense, as that would prevent us from easily changing the URI/resolver/domain over time. http://catalogueoflife.org is a rather long domain already.

Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numerical system with a higher base/radix would reduce string length, e.g. hexadecimal, 22 or 26 latin characters plus 10 numbers (i.e. latin 32/36) or BASE64 which is case sensitive and uses + and / to reach 64 unique characters. Examples:

| int (base10) | hex (base16) | latin29 (base29) | latin32 (base32) | latin36 (base36) | Base64 (base64) | proquint (32bit) |
|---|---|---|---|---|---|---|
| 18 | 12 | N | L | I | S | babab-babif |
| 1089 | 441 | 3BL | 343 | U9 | RB | babab-bidad |
| 1781089 | 1B2D61 | 4K2TW | 3QDD3 | 126AP | Gy1h | babir-fujod |
| 4781089 | 48F421 | 8S326 | 6KX33 | 2UH41 | SPQh | badam-zibod |
| 12781089 | C30621 | N43J8 | E83K3 | 7LXY9 | wwYh | bagag-bimod |
| 2147483647 | 7FFFFFFF | 5MQ9CB9 | 3ZZZZZZ | ZIK0ZJ | B///// | luzuz-zuzuz |

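The hex, latin36 and Base64 columns can be reproduced with plain positional encoding over a chosen alphabet. The following is a minimal Python sketch, not the backend implementation; the function name `encode` and the alphabet constants are illustrative assumptions, and the Base64 column is assumed to use the standard Base64 digit order as a radix-64 number system.

```python
import string


def encode(n: int, alphabet: str) -> str:
    """Write a non-negative integer in base len(alphabet) using the given digits."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    digits = []
    while True:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
        if n == 0:
            break
    # Digits were collected least significant first
    return "".join(reversed(digits))


HEX = "0123456789ABCDEF"
LATIN36 = string.digits + string.ascii_uppercase
# Standard Base64 digit order, used here as a positional number system
BASE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

for value in (18, 1089, 1781089, 2147483647):
    print(value, encode(value, HEX), encode(value, LATIN36), encode(value, BASE64))
```

The larger the alphabet, the shorter the string: the largest 32bit signed integer needs 10 decimal digits but only 6 characters in base 64.
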
timrobertson100 commented 5 years ago

One issue to consider is that it can be difficult (impossible?) to distinguish between 0,O and 1,I,l etc. if encoded in a latin character set, especially across different fonts. This can be avoided by removing certain characters (1,I,l,O,0 etc.) from the palette.

There have been studies on readability for this kind of thing - personally I find it easiest with numbers (e.g. my bank account, IP addresses) rather than encoded versions (e.g. copying GBIF DOIs).

mdoering commented 5 years ago

I have added a Latin32 encoding that does not contain the ambiguous characters 1I0O. Looks like a flight booking code now :)
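One way such an alphabet can be derived is to start from the 10 digits plus 26 uppercase letters and filter out the four ambiguous characters. This is a sketch under that assumption; the names are illustrative, and the digit order is inferred from the example table in the first comment, where 18 maps to L and 1089 to 343.

```python
import string

AMBIGUOUS = set("1I0O")

# 36 candidate characters minus the 4 ambiguous ones leaves 32
LATIN32 = "".join(
    c for c in string.digits + string.ascii_uppercase if c not in AMBIGUOUS
)


def encode(n: int, alphabet: str = LATIN32) -> str:
    """Positional encoding of a non-negative integer over the given alphabet."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    out = alphabet[n % base]
    n //= base
    while n > 0:
        out = alphabet[n % base] + out
        n //= base
    return out


print(LATIN32)
print(encode(1089), encode(4781089))
```
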

mdoering commented 5 years ago

Pronounceable proquints are an interesting solution. The 32bit version has a fixed length of two 5 character words, each consisting of a cvcvc sequence of consonants (c) and vowels (v).

Examples: lusab-babad, gutih-tugad, gutuk-bisog or mudof-sakat

A single 7 char word of the form cvcvcvc, e.g. gutukis, has 22 bits = 4.2 million options; 8 chars of the form cvcvcvcv, e.g. gutukiso, have 24 bits, enough for 16 million.
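The bit counts follow from the proquint scheme, in which each consonant carries 4 bits (16 consonants) and each vowel 2 bits (4 vowels). Below is a small sketch of the 32bit encoding and of the capacity arithmetic; the function names are illustrative, and the consonant and vowel tables are the ones from the proquint proposal.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def quint(word16: int) -> str:
    """Encode 16 bits as one cvcvc word (4 + 2 + 4 + 2 + 4 bits)."""
    return (
        CONSONANTS[(word16 >> 12) & 0xF]
        + VOWELS[(word16 >> 10) & 0x3]
        + CONSONANTS[(word16 >> 6) & 0xF]
        + VOWELS[(word16 >> 4) & 0x3]
        + CONSONANTS[word16 & 0xF]
    )


def proquint32(n: int) -> str:
    """Encode an unsigned 32bit integer as two hyphen-separated quints."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("value does not fit into 32 bits")
    return quint(n >> 16) + "-" + quint(n & 0xFFFF)


def capacity_bits(pattern: str) -> int:
    """Number of bits a pattern such as 'cvcvcvc' can hold."""
    return sum(4 if ch == "c" else 2 for ch in pattern)


print(proquint32(1781089))
for pattern in ("cvcvcvc", "cvcvcvcv"):
    bits = capacity_bits(pattern)
    print(pattern, bits, 2 ** bits)
```
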

timrobertson100 commented 5 years ago

Suggest you plan for extensibility as 16M is not a lot. <=5 letter groups are also easier to read than e.g. 7 character groups. Perhaps consider 2 groupings of 4 chars knowing you can grow to 2 groupings of 5 chars, and then 3 groupings of 4 etc?

gdower commented 5 years ago

CoL has been criticized in the past for not having stable, resolvable IDs. I'd suggest adding a namespace prefix for the GSDs so that IDs are unique and could possibly be accessed by URL or with an ID resolver service.

I agree that it's important to not use ambiguous characters (0 O I 1, etc.) in case URLs are published in print publications.

I'd recommend not using proquints, because it seems like some of them could be confused as scientific names and possibly some shorter scientific names could even be coincidentally replicated as an ID for the wrong taxon (e.g. Biton velox as biton-velox), which will confuse people especially if the wrong taxon page shows up as Google search result from Biton velox keywords in the URL.

Matt suggested that we look at the PURL approach to decoupling IDs from resolvability. It also includes ID prefix name spaces.

ayco-at-naturalis commented 5 years ago

We could make them (globally) unique across releases by including the release version in the ID

On Mon, 23 Sep 2019 at 14:53, Markus Döring notifications@github.com wrote:

For the primary objects (taxa, names, references) we should assign as much as possible stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.

Integers offer a much smaller memory footprint and are really useful for keeping all data in memory. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If they ought to be globally unique they should either be prefixed by a fake namespace col: if the context is clear, e.g. in a written publication, or be added to a col base URI/resolver. The URI as such should not be used as the id in the strict sense, as keeping the two separate allows us to change the URI/resolver/domain over time easily. http://catalogueoflife.org is a rather long domain already.

Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numerical system with a higher base/radix would reduce string length, e.g. hexadecimal, all 26 latin characters plus 10 numbers or BASE64 https://en.wikipedia.org/wiki/Base64#Base64_table which is all case sensitive latin chars plus + and /.

| int (base 10) | hex (base 16) | latin+num (base 36) | Base64 (base 64) |
|---|---|---|---|
| 1.089 | 441 | U9 | h1 |
| 1.781.089 | 1B2D61 | 126AP | 6ORx |
| 12.781.089 | C30621 | 7LXY9 | MMox |

Recommendation is to use integers for internal calculations and expose them as BASE64 strings using - and _ instead of + and / so they do not need any URL encoding.
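A sketch of what that recommendation could look like, assuming a positional base-64 encoding with the URL-safe digit order of RFC 4648 (`-` and `_` in place of `+` and `/`). The helper name `encode_urlsafe` is hypothetical, and this is not necessarily the encoding used for the Base64 values quoted in this message.

```python
import string
from urllib.parse import quote

# URL-safe Base64 digit order: A-Z, a-z, 0-9, '-', '_'
URLSAFE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def encode_urlsafe(n: int) -> str:
    """Positional base-64 encoding of a non-negative integer."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = URLSAFE64[n % 64]
    n //= 64
    while n > 0:
        out = URLSAFE64[n % 64] + out
        n //= 64
    return out


max_int = 2147483647
url_safe = encode_urlsafe(max_int)
classic = url_safe.replace("-", "+").replace("_", "/")

# The URL-safe form survives percent-encoding unchanged, the classic one does not
print(url_safe, quote(url_safe, safe=""))
print(classic, quote(classic, safe=""))
```
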

mdoering commented 5 years ago

As much as I think PURLs do a good job compared to all the other global identifiers, I really do not want URLs to be the real identifier. We should definitely define a resolution service (the API is in fact one already), but the IDs themselves should be decoupled. As long as we have short and stable ids we can use them in any context and technology easily. We can provide a resolution service that returns JSON, HTML, LD, TXT or whatever comes next. But dealing with URLs locks you unnecessarily into a cage.

@ayco-at-naturalis the point of stable ids is that they do not change across versions unless they are really used for different things. @gdower I agree with you against proquints for that very reason. I was already wondering myself whether it's wise to have a well pronounceable string that ends up being used like names.

That to me leaves the options of classic integers or Latin32 with the ambiguous characters removed.

mdoering commented 5 years ago

@gdower obviously there is always the chance that some id will represent a real scientific genus name. ABIES exists in Latin32 or any of the other ones that include alphabetical characters.

timrobertson100 commented 5 years ago

> I really do not want URLs to be the real identifier

+1 The key decision is what value should be held on the database table. That decision can be decoupled from external formatting of the ID (e.g. URL, URN etc), serialization formats (e.g. how content negotiation could/should be supported) and resolution (e.g. PURL, DOI, LSID).

> I'd recommend not using proquints, because it seems like some of them could be confused as scientific names

Excellent point. Vernaculars too (blue-tit), also across languages (sol-sort, Danish for blackbird), etc.

mdoering commented 4 years ago

The latin32 encoding is the favorite at this point and will be implemented

mdoering commented 3 years ago

We need to reopen the issue as there are two issues with the latin32 charset to generate identifiers:

1) we create offensive words like FUCK, ANAL, ARSE
2) we generate organism names, notably genera, that are used as identifiers for other taxa, e.g. PUMA, CAREX, ABIES

Avoiding manually selected identifiers from a deny list is an option, but it will always miss some entries and is very difficult to maintain across languages. A simpler solution would be to drop all vowels, which are essential in any language to form words. Of the 5 vowels, I and O are already missing from latin32, so only A, E and U need to go and we would end up with latin29 instead.
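A quick check of the vowel-dropping idea, as a sketch: it assumes the Latin32 alphabet is the digits and uppercase letters without 1, I, 0 and O, and the helper `can_spell` is a hypothetical name. Because I and O are already excluded from Latin32, removing the vowels takes away only A, E and U.

```python
LATIN32 = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"
VOWELS = set("AEIOU")

# Keep the original digit order, just skip the vowels
NO_VOWELS = "".join(c for c in LATIN32 if c not in VOWELS)


def can_spell(word: str, alphabet: str) -> bool:
    """True if every character of the word is a digit of the alphabet."""
    return all(c in alphabet for c in word.upper())


print(NO_VOWELS, len(NO_VOWELS))
for word in ("FUCK", "ANAL", "ARSE", "PUMA", "CAREX"):
    print(word, can_spell(word, LATIN32), can_spell(word, NO_VOWELS))
```

Unlike a deny list, this needs no maintenance: any identifier that could collide with a word containing A, E, I, O or U simply cannot be generated.
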

mdoering commented 3 years ago

See also https://stackoverflow.com/questions/956556/is-it-irrational-to-sanitize-random-character-strings-for-curse-words

Apparently Microsoft omits the following from their product keys:

0 1 2 5 A E I O U L N S Z

timrobertson100 commented 3 years ago

I read in various places that Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys. This may be safer than simply vowels.

gdower commented 3 years ago

What if we put a number between every letter? I guess it's still potentially offensive though? 2F4U6C7K9 1F1U1C1K1

mdoering commented 3 years ago

Then things get much harder to en/decode. The beauty with just the alphabet is that you can easily convert back and forth to an integer. I want to keep that
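To illustrate the round trip, here is a minimal sketch of encoding and decoding over a single alphabet. It assumes the Latin32 alphabet without 1, I, 0 and O; the function names are illustrative and not taken from the backend code.

```python
LATIN32 = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def encode(n: int, alphabet: str = LATIN32) -> str:
    """Integer to string: repeated division by the alphabet size."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    out = alphabet[n % base]
    n //= base
    while n > 0:
        out = alphabet[n % base] + out
        n //= base
    return out


def decode(text: str, alphabet: str = LATIN32) -> int:
    """String to integer: Horner's scheme over the digit values."""
    base = len(alphabet)
    n = 0
    for ch in text:
        # str.index raises ValueError for characters outside the alphabet
        n = n * base + alphabet.index(ch)
    return n


# Every integer survives the round trip
assert all(decode(encode(i)) == i for i in range(50000))
print(encode(1781089), decode("3QDD3"))
```

One caveat of this scheme: `2` is the zero digit here, so a string with leading `2` characters decodes to the same integer as the string without them. Only the form without a leading zero digit should be issued as an identifier.
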

mdoering commented 3 years ago

> Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys

We already drop 0O and 1I in latin32. I reckon they decided to drop 5S and 2Z because they are hard to distinguish if a human with bad eyes needs to read the key. But LN? Dropping the vowels in addition to 0O and 1I as in latin32 is safe and good enough.

timrobertson100 commented 3 years ago

Just to mention there are also IDs like PUMA, DUCK which are not ideal, and names like MATT that would also be removed by stripping vowels.

olafbanki commented 3 years ago

Agree to stripping vowels. @mdoering if new IDs need to be re-issued this should be done at the earliest convenience before users start to use the new API more heavily.

mdoering commented 3 years ago

Agree. I will do this Monday first thing then. Do you, @chantalhuijbers or @dhobern want to send out a quick communication that the IDs will have to be changed on Monday and should not be regarded as stable until then? (temporary) blog post & API mailing list maybe? Or at least to Niels...

olafbanki commented 3 years ago

Sounds good Markus, many thanks

gdower commented 3 years ago

I might need to re-run conversion again to generate the new ID map. That would mean that the new ID mapping would be available sometime on Tuesday.

mdoering commented 3 years ago

We implemented what we now call LATIN29 in the code, i.e. LATIN32 minus the vowels, resulting in the following 29 case insensitive chars:

23456789BCDFGHJKLMNPQRSTVWXYZ
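As an illustration, the latin29 column of the example table in the first comment can be reproduced with the sketch below. It is not the backend code: the function names are hypothetical, and treating lower-case input as equivalent on decoding is an assumption based on the alphabet being described as case insensitive.

```python
LATIN29 = "23456789BCDFGHJKLMNPQRSTVWXYZ"
DIGIT_VALUE = {ch: i for i, ch in enumerate(LATIN29)}


def encode_id(n: int) -> str:
    """Encode a non-negative integer as an upper-case LATIN29 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = LATIN29[n % 29]
    n //= 29
    while n > 0:
        out = LATIN29[n % 29] + out
        n //= 29
    return out


def decode_id(text: str) -> int:
    """Decode a LATIN29 string, accepting upper or lower case."""
    n = 0
    for ch in text.upper():
        if ch not in DIGIT_VALUE:
            raise ValueError(f"character {ch!r} is not part of LATIN29")
        n = n * 29 + DIGIT_VALUE[ch]
    return n


for value in (18, 1089, 1781089, 2147483647):
    print(value, encode_id(value))
```
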

Adding examples to the top list