Closed mdoering closed 3 years ago
One issue to consider is it can be difficult (impossible?) to distinguish between 0,O and 1,I,l etc if encoded in a latin character set; especially so if across different fonts. This can be avoided by removing certain characters (1,I,l,O,0 etc) from the palette
There have been studies on readability for this kind of thing - personally I find it easiest with numbers (e.g. my bank account, IP addresses) rather than encoded versions (e.g. copying GBIF DOIs).
I have added a Latin32 encoding that does not contain the ambiguous characters 1I0O.
Looks like a flight booking code now :)
Pronounceable proquints are an interesting solution. The 32-bit version has a fixed length of two 5-character words, each consisting of a cvcvc consonant (c) vowel (v) sequence.
Examples: lusab-babad, gutih-tugad, gutuk-bisog or mudof-sakat
A single 7-char word of the form cvcvcvc, e.g. gutukis, has 22 bits = about 4.2 million options; an 8-char word of the form cvcvcvcv, e.g. gutukiso, has 24 bits, enough for 16 million.
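To make the bit counts above concrete, here is a minimal Python sketch of proquint encoding. It assumes the alphabets of the published proquint proposal (16 consonants carrying 4 bits each, 4 vowels carrying 2 bits each); the function names `quint`, `proquint32` and `pattern_bits` are made up for this illustration and are not part of the CoL code.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols -> 4 bits each
VOWELS = "aiou"                  # 4 symbols  -> 2 bits each


def quint(word16: int) -> str:
    """Encode one unsigned 16-bit value as a cvcvc word (4+2+4+2+4 bits)."""
    if not 0 <= word16 <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit integer")
    return "".join([
        CONSONANTS[(word16 >> 12) & 0xF],
        VOWELS[(word16 >> 10) & 0x3],
        CONSONANTS[(word16 >> 6) & 0xF],
        VOWELS[(word16 >> 4) & 0x3],
        CONSONANTS[word16 & 0xF],
    ])


def proquint32(value: int) -> str:
    """Encode an unsigned 32-bit value as two cvcvc words joined by '-'."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit integer")
    return quint(value >> 16) + "-" + quint(value & 0xFFFF)


def pattern_bits(pattern: str) -> int:
    """Number of bits a pattern such as 'cvcvcvc' can carry."""
    return sum(4 if ch == "c" else 2 for ch in pattern)


print(proquint32(0x7F000001))                  # the 32-bit value of 127.0.0.1
print(pattern_bits("cvcvcvc"), 2 ** pattern_bits("cvcvcvc"))
print(pattern_bits("cvcvcvcv"), 2 ** pattern_bits("cvcvcvcv"))
```

With these alphabets the 7-char pattern carries 22 bits (4,194,304 values) and the 8-char pattern 24 bits (16,777,216 values).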
Suggest you plan for extensibility as 16M is not a lot. <=5 letter groups are also easier to read than e.g. 7 character groups. Perhaps consider 2 groupings of 4 chars knowing you can grow to 2 groupings of 5 chars, and then 3 groupings of 4 etc?
CoL has been criticized in the past for not having stable, resolvable IDs. I'd suggest adding a name space prefix for the GSDs so that IDs are unique and possibly could be accessed by URL or with an ID resolver service.
I agree that it's important to not use ambiguous characters (0 O I 1, etc.) in case URLs are published in print publications.
I'd recommend not using proquints, because it seems like some of them could be confused as scientific names and possibly some shorter scientific names could even be coincidentally replicated as an ID for the wrong taxon (e.g. Biton velox as biton-velox), which will confuse people especially if the wrong taxon page shows up as Google search result from Biton velox keywords in the URL.
Matt suggested that we look at the PURL approach to decoupling IDs from resolvability. It also includes ID prefix name spaces.
We could make them (globally) unique across releases by including the release version in the ID
On Mon, 23 Sep 2019 at 14:53, Markus Döring notifications@github.com wrote:
For the primary objects (taxa, names, references) we should assign as much as possible stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.
Integers offer a much smaller memory footprint and are really useful for keeping all data in memory. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If they ought to be globally unique they should either be prefixed by a fake namespace col: if the context is clear, e.g. in a written publication, or be added to a col base URI/resolver. The URI as such should not be used as the id in the strict sense, as keeping the two separate allows us to change the URI/resolver/domain over time easily. http://catalogueoflife.org is a rather long domain already.
Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numerical system with a higher base/radix would reduce string length, e.g. hexadecimal, all 26 latin characters plus 10 numbers or BASE64 https://en.wikipedia.org/wiki/Base64#Base64_table which is all case sensitive latin chars plus + and /.
| int (base 10) | hex (base 16) | latin+num (base 36) | Base64 (base 64) |
|---|---|---|---|
| 1.089 | 441 | U9 | h1 |
| 1.781.089 | 1B2D61 | 126AP | 6ORx |
| 12.781.089 | C30621 | 7LXY9 | MMox |

Recommendation is to use integers for internal calculations and expose them as BASE64 strings using - and _ instead of + and / so they do not need any URL encoding.
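The figures in the table can be reproduced with a generic positional encoder. The Python sketch below is only an illustration: the `encode` helper is hypothetical, and the 64-character alphabet is an assumption (digits, lower case, upper case, then - and _). That ordering happens to reproduce the Base64 column above, but it is not the ordering of the standard Base64 table.

```python
DIGITS = "0123456789"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = UPPER.lower()

HEX = DIGITS + "ABCDEF"
BASE36 = DIGITS + UPPER
# Assumed ordering, chosen because it matches the table; "-" and "_" are URL safe.
URL64 = DIGITS + LOWER + UPPER + "-_"


def encode(number: int, alphabet: str) -> str:
    """Write a non-negative integer in base len(alphabet), most significant digit first."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    if number == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


for n in (1089, 1781089, 12781089):
    print(n, encode(n, HEX), encode(n, BASE36), encode(n, URL64))
```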
Kind regards,
Ayco Holleman Lead Programmer
As much as I think PURLs do a good job compared to all the other global identifiers, I really do not want URLs to be the real identifier. We should definitely define a resolution service (the API is in fact one already), but the IDs themselves should be decoupled. As long as we have short and stable ids we can use them in any context and technology easily. We can provide a resolution service that returns JSON, HTML, LD, TXT or whatever comes next. But dealing with URLs locks you unnecessarily into a cage.
@ayco-at-naturalis the point of stable ids is that they do not change across versions unless they are really used for different things. @gdower I agree with you against proquints for that very reason. I was wondering myself already if it's wise to have a well pronounceable string that ends up being used like names.
That to me leaves the options of classic integers or Latin32 with the ambiguous characters removed
@gdower obviously there is always the chance that some id will represent real scientific genera names. ABIES exists in Latin32 or any of the other ones that include alphabetical characters.
> I really do not want URLs to be the real identifier
+1 The key decision is what value should be held on the database table. That decision can be decoupled from external formatting of the ID (e.g. URL, URN etc), serialization formats (e.g. how content negotiation could/should be supported) and resolution (e.g. PURL, DOI, LSID).
> I'd recommend not using proquints, because it seems like some of them could be confused as scientific names
Excellent point. Vernaculars too (blue-tit) and across languages (sol-sort, Danish for blackbird) etc.
The latin32 encoding is the favorite at this point and will be implemented
We need to reopen the issue as there are two problems with using the latin32 charset to generate identifiers:
1) we create offensive words like FUCK, ANAL, ARSE
2) we generate organism names, notably genera, that are used as identifiers for other taxa, e.g. PUMA, CAREX, ABIES
Avoiding manually selected identifiers from a deny list is an option, but will always miss some entries and is very difficult to maintain across any language. A simpler solution would be to drop all vowels which are essential in any language to form words. These are just 5 chars less, so we would end up with latin27 instead
Apparently Microsoft omits the following from their product keys: 0 1 2 5 A E I O U L N S Z
I read in various places that Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys. This may be safer than simply vowels.
What if we put a number between every letter? I guess it's still potentially offensive though? 2F4U6C7K9, 1F1U1C1K1
Then things get much harder to en/decode. The beauty with just the alphabet is that you can easily convert back and forth to an integer. I want to keep that
> Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys
We already drop 0O and 1I in latin32. I reckon they decided to drop 5S and 2Z because they are hard to distinguish if a human with bad eyes needs to read the key. But LN? Dropping the vowels in addition to 0O and 1I as in latin32 is safe and good enough.
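The trade-off between the candidate character sets can be sketched in a few lines of Python. The helpers `without` and `min_length` and the constant names are hypothetical, and the third alphabet simply applies the omission list quoted in this thread; it is not verified against real product keys. Because latin32 already lacks I and O, removing the vowels takes away only A, E and U.

```python
import string

LATIN36 = string.digits + string.ascii_uppercase


def without(alphabet: str, dropped: str) -> str:
    """Return the alphabet with every character listed in `dropped` removed."""
    return "".join(ch for ch in alphabet if ch not in dropped)


def min_length(id_count: int, alphabet_size: int) -> int:
    """Smallest fixed length k such that alphabet_size ** k >= id_count."""
    if id_count < 1 or alphabet_size < 2:
        raise ValueError("need at least one id and two symbols")
    length, capacity = 1, alphabet_size
    while capacity < id_count:
        length += 1
        capacity *= alphabet_size
    return length


LATIN32 = without(LATIN36, "01IO")           # ambiguous characters removed
NO_VOWELS = without(LATIN32, "AEIOU")        # only A, E, U are actually removed
THREAD_LIST = without(LATIN36, "0125AEIOULNSZ")  # omission list quoted above

for name, alphabet in [("latin32", LATIN32), ("no vowels", NO_VOWELS), ("quoted list", THREAD_LIST)]:
    print(name, len(alphabet), min_length(16_000_000, len(alphabet)))
```

Under these assumptions, covering 16 million ids takes 5 characters with either latin32 or the vowel-free set, and 6 characters with the stricter quoted list.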
Agree to stripping vowels. @mdoering if new IDs need to be re-issued this should be done at the earliest convenience before users start to use the new API more heavily.
Agree. I will do this Monday first thing then. Do you, @chantalhuijbers or @dhobern, want to send out a quick communication that the IDs will have to be changed on Monday and should not be regarded as stable until then? (temporary) blog post & API mailing list maybe? Or at least to Niels...
Sounds good Markus, many thanks
I might need to re-run conversion again to generate the new ID map. That would mean that the new ID mapping would be available sometime on Tuesday.
We implemented what we now call LATIN29 in the code, i.e. LATIN32 minus the vowels, resulting in the following 29 case-insensitive chars:
23456789BCDFGHJKLMNPQRSTVWXYZ
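Converting between an integer and its LATIN29 string is a plain base-29 conversion over the 29 characters listed above. The Python sketch below illustrates that idea only; the function names and the digit order (character at position 0 stands for zero) are assumptions and may differ from the actual implementation in the backend.

```python
LATIN29 = "23456789BCDFGHJKLMNPQRSTVWXYZ"
VALUE = {ch: i for i, ch in enumerate(LATIN29)}  # character -> digit value
BASE = len(LATIN29)


def encode(number: int) -> str:
    """Encode a non-negative integer as an upper-case LATIN29 string."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    if number == 0:
        return LATIN29[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(LATIN29[remainder])
    return "".join(reversed(chars))


def decode(text: str) -> int:
    """Decode a LATIN29 string; input is treated as case insensitive."""
    if not text:
        raise ValueError("empty identifier")
    number = 0
    for ch in text.upper():
        if ch not in VALUE:
            raise ValueError(f"{ch!r} is not a LATIN29 character")
        number = number * BASE + VALUE[ch]
    return number


identifier = encode(12781089)
print(identifier, decode(identifier), decode(identifier.lower()))
```

Because every vowel is missing from the alphabet, strings such as ABIES or PUMA are rejected by `decode` rather than mapped to some taxon.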
Adding examples to the top list
For the primary objects (taxa, names, references) we should assign stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.
Integers offer a much smaller memory footprint and are useful for keeping all data in memory, e.g. when assembling. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If a context mandates them to be globally unique they should be either prefixed by a col namespace, e.g. in a written publication, or be added to a col base URI/resolver like http://catalogueoflife.org/name/123456. The URI as such should not be used as the id in the strict sense as it would prevent us from changing the URI/resolver/domain over time easily. http://catalogueoflife.org is a rather long domain already.
Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numerical system with a higher base/radix would reduce string length, e.g. hexadecimal, 22 or 26 latin characters plus 10 numbers (i.e. latin 32/36) or BASE64 which is case sensitive and uses + and / to reach 64 unique characters. Examples: