GlobalNamesArchitecture / gnparser

Split scientific names to meaningful elements with meta information
https://parser.globalnames.org/
MIT License

Transliteration of umlauts needed #398

Closed mdoering closed 5 years ago

mdoering commented 6 years ago

According to the codes, not all diacritic marks should simply be removed. Some, most prominently the German umlauts, should be transliterated. See ICNafp Article 60.

For example, the genus Lühea should be spelled Luehea in the canonical name.

mdoering commented 6 years ago

@dimus could you explain to me why parsing Lühea vulgaris yields a quality warning "Non-standard characters in canonical", but Isoëtes vulgaris does not?

wollmers commented 6 years ago

IMHO the warning (or not) should be the same.

Compare

ICZN 32.5.2.1. In the case of a diacritic or other mark, the mark concerned is deleted, except that in a name published before 1985 and based upon a German word, the umlaut sign is deleted from a vowel and the letter "e" is to be inserted after that vowel (if there is any doubt that the name is based upon a German word, it is to be so treated).

dimus commented 6 years ago

The reason ë does not generate a warning is that it is the only diacritic (sadly) not prohibited by the botanical code:

60.6. Diacritical signs are not used in scientific names. When names (either new or old) are drawn from words in which such signs appear, the signs are to be suppressed with the necessary transcription of the letters so modified; for example ä, ö, ü become, respectively, ae, oe, ue; é, è, ê become e; ñ becomes n; ø becomes oe; å becomes ao. The diaeresis, indicating that a vowel is to be pronounced separately from the preceding vowel (as in Cephaëlis, Isoëtes), is a phonetic device that is not considered to alter the spelling; as such, its use is optional. The ligatures -æ- and -œ-, indicating that the letters are pronounced together, are to be replaced by the separate letters -ae- and -oe-.
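To make the quoted article concrete, here is a minimal sketch in Go of the replacements it lists. It is not gnparser code: the table `icnTranscription` and the function `transcribeICN` are hypothetical names, the table covers only lowercase precomposed characters, and dropping the optional diaeresis on ë is just one possible choice.

```go
package main

import (
	"fmt"
	"strings"
)

// icnTranscription holds the replacements named in the quoted Art. 60.6.
// Hypothetical table for illustration; it is not part of gnparser.
var icnTranscription = map[rune]string{
	'ä': "ae", 'ö': "oe", 'ü': "ue",
	'é': "e", 'è': "e", 'ê': "e",
	'ñ': "n", 'ø': "oe", 'å': "ao",
	'æ': "ae", 'œ': "oe",
	// Assumption: the diaeresis is optional under the article,
	// so this sketch simply drops it.
	'ë': "e",
}

// transcribeICN replaces every character found in the table and
// copies all other characters unchanged.
func transcribeICN(name string) string {
	var b strings.Builder
	for _, r := range name {
		if s, ok := icnTranscription[r]; ok {
			b.WriteString(s)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, n := range []string{"Lühea vulgaris", "Isoëtes vulgaris"} {
		fmt.Println(transcribeICN(n))
	}
}
```

With this table, `transcribeICN("Lühea vulgaris")` gives `Luehea vulgaris`. Input written with combining marks (a base letter followed by U+0308) would pass through unchanged, so a real implementation would probably normalize the string first.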

dimus commented 5 years ago

There is no good solution for this, because to parse a particular name correctly the parser would need to know the year the name was created, the code, the origin, etc. So the only way I see is to do it consistently, generating the fewest errors. Therefore in the Go parser I removed ë as a special case. It will break a very few botanical cases, but will fix many botanical and zoological cases.
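A rough sketch of why the name string alone is not enough, based on the two articles quoted in this thread. Everything here is hypothetical (the `nameContext` type, its fields, and `umlautReplacement`); it only illustrates the extra inputs a fully correct treatment would need, and it is not how gnparser works.

```go
package main

import "fmt"

// nameContext collects facts that cannot be read from the name string.
// Hypothetical type for illustration only.
type nameContext struct {
	code        string // "ICN" or "ICZN"
	year        int    // year the name was published
	germanBased bool   // whether the name is based on a German word
}

// umlautReplacement returns what an umlauted vowel should become,
// given the plain vowel ("a", "o" or "u") and the name's context.
func umlautReplacement(vowel string, ctx nameContext) string {
	switch ctx.code {
	case "ICN":
		// Art. 60.6: ä, ö, ü become ae, oe, ue.
		return vowel + "e"
	case "ICZN":
		// Art. 32.5.2.1: insert "e" only for names published
		// before 1985 and based upon a German word.
		if ctx.year < 1985 && ctx.germanBased {
			return vowel + "e"
		}
		return vowel
	}
	// Assumption: with an unknown code, just delete the mark.
	return vowel
}

func main() {
	contexts := []nameContext{
		{code: "ICN", year: 1822},
		{code: "ICZN", year: 1900, germanBased: true},
		{code: "ICZN", year: 1990, germanBased: true},
	}
	for _, ctx := range contexts {
		fmt.Println(ctx.code, ctx.year, umlautReplacement("u", ctx))
	}
}
```

The same ü comes out as `ue` in the first two contexts and as `u` in the third, which is the information a parser that sees only the name string does not have.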

Closing this ticket here, opening https://gitlab.com/gogna/gnparser/issues/48