Open aoern opened 2 years ago
The import data from ReptileDB apparently comes with authorship strings written in upper-case letters; the ReptileDB home pages also show author names in upper case. When these are parsed, the combining diacritical marks seem to be taken as separator characters that split a name into two words, I guess. This leads to strange author names such as CeríAco, Roux-EstéVe, ManaçAs and ŠMíd (instead of Ceríaco, Roux-Estéve, Manaças and Šmíd). In all of these names, the accented letter preceding the letters A, V, A and M is encoded using a combining diacritical mark.
@gdower, can we address this in data harvest?
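A minimal sketch of the suspected mechanism, assuming the guess above is right and the capitalization step treats anything that is not a letter or digit as a word break. The function names `naive_capitalize` and `capitalize_name` are made up for illustration and are not taken from the CoL code. In Python's `re`, `\w` does not match a combining mark, so a decomposed name is split at the mark; normalizing to NFC first avoids that.

```python
import re
import unicodedata


def naive_capitalize(name: str) -> str:
    """Capitalize every run of word characters.

    A combining mark (category Mn) is not matched by \\w, so it ends
    the current run and the next letter starts a new 'word'.
    """
    return re.sub(r"\w+", lambda m: m.group().capitalize(), name)


def capitalize_name(name: str) -> str:
    """Same as above, but compose base letter + combining mark first (NFC)."""
    composed = unicodedata.normalize("NFC", name)
    return re.sub(r"\w+", lambda m: m.group().capitalize(), composed)


# "CERÍACO" with the Í stored as 'I' + U+0301 (combining acute accent)
decomposed = "CERI\u0301ACO"
print(naive_capitalize(decomposed))   # displays as CeríAco
print(capitalize_name(decomposed))    # displays as Ceríaco
```

Hyphenated surnames still work with the second function, because the hyphen is a genuine word break: each part is capitalized separately.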
Hi Ari, these are interesting findings. I'll look into the details later. Dealing with combining Unicode code points could probably be done. They are not really wrong, but where there is a simple precomposed replacement I agree we should use that instead.
When working on the homoglyphs I noticed for the first time the large number of dashes that exist in Unicode. I was told by native British speakers that the en dash, for example, is used a lot and has a different meaning than a plain hyphen. It should definitely stay. Other, invisible characters should already be removed for newly processed data.
Automatically fixing UTF-8 garbage can be done for some cases, but I feel this problem would be better handled by packaging the source data properly.
It also makes a difference whether you encounter these things in names or elsewhere like a bibliographic citation or vernacular name.
Thanks for catching this, @aoern. I'll try to get it fixed in the next ReptileDB update.
To make it clear, this issue is about scientific name and authorship encoding only. As a matter of fact, at least the combining diacritical marks could be safely removed everywhere. But that's another story.
About the hyphen vs. dash issue: Yes, hyphens and dashes have different meanings. In author names, the hyphen is the correct character in combined surnames, such as Herrich-Schäffer. See for example https://blog.inkforall.com/hyphen-vs-dash for the different meanings. So, it is safe to replace en dashes with hyphens in names. This is not a big issue, however. There is only one authorship string containing dashes: Angiostoma lamotheargumedoi Falcón–Ordaz, Mendoza–Garfias, Windfield–Pérez, Parra–Olea & Pérez–Ponce de León, 2008
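As a rough illustration of that replacement, here is a small sketch that swaps en dashes for hyphens in an authorship string only. The function name `hyphenate_authorship` is hypothetical, and only the en dash (U+2013) is handled, since that is the only dash reported above.

```python
EN_DASH = "\u2013"


def hyphenate_authorship(authorship: str) -> str:
    """Replace en dashes with plain hyphens in an authorship string.

    Meant for author names only; en dashes elsewhere (e.g. page ranges
    in citations) carry a different meaning and should be left alone.
    """
    return authorship.replace(EN_DASH, "-")


authorship = ("Falc\u00f3n\u2013Ordaz, Mendoza\u2013Garfias, Windfield\u2013P\u00e9rez, "
              "Parra\u2013Olea & P\u00e9rez\u2013Ponce de Le\u00f3n, 2008")
print(hyphenate_authorship(authorship))
```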
Yes, the current code does most of these very basic character treatments in a very low layer that does not know it is a scientificName. I should probably move some of that to the name parser or even higher logic if needed. That takes some time.
There are a bunch of badly encoded Unicode characters in scientific names and authorship strings in the CoL database:
These 'badly' encoded characters often look like 'well-encoded' ones when displayed, but they are not harmless to applications. Applications that compare names may get confused. As an example, try to find "Nasoona indiana" in the CoL homepage search. It is not found, because the last 'a' is a Cyrillic а in the database. Or try to find the author name Gyllensvärd. You get 3 Gyllensvard's and one Gyllensvärd, but not Calligypona gudruna Gyllensvärd, 1968, because the 'ä' is encoded as 'a' + combining diaeresis. Another side effect is found in ReptileDB authorships. I guess the import data from ReptileDB comes with authorship strings written in upper-case letters; the ReptileDB home pages also show author names in upper case. When these are parsed, the combining diacritical marks seem to be taken as separator characters that split a name into two words. This leads to strange author names such as CeríAco, Roux-EstéVe, ManaçAs and ŠMíd (instead of Ceríaco, Roux-Estéve, Manaças and Šmíd). In all of these names, the accented letter preceding the letters A, V, A and M is encoded using a combining diacritical mark.
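The two search failures above can be reproduced with Python's standard `unicodedata` module. This is only a sketch of one possible check, not how the CoL search works; `same_name` and `non_latin_letters` are illustrative names. The first function compares names after NFC normalization, the second flags letters whose Unicode character name does not start with `LATIN`.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two names after composing combining sequences (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def non_latin_letters(name: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every non-Latin letter."""
    hits = []
    for i, ch in enumerate(unicodedata.normalize("NFC", name)):
        if ch.isalpha():
            uname = unicodedata.name(ch, "UNKNOWN")
            if not uname.startswith("LATIN"):
                hits.append((i, ch, uname))
    return hits


# 'ä' stored as 'a' + U+0308 vs. the precomposed U+00E4
print("Gyllensva\u0308rd" == "Gyllensv\u00e4rd")            # False
print(same_name("Gyllensva\u0308rd", "Gyllensv\u00e4rd"))   # True

# Last letter is U+0430 CYRILLIC SMALL LETTER A
print(non_latin_letters("Nasoona indian\u0430"))
```

A check like the second one is only a heuristic: it would also flag legitimate non-Latin text, so it suits scientific names better than vernacular names or citations.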
@mdoering, could all these harmful encodings be fixed in an early import stage?
I have written a small program unit that fixes all the bad encodings 1 - 3 described above (in the Mar 2022 edition). It also fixes systematic encoding errors in nomen.eumycetozoa.com, WoRMS Bryozoa and Species Fungorum Plus that generate nonsense author names such as Brândza, Müll. and Sánchez (Brândza, Müll., Sánchez). I guess the import data was not in valid UTF-8 format. The program unit is attached here: TaxonNameFixing.txt It is written in Delphi Pascal, but the algorithms used should be fairly easy to convert to any language. I'd be happy if it is of any help.
The second attachment, FixedNames.txt, is generated by applying the first one to the Mar 2022 edition. It lists all the fixes made. It is most readable in a monospace font such as Courier or Lucida Console.
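For readers without Delphi, here is a hedged Python sketch of one common repair for this kind of error. It is not a port of TaxonNameFixing.txt. It assumes the typical case where UTF-8 bytes were wrongly decoded as Windows-1252, so that "â" shows up as "Ã¢"; whether the three sources above were broken in exactly this way is an assumption. The function name `fix_mojibake` is illustrative.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Windows-1252.

    If the text does not round-trip cleanly (for example because it is
    already correct and contains accented letters), it is returned
    unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "Brândza" as it would look after the assumed wrong decoding
broken = "Br\u00c3\u00a2ndza"
print(fix_mojibake(broken))      # Brândza
print(fix_mojibake("Sánchez"))   # already correct, left as it is
```

The round trip is deliberately conservative: a correctly encoded accented name almost never forms a valid UTF-8 byte sequence when re-encoded as Windows-1252, so it falls through the `except` branch untouched.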