CatalogueOfLife / data

Repository for COL content

Invalid characters in scientific names and authorships #419

Open aoern opened 2 years ago

aoern commented 2 years ago

There are a bunch of badly encoded Unicode characters in scientific names and authorship strings in the CoL database:

  1. Characters that do not belong to the Latin script. These Greek and Cyrillic letters were reported earlier in #409.
  2. Unicode combining diacritical marks. These code points offer an alternative (decomposed) way to encode accented characters as two code points: the base character followed by the diacritical mark. For example, the name Döring (6 letters) may be encoded as a 7-character string in which the ö is encoded as o + combining diaeresis ¨.
  3. Invisible Unicode characters that are hints to the displaying application, such as Soft Hyphen or No-Break Space, or characters that look like the intended character, for example En Dash instead of Hyphen.
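The three categories above can be detected mechanically. The following is a minimal Python sketch, not the attached Delphi unit and not CoL's code: the function names `classify_char` and `report` are hypothetical, and the rules (Unicode general categories plus the character name prefix `LATIN`) are a simplifying assumption that will flag a few harmless characters too.

```python
import unicodedata
from typing import List, Optional, Tuple


def classify_char(ch: str) -> Optional[str]:
    """Return a label for a suspicious character, or None if it looks fine.

    Hypothetical helper; the rules are a rough approximation of the
    three categories listed in this issue.
    """
    category = unicodedata.category(ch)
    if category.startswith("M"):
        # Category 2: combining marks such as U+0308 COMBINING DIAERESIS.
        return "combining mark"
    if category == "Cf" or ch == "\u00a0":
        # Category 3: format characters (soft hyphen, zero-width ones)
        # and the no-break space.
        return "invisible/formatting"
    if category == "Pd" and ch != "-":
        # Category 3: dashes that look like a plain hyphen.
        return "dash look-alike"
    if category.startswith("L") and not unicodedata.name(ch, "").startswith("LATIN"):
        # Category 1: letters from other scripts (Greek, Cyrillic, ...).
        return "non-Latin letter"
    return None


def report(text: str) -> List[Tuple[int, str, str]]:
    """List (index, code point, label) for every suspicious character."""
    findings = []
    for index, ch in enumerate(text):
        label = classify_char(ch)
        if label is not None:
            findings.append((index, f"U+{ord(ch):04X}", label))
    return findings


# The last letter below is U+0430 CYRILLIC SMALL LETTER A.
print(report("Nasoona indian\u0430"))
# 'o' followed by U+0308 instead of the single code point U+00F6.
print(report("Do\u0308ring"))
```

A report like this only locates the characters; whether each one should be replaced, removed or kept is a separate decision.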

These 'badly' encoded characters often look like 'well-encoded' ones when displayed, but they are not harmless to applications: anything that compares names may get confused. As an example, try to find "Nasoona indiana" in the CoL homepage search. It is not found, because the last 'a' is a Cyrillic 'a' in the database. Or try to find the author name Gyllensvärd. You get three Gyllensvard's and one Gyllensvärd, but not Calligypona gudruna Gyllensvärd, 1968, because the 'ä' is encoded as 'a' + combining diaeresis.

Another side effect is found in ReptileDB authorships. I guess the import data from ReptileDB comes with authorship strings written in upper-case letters, since the ReptileDB web pages show author names in upper case. When these are parsed, the combining diacritical marks are apparently treated as separator characters that split a name into two words. This leads to strange author names such as CeríAco, Roux-EstéVe, ManaçAs and ŠMíd (instead of Ceríaco, Roux-Estéve, Manaças and Šmíd). In all of these names, the accented letter preceding the letters A, V, A and M is encoded using a combining diacritical mark.
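The Gyllensvärd search failure can be reproduced with plain string comparison. The sketch below uses Python's standard `unicodedata.normalize` with form NFC as one possible way to make the two encodings compare equal; the helper name `canonical` is hypothetical, and nothing here claims to be how the CoL search is implemented.

```python
import unicodedata


def canonical(name: str) -> str:
    """Compose base letter + combining mark into one code point where possible."""
    return unicodedata.normalize("NFC", name)


decomposed = "Gyllensva\u0308rd"   # 'a' + U+0308, 12 code points
precomposed = "Gyllensv\u00e4rd"   # single U+00E4, 11 code points

print(decomposed == precomposed)                        # False
print(canonical(decomposed) == canonical(precomposed))  # True

# NFC does not help with look-alike letters from another script:
cyrillic = "Nasoona indian\u0430"  # ends in U+0430 CYRILLIC SMALL LETTER A
latin = "Nasoona indiana"
print(canonical(cyrillic) == canonical(latin))          # False
```

Normalizing both the stored names and the query would fix the combining-mark case, while the Cyrillic case still needs a separate homoglyph replacement.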

@mdoering, could all these harmful encodings be fixed in an early import stage?

I have written a small program unit that fixes all of the bad encodings 1-3 described above (in the Mar 2022 edition). It also fixes systematic encoding errors in nomen.eumycetozoa.com, WoRMS Bryozoa and Species Fungorum Plus that generate nonsense author names such as Brândza, Müll. and Sánchez (Brândza, Müll., Sánchez). I guess the import data was not in valid UTF-8 format. The program unit is attached here: TaxonNameFixing.txt. It is written in Delphi Pascal, but the algorithms used should be fairly easy to port to any language. I'd be happy if it is of any help.
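The issue does not say which mis-encoding produced the nonsense author names. As an illustration only, the Python sketch below assumes the most common case: UTF-8 bytes that were decoded as Windows-1252 somewhere upstream. The function name `repair_mojibake` is hypothetical and this is not a port of TaxonNameFixing.txt.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 round trip, if that is what happened.

    Assumption: the damage is exactly one wrong decode step. Strings that
    cannot be reinterpreted as valid UTF-8 are returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct names such as "Müll." end up here and are kept.
        return text


# Simulate the assumed damage, then repair it.
for good in ["Br\u00e2ndza", "M\u00fcll.", "S\u00e1nchez"]:
    damaged = good.encode("utf-8").decode("cp1252")
    print(repr(damaged), "->", repr(repair_mojibake(damaged)))
```

Because correctly encoded accented names fail the strict UTF-8 decode, they pass through untouched, which makes the function reasonably safe to apply to a whole column. A string that is valid under both readings would still be changed, so the results deserve a review.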

The second attachment, FixedNames.txt, was generated by applying the program to the Mar 2022 edition. It lists all the fixes made. It is most readable in a monospace font such as Courier or Lucida Console.

yroskov commented 2 years ago

the import data from ReptileDB comes with authorship strings written upper case letters, the home pages of ReptileDB show author names in upper case letters. When these are parsed, the combining diacritical marks are taken as separator characters which divide the names into two words, I guess. This leads to strange author names such as CeríAco, Roux-EstéVe, ManaçAs and ŠMíd (instead of Ceríaco, Roux-Estéve, Manaças and Šmíd). In all of these names, the diacritic letter preceding the letters A, V, A and M is encoded using a combining diacritical mark.

@gdower, can we address this in data harvest?
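The CeríAco effect is easy to reproduce outside CoL. The sketch below is only an analogy in Python, not the actual ReptileDB import code: `str.title()` starts a new word after any character that is not a cased letter, and a combining mark is such a character. Normalizing to NFC first avoids the problem. The helper name `to_author_case` is hypothetical.

```python
import unicodedata


def to_author_case(upper_name: str) -> str:
    """Title-case an upper-case author name after composing combining marks."""
    return unicodedata.normalize("NFC", upper_name).title()


raw = "CERI\u0301ACO"        # 'I' followed by U+0301 COMBINING ACUTE ACCENT

# Naive conversion: the combining mark ends the "word", so 'A' stays capital.
print(raw.title())           # Ceri + U+0301 + Aco, displayed as CeríAco

# Compose first, then convert.
print(to_author_case(raw))   # Ceríaco
print(to_author_case("ROUX-ESTE\u0301VE"))
print(to_author_case("MANAC\u0327AS"))
```

The same ordering (normalize first, change case second) should carry over to whatever language the harvest script uses.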

mdoering commented 2 years ago

Hi Ari, these are interesting findings. I'll look into the details later. Dealing with combining Unicode code points could probably be done. They are not really wrong, but when there is a simple replacement I agree we should use that instead.

When working on the homoglyphs I noticed for the first time the large number of dashes that exist in Unicode. I was told by native British English speakers that the en dash, for example, is used a lot and has a different meaning than a plain hyphen. It should definitely stay. Invisible characters, on the other hand, should already be removed for newly processed data.
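A low-level cleanup along these lines (drop invisible characters, leave dashes alone) might look like the following Python sketch. It is an assumption about the behaviour described, not the existing CoL code, and the function name `strip_invisible` is hypothetical.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Remove format characters and turn no-break spaces into plain spaces.

    Dashes are deliberately left untouched.
    """
    cleaned = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Soft hyphen, zero-width space/joiner, byte order mark, ...
            continue
        cleaned.append(" " if ch == "\u00a0" else ch)
    return "".join(cleaned)


print(strip_invisible("Calli\u00adgypona\u00a0gudruna"))   # Calligypona gudruna
print(strip_invisible("1900\u20131910") == "1900\u20131910")  # True, en dash kept
```

Removing every format character is reasonable for Latin-script scientific names, but some scripts rely on zero-width joiners, so the same rule may be too aggressive for vernacular names.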

Automatically fixing UTF-8 garbage can be done for some cases, but I feel this problem is better handled by packaging the source data properly.

It also makes a difference whether you encounter these things in names or elsewhere like a bibliographic citation or vernacular name.

gdower commented 2 years ago

Thanks for catching this, @aoern. I'll try to get it fixed in the next ReptileDB update.

aoern commented 2 years ago

To make it clear, this issue is about scientific name and authorship encoding only. As a matter of fact, at least the combining diacritical marks could be safely removed everywhere. But that's another story.

About the hyphen vs. dash issue: Yes, hyphens and dashes have different meanings. In author names, the hyphen is the correct character in compound surnames such as Herrich-Schäffer. See for example https://blog.inkforall.com/hyphen-vs-dash for the different meanings. So, it is safe to replace en dashes with hyphens in names. This is not a big issue, however. There is only one authorship string containing dashes: Angiostoma lamotheargumedoi Falcón–Ordaz, Mendoza–Garfias, Windfield–Pérez, Parra–Olea & Pérez–Ponce de León, 2008
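For authorship strings only, the replacement could be as small as the Python sketch below. It treats every character in Unicode category Pd (dash punctuation) as a hyphen, which is an assumption that goes slightly beyond the en dash discussed here; the function name `hyphenate_authorship` is hypothetical.

```python
import unicodedata


def hyphenate_authorship(authorship: str) -> str:
    """Replace any dash punctuation in an authorship string with '-'."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in authorship
    )


# U+2013 EN DASH between the two parts of each surname.
raw = "Falc\u00f3n\u2013Ordaz, Mendoza\u2013Garfias"
print(hyphenate_authorship(raw))  # Falcón-Ordaz, Mendoza-Garfias
```

Applying this to bibliographic citations would be wrong, since page and year ranges legitimately use the en dash, so the call belongs in name-specific code rather than in a generic text cleanup.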

mdoering commented 2 years ago

Yes, the current code does most of these very basic character treatments in a very low layer that does not know it is a scientificName. I should probably move some of that to the name parser or even higher logic if needed. That takes some time.