CatalogueOfLife / checklistbank

UI for checklistbank.org
https://www.checklistbank.org/

Diff tool - Unlikely name matches #1332

Closed: camiplata closed this issue 7 months ago

camiplata commented 7 months ago

Some examples of incorrect name matches between two sources:

  1. In the first blue box, Hesionoidea was matched to Iphionidae, although both datasets contain the name Iphionidae.
  2. In the second blue box, two completely different names are matched even though they do not share most of their characters, and there was a better match for Sigalionidae.

Screenshot 2023-12-01 at 10:53:48 a.m.

Link to the diff tool

  1. The following image shows two unlikely matches, and
  2. a name that had a better match but was paired with a less likely name.

Screenshot 2023-12-01 at 11:07:34 a.m. (3)

Link to the diff tool

mdoering commented 7 months ago

These are not matches - they simply show where things have changed. It is not always easy to understand these diffs. Don't read any taxonomic meaning into them; they simply show where the files differ. It quite often happens that one name was removed and another name added, and these two are shown as "pairs" when really there is no pairing - they just happen to be in the same place when sorted alphabetically!

mdoering commented 7 months ago

Also, this is based on the regular Unix diff software, and we do not have any influence on its performance.
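To illustrate how a line-based diff ends up showing unrelated names side by side, here is a minimal sketch. It uses Python's `difflib` as a stand-in for Unix diff (the two use different algorithms, and this is not ChecklistBank's code), and the name lists and the helper `replaced_pairs` are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def replaced_pairs(old_lines, new_lines):
    """Return the (old, new) rows a side-by-side view would draw for changed lines."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Lines are paired purely by position inside the changed block;
            # zip stops at the shorter side, so surplus lines are ignored here.
            pairs.extend(zip(old_lines[i1:i2], new_lines[j1:j2]))
    return pairs


# Hypothetical, alphabetically sorted name lists (not the datasets from this issue).
old_names = ["Aphroditidae", "Hesionoidea", "Polynoidae"]
new_names = ["Aphroditidae", "Iphionidae", "Polynoidae"]

for old, new in replaced_pairs(old_names, new_names):
    print(f"- {old}    + {new}")
```

In this sketch, Hesionoidea and Iphionidae land on the same row only because one line was removed and another added at the same position between two unchanged lines. No name similarity is involved.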

camiplata commented 7 months ago

I do think there is room for improvement of the tool and the documentation about it.

If changes are shown as pairs like this:

Screenshot 2023-12-01 at 1:35:52 p.m.

It immediately gives the user the idea that this is how the diff should be read for all names. If there is no pairing, the name could be shown alone, as the tool already does here:

Screenshot 2023-12-01 at 1:37:58 p.m.

On the other hand, if the tool shows differences between datasets, I would expect it to retrieve better pairs. For example, the names

here with Iphionidae

Screenshot 2023-12-01 at 1:39:32 p.m.

and here with Annelida

Screenshot 2023-12-01 at 1:39:42 p.m.

should have been paired.

This is probably not a high-priority issue, but if this is a tool we are going to offer to the wider public, we will need to make it better eventually, and questions/issues like the one I'm raising will come up again from users.

mdoering commented 7 months ago

Agree it is unfortunate, but there is no way to fix these problems. The only thing we can do is better document what the tool does.