CatalogueOfLife / data

Repository for COL content

Overlapping data sources in Fabaceae #597

Open aoern opened 9 months ago

aoern commented 9 months ago

All the genera from source WWW (Acacia, Acaciella, Faidherbia, Mariosousa, Parasenegalia, Pseudosenegalia, Senegalia and Vachellia) are also imported from source WCVP-Fabaceae.

yroskov commented 9 months ago

Thank you Ari!

That is weird. All 8 genera have been blocked in WCVP: https://github.com/CatalogueOfLife/testing/issues/202#issuecomment-1338259375 https://github.com/CatalogueOfLife/testing/issues/213#issuecomment-1505730574

@mdoering, there are problems with nested sectors in CLB again. Could you please do something about this?

mdoering commented 9 months ago

It looks to me more like a "WCVP identifiers have changed" issue. All decisions are broken and thus haven't been applied.

yroskov commented 9 months ago

It looks to me more like a "WCVP identifiers have changed" issue.

As I see it, the use of GSD identifiers cannot guarantee the stability of decisions (and other operations?) in CLB. What can be done to avoid this? @aoern, perhaps you have advice on how to manage data and re-use editorial decisions ("block taxon", "block name", "change status of the name", etc.)?

yroskov commented 9 months ago

I have blocked the 8 WWW genera again in WCVP-Fabaceae and re-synced the dataset (https://github.com/CatalogueOfLife/testing/issues/202#issuecomment-1833986178).

But this is what I got in the Project-Assembly search after the sync and page reload completed (I searched CoL for Acacia; there were still 2 entries for the genus, and both choices returned "No classification found for Taxon ID"): https://www.checklistbank.org/catalogue/3/assembly?assemblyTaxonKey=NTFRBGCcJ8cD-KV6HfsSZ0


yroskov commented 9 months ago

OK. After some time, search in the Assembly is working again. Each of these genera now appears only once in the CoL. I hope the issue is FIXED now.

mdoering commented 9 months ago

Editorial decisions, sectors and species estimates all work the same way. They need up-to-date source identifiers to be linked correctly. But because these ids can change, we also store "metadata" about the source name, i.e. the name itself, the authorship, the rank and potentially more, like the status and direct parent, to disambiguate duplicates. When a source changes we need to make sure the identifiers of all these decisions, sectors and estimates are updated, which is what we call rematching. That can run automatically if configured, or it has to be done manually, which I understand is your preference. Any decision, sector or estimate listed as broken by the system cannot be used and is ignored. As we know WCVP has changed identifiers in the past, watching out for this is important. You can check all broken decisions and sectors in the project, and we should make sure there are none left before we do a release. Maybe sth to add to the task board?
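To make the rematching idea concrete, here is a minimal Python sketch. It is not ChecklistBank's actual code: the `SourceName` and `Decision` classes, the `rematch` function and the identifiers are all hypothetical, and it assumes that name, authorship and rank are enough to find a unique match. A decision whose stored identifier no longer resolves is re-linked by its stored name metadata, and it is marked broken when there is no unique candidate.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SourceName:
    """One record of the current source dataset (hypothetical shape)."""
    id: str
    name: str
    authorship: str
    rank: str


@dataclass
class Decision:
    """An editorial decision plus the name metadata stored alongside it."""
    subject_id: str   # source identifier recorded when the decision was made
    name: str
    authorship: str
    rank: str
    mode: str         # e.g. "block"
    broken: bool = False


def rematch(decisions: List[Decision], source: List[SourceName]) -> List[Decision]:
    """Re-link decisions to the current source; mark unmatched ones as broken."""
    by_id: Dict[str, SourceName] = {s.id: s for s in source}
    by_meta: Dict[Tuple[str, str, str], List[SourceName]] = {}
    for s in source:
        by_meta.setdefault((s.name, s.authorship, s.rank), []).append(s)

    for d in decisions:
        current = by_id.get(d.subject_id)
        if current is not None and (current.name, current.rank) == (d.name, d.rank):
            d.broken = False  # identifier still points at the same name
            continue
        candidates = by_meta.get((d.name, d.authorship, d.rank), [])
        if len(candidates) == 1:
            d.subject_id = candidates[0].id  # identifier changed, metadata matched
            d.broken = False
        else:
            d.broken = True  # no match or ambiguous duplicates: needs manual review
    return decisions


# Illustrative data only; the identifiers are made up.
source = [
    SourceName("new-1", "Acacia", "Mill.", "genus"),
    SourceName("new-2", "Vachellia", "Wight & Arn.", "genus"),
]
decisions = [
    Decision("old-1", "Acacia", "Mill.", "genus", "block"),
    Decision("old-9", "Senegalia", "Raf.", "genus", "block"),
]
rematch(decisions, source)
```

In this example the Acacia decision is re-linked to the new identifier, while the Senegalia decision finds no candidate and stays broken, which is the state that should be cleared before a release.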

yroskov commented 9 months ago

Maybe sth to add to the task board?

Maybe.

It looks like unstable identifiers are also common for Bryonames: https://www.checklistbank.org/catalogue/3/sector?limit=100&offset=0&subjectDatasetKey=170394. Sectors break with each new update of Bryonames in CLB.

mdoering commented 9 months ago

We should notify the owners of such datasets; maybe they are simply not aware that this is problematic. And we should flag that ID behaviour in the dataset metadata. As soon as we discover that a large part of a dataset's identifiers has changed, it could be flagged automatically.
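One way such an automatic flag could work is sketched below in Python, purely as an illustration. It assumes two versions of a dataset are available as mappings from a name key to its source identifier; the function names, the key format and the 0.5 threshold are all hypothetical and not existing CLB behaviour.

```python
from typing import Mapping


def id_churn(old: Mapping[str, str], new: Mapping[str, str]) -> float:
    """Fraction of names present in both versions whose identifier changed.

    Keys are name keys (e.g. "Acacia|Mill.|genus"), values are source ids.
    """
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0
    changed = sum(1 for key in shared if old[key] != new[key])
    return changed / len(shared)


def has_unstable_ids(old: Mapping[str, str], new: Mapping[str, str],
                     threshold: float = 0.5) -> bool:
    """Flag the dataset when the churn reaches the (assumed) threshold."""
    return id_churn(old, new) >= threshold


# Illustrative versions of a tiny dataset; ids are made up.
previous = {"Acacia|genus": "1", "Vachellia|genus": "2",
            "Senegalia|genus": "3", "Faidherbia|genus": "4"}
latest = {"Acacia|genus": "1", "Vachellia|genus": "20",
          "Senegalia|genus": "30", "Faidherbia|genus": "40"}
churn = id_churn(previous, latest)
flagged = has_unstable_ids(previous, latest)
```

Only names present in both versions are compared, so genuine additions and removals between releases do not count as identifier changes.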

We have one open issue that would help to stabilise some genus names at least: https://github.com/CatalogueOfLife/backend/issues/1189

It would not help with Bryonames and WCVP, though, which provide their own identifiers for genera.