CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Version dataset metadata independent of imports #1358

Open yroskov opened 1 month ago

yroskov commented 1 month ago

Describe the bug

The September 2024 CoL release contains wrong metadata for World Plants and World Ferns.

The real version of both GSDs is 19.4, Jun 2024 / 2024-06-30. (Indeed, new data versions were imported into CLB in September 2024, but I did not sync them into the September CoL!) However, the incorrect version (24.9, Sep 2024) is shown in the GSD metadata in the September release:

image

https://www.catalogueoflife.org/data/dataset/1140

https://www.catalogueoflife.org/data/dataset/1141

yroskov commented 1 month ago

@thomasstjerne & @mdoering, could you please fix this long-standing bug? GSD metadata in the CoL should reflect the version that was synced into the project, not the version currently imported into CLB.

(just in case, WFerns GSD was synced 2024-07-09; WPlants - 2024-07-08)

mdoering commented 1 month ago

The September edition was released on 2024-09-25.

Ferns were last imported 18th September and in July before that:

image

The fern sectors were last synced on the 30th of September: https://api.checklistbank.org/dataset/3/sector/sync?datasetKey=1140

datasetAttempt: 66 # this is the version of the dataset import: https://api.checklistbank.org/dataset/1140/66.json

Before that, they were synced on the 9th of July, with datasetAttempt: 65 https://api.checklistbank.org/dataset/1140/65.json

The metadata for import 65 indeed looks odd:

"attempt":65,
"issued":"2024-09-18",
"version":"24.9, Sep 2024",
"created":"2024-09-18T14:01:09.739245",
"imported":"2024-07-08T14:10:37.195155"
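The oddity in the fields above can be checked mechanically. The following is a minimal sketch, not part of the ChecklistBank code base: the function name `metadata_postdates_import` is hypothetical, and the rule it applies (metadata created or issued after the import it is attached to is suspicious) is an assumed heuristic based only on the five fields quoted above.

```python
from datetime import date, datetime


def metadata_postdates_import(meta: dict) -> bool:
    """Return True if the metadata was created or issued after the import date.

    Hypothetical heuristic: archived metadata for an import attempt should
    not be newer than the import itself.
    """
    imported = datetime.fromisoformat(meta["imported"]).date()
    created = datetime.fromisoformat(meta["created"]).date()
    issued = date.fromisoformat(meta["issued"])
    return created > imported or issued > imported


# Fields copied from the archived metadata of import attempt 65 quoted above.
attempt_65 = {
    "attempt": 65,
    "issued": "2024-09-18",
    "version": "24.9, Sep 2024",
    "created": "2024-09-18T14:01:09.739245",
    "imported": "2024-07-08T14:10:37.195155",
}

print(metadata_postdates_import(attempt_65))  # True
```

For attempt 65 the check flags the record, because both `issued` and `created` fall in September while `imported` is in July.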

@yroskov this problem was never mentioned to me before and I am very surprised to see this now. It has been working for more than 2 years now.

yroskov commented 1 month ago

The fern sectors were synced last on the 30th September

Yes, that is my work from today for the October CoL.

this problem was never mentioned to me before and I am very surprised to see this now. It has been working for more than 2 years now.

I raised this many times during our stand-ups... (especially in relation to IRMNG)

mdoering commented 1 month ago

I believe I know what's going on. If you download the latest archives, they all lack metadata! That must be linked to the wrong archival of metadata versions. I will look into this more tomorrow.

mdoering commented 1 month ago

I raised this many times during our stand-ups... (especially in relation to IRMNG)

Can you point me to an old issue please?

mdoering commented 1 month ago

Dataset metadata is only archived during imports, i.e. when no metadata is included in the archive there won't be any archival. And as the dataset metadata version is tied to the import attempt, it requires considerable refactoring to change that. The idea was that we do not want to archive every manual edit that is being done on a dataset, but instead allow manual changes via the UI or API to happen and only write a final version to the archive when a new one, through an import, shows up.

It seems we now rather need an independent metadata versioning system that has its own version number and will be triggered to archive a version when:

Every import and sync would then refer to a specific metadata version which can be retrieved from the archive.
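As a rough illustration of that idea, here is a toy in-memory sketch. It is not the ChecklistBank implementation: the class `MetadataArchive`, its methods, and the sync id `1` are all hypothetical. The list of archival triggers is not given above, so the sketch simply assumes that a new version is archived whenever the metadata passed in differs from the latest archived version.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataArchive:
    """Toy store that versions dataset metadata independently of import attempts."""

    versions: list = field(default_factory=list)    # versions[i] holds metadata version i + 1
    references: dict = field(default_factory=dict)  # (event kind, event id) -> metadata version

    def archive(self, metadata: dict) -> int:
        """Archive the metadata if it differs from the latest version (assumed trigger)."""
        if not self.versions or self.versions[-1] != metadata:
            self.versions.append(dict(metadata))
        return len(self.versions)

    def record(self, kind: str, event_id: int, metadata: dict) -> int:
        """Link an import or sync to the metadata version current at that moment."""
        version = self.archive(metadata)
        self.references[(kind, event_id)] = version
        return version

    def metadata_for(self, kind: str, event_id: int) -> dict:
        """Retrieve the archived metadata that an import or sync referred to."""
        return self.versions[self.references[(kind, event_id)] - 1]


archive = MetadataArchive()
archive.record("import", 65, {"version": "19.4, Jun 2024"})  # July import
archive.record("sync", 1, {"version": "19.4, Jun 2024"})     # July sync, hypothetical id
archive.record("import", 66, {"version": "24.9, Sep 2024"})  # September import

print(archive.metadata_for("sync", 1)["version"])  # 19.4, Jun 2024
```

In this model the July sync keeps pointing at metadata version 1 even after the September import archives version 2, which is the behaviour asked for at the top of this issue.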

mdoering commented 1 month ago

@yroskov @gdower a quick fix from my side is not possible; this will take longer. Maybe we can add metadata.yaml files to these sources?

yroskov commented 1 month ago

Maybe we can add metadata.yaml files to these sources?

Unfortunately, this can happen with any source. For example, quite often we get a notification about a new ITIS and do an import a few days before the release, without including that update in the release.

...and this happens to almost all GSDs imported by "third parties" outside our control, e.g. WCVP, WFO, Bryonames, all Lepidoptera, etc.

mdoering commented 1 month ago

But datasets with metadata in their imports are versioned fine; they are not a problem!