gbif / ingestion-management

Tracking of data issues seen during data ingestion processes
Apache License 2.0

Identifiers validation failed for dataset The vascular plants of south-east Greenland 60°04' to 64°30' N.Lat. 1933 #984

Open gbif-pipelines opened 9 months ago

gbif-pipelines commented 9 months ago

Identifier validation failed for the dataset The vascular plants of south-east Greenland 60°04' to 64°30' N.Lat. 1933:

New IDs sample:

1592
1591
1317
1590
1330
1107
1106
1105
1327
1326
Old IDs sample:

0001
0002
0003
0004
0005
0006
0007
0008
0009
0010
Publisher email

Hello,

I am contacting you from the GBIF Secretariat about the dataset [The vascular plants of south-east Greenland 60°04' to 64°30' N.Lat. 1933](https://registry.gbif.org/dataset/41ff9b4b-190f-4dee-bf88-185d425ac0f5): https://doi.org/10.15468/v4e6gm.

We noticed that the occurrenceIDs were changed. We have temporarily paused the ingestion of this dataset.

As you might already know, when an occurrence record has a new occurrenceID for a given dataset, our system considers it to be a new occurrence. This means that it will be given a new gbifid and a new occurrence URL (like this one: https://www.gbif.org/occurrence/1252968762), and the old gbifid and URL will be deprecated. In this case, the existing occurrence URLs would be deprecated when ingesting the newest version of this dataset.

We would like to check with you whether those changes were intentional. Do you know if this is the case? Please let us know, thanks! We are happy to resume the dataset ingestion.

Note that some users rely on those occurrence URLs and gbifids (https://bionomia.net, for example). In an attempt to improve the stability of the occurrence URLs and gbifids, we have implemented a warning system to detect these types of changes in datasets (see this news item).

If the data publisher can provide us with a list of old and new occurrenceIDs per record, we can avoid the identifier and URL changes. Could that be an option?

Please let us know if you have any questions. Thanks!

All the best,

You can skip/fix identifier validation using the registry UI.
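
The email above asks for a list of old and new occurrenceIDs per record. A minimal sketch of how such a list could be assembled, assuming both versions of the data can be matched on some stable field. The `catalogNumber` key, the function names, and the CSV column headers are hypothetical and are not a format prescribed anywhere in this thread:

```python
import csv
import io


def build_id_mapping(old_records, new_records, key="catalogNumber"):
    """Pair old and new occurrenceIDs for records that share the same key.

    `key` is a hypothetical stable field used only to line records up.
    Only records whose occurrenceID actually changed are returned.
    """
    new_by_key = {rec[key]: rec["occurrenceID"] for rec in new_records}
    mapping = []
    for rec in old_records:
        new_id = new_by_key.get(rec[key])
        if new_id is not None and new_id != rec["occurrenceID"]:
            mapping.append((rec["occurrenceID"], new_id))
    return mapping


def mapping_to_csv(mapping):
    """Serialize (old, new) pairs as a two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["old_occurrenceID", "new_occurrenceID"])
    writer.writerows(mapping)
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative records only; the key values are made up.
    old = [
        {"catalogNumber": "A", "occurrenceID": "0001"},
        {"catalogNumber": "B", "occurrenceID": "0002"},
    ]
    new = [
        {"catalogNumber": "A", "occurrenceID": "1"},
        {"catalogNumber": "B", "occurrenceID": "0002"},
    ]
    print(mapping_to_csv(build_id_mapping(old, new)))
```

Identifiers are kept as strings throughout, since converting them to numbers would discard any leading zeros.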

ManonGros commented 9 months ago

@DanBIF the ingestion was paused for this dataset: The vascular plants of south-east Greenland 60°04' to 64°30' N.Lat. 1933 (on the DANBIF IPT: https://danbif.au.dk/ipt/resource?r=vascular_plants_of_south_east_greenland_1933) because 999 occurrences out of the 1,826 had new occurrenceIDs.

Do you know if this change was made on purpose? Do you think we should resume the dataset's ingestion? Let us know. Thanks!
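
As an illustration of the kind of check described here, the sketch below counts how many identifiers in a new version of a dataset were not present in the previous one. It is only an assumption about how such a comparison could be done, not GBIF's actual pipeline code, and `changed_id_share` is a hypothetical helper name. The example uses the old and new ID samples from the first comment, which are samples rather than the full dataset:

```python
def changed_id_share(old_ids, new_ids):
    """Return (count of unseen IDs, total new IDs, share of unseen IDs).

    Identifiers are compared as exact strings, so "0001" and "1" differ.
    """
    old_set = set(old_ids)
    unseen = [i for i in new_ids if i not in old_set]
    total = len(new_ids)
    share = len(unseen) / total if total else 0.0
    return len(unseen), total, share


if __name__ == "__main__":
    old_sample = ["0001", "0002", "0003", "0004", "0005",
                  "0006", "0007", "0008", "0009", "0010"]
    new_sample = ["1592", "1591", "1317", "1590", "1330",
                  "1107", "1106", "1105", "1327", "1326"]
    print(changed_id_share(old_sample, new_sample))

    # Share reported in this thread: 999 of 1,826 occurrences.
    print(round(999 / 1826, 3))
```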

DanBIF commented 9 months ago

Dear Marie,

Thank you. The data originator in Greenland, Ida Bomholt Dyrholm Jacobsen, has registration rights on our IPT (after I trained her a while ago), so I was not aware of this. I think it best to contact her directly. Should I do this, or will you?

Kind regards,
Isabel

Isabel Calabuig M.Sc., Ph.D. biologist, Node manager, Data curator Natural History Museum of Denmark DIR +45 35321103 / MOB +45 22136624

From: Marie Grosjean
Sent: 2 January 2024 15:21
To: gbif/ingestion-management
Cc: Isabel Calabuig; Mention
Subject: Re: [gbif/ingestion-management] Identifiers validation failed for dataset The vascular plants of south-east Greenland 60°04' to 64°30' N.Lat. 1933 (Issue #984)


ManonGros commented 9 months ago

Thank you @DanBIF. It looks like Ida (@IBDJ) already noticed that the dataset wasn't being ingested.

Ida, our system pauses datasets when it detects a change in the values of occurrenceIDs. In this case, it looks like 999 occurrences out of the 1,826 had new occurrenceIDs. When an occurrence has a new identifier, we create a new entry with a new URL for it (and delete the old one). We pause ingestion in order to avoid accidental occurrenceID changes, which could lead to the deletion of many existing occurrences. You can learn more about this initiative by reading this blogpost or watching this video.

Here is what can be done:

I hope this helps. Let us know if you have any question or if anything remains unclear. Thanks!

IBDJ commented 8 months ago

Hi @ManonGros, thank you for reaching out about this. I am sorry that I have not replied. I will have to check my notifications from GitHub.

I was indeed aware that it wasn't ingesting. However, I will have to look into checking the occurrenceIDs. Since this is an issue I have experienced before, I was quite careful to keep the occurrenceIDs the same. And from the first example with the old and new occurrenceIDs, this is still the same format.

IBDJ commented 8 months ago

Hi again @ManonGros, I did discover that it must have been a formatting issue: I was using numbers in a four-digit format (e.g. 0001) as occurrenceIDs, but I didn't pay attention to the formatting, so they just became plain numbers. I just tried fixing it myself, but it seems that the old resource can't be archived while this occurrence issue is there.

If I were to roll back the changes ("and start over"), where would I do that?
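
The formatting issue described above (four-digit identifiers such as 0001 turning into plain numbers) can be reversed by left-padding the numeric values with zeros again. A minimal sketch, assuming every affected identifier is purely numeric and the original width was four digits; `restore_padding` is a hypothetical helper name, and the assumption that the IDs run consecutively from 0001 to 1826 is made here only for illustration and is not confirmed in the thread:

```python
def restore_padding(occurrence_id, width=4):
    """Left-pad a purely numeric identifier with zeros to `width` digits.

    Non-numeric identifiers are returned unchanged (after trimming spaces).
    """
    text = str(occurrence_id).strip()
    return text.zfill(width) if text.isdigit() else text


def count_changed_by_stripping(padded_ids):
    """Count identifiers that change when leading zeros are dropped."""
    return sum(1 for pid in padded_ids if str(int(pid)) != pid)


if __name__ == "__main__":
    print(restore_padding("1"))     # 0001
    print(restore_padding(1592))    # 1592

    # Assumed, unconfirmed ID range: 0001 through 1826.
    assumed_ids = [f"{n:04d}" for n in range(1, 1827)]
    print(count_changed_by_stripping(assumed_ids))  # 999
```

Under that assumed range, only the identifiers below 1000 lose their leading zeros, which gives 999 changed identifiers out of 1,826, the same count reported earlier in this thread. In a spreadsheet, storing the column as text rather than as a number is the usual way to keep such zeros from being dropped.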

ManonGros commented 8 months ago

Thanks for checking @IBDJ ! About what you write:

it seems that the old resource can't be archived when this occurrence issue is there

Did you get any message that made you think so?

Normally, if you update the data to use the same identifiers as before and click on the "publish" button, the dataset should be ingested without issues. This is what I meant by writing "roll back changes". If you did that and encountered any error message, could you let me know which ones? Thanks!

Note that I can also ingest the dataset with the new identifiers. It isn't ideal but it is easy and won't cause more issues.

IBDJ commented 8 months ago

Hi @ManonGros, yes, I got an error message saying that the resource couldn't be archived. I will do it again and get you the exact error message. Thank you for following up.

DanBIF commented 8 months ago

Hi, on my side there was a system update over the weekend, and I have now also updated the occurrence extension. Might that have been the cause of the trouble?

ManonGros commented 4 months ago

Hi @IBDJ @DanBIF, I can see that this dataset's ingestion is still paused: https://www.gbif.org/dataset/41ff9b4b-190f-4dee-bf88-185d425ac0f5 Did you have a chance to try republishing the data with the old identifiers? I can also resume the dataset ingestion with the new identifiers if this makes things easier. Just let me know, thanks!

DanBIF commented 4 months ago

Thanks @ManonGros I will let @IBDJ respond as I have not much knowledge on the state of this dataset ;-)