Open gbif-portal opened 2 years ago
I do not know if this has been reported before, so sorry if you are already aware. I just want to say that the same problem happens with occurrence records: the validator does not report a problem when there are duplicate occurrenceIDs. Here is the link to the validation: https://www.gbif.org/tools/data-validator/dbd36711-0d0f-4259-af3f-84b196811135 The data file is UNAL_UNABx3_20220223_Revision.xlsx
@EstebanMH-SiB I think it is possible to highlight that there are duplicates as a warning, but we don't decline data because it has some duplicated identifiers
You can see the numbers in the validation:
Number of lines: 1,340
Number of records indexed: 60
Here "Number of lines" is the actual number of lines in the file, and "Number of records indexed" is the number of records you will see on the portal.
I created an issue here: https://github.com/gbif/pipelines/issues/679
As a real example, eBird for 2020 had:
Records in the file: 741,037,572
Records in the index: 705,008,469
So ~36 million records in the file were duplicates.
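To make the arithmetic above concrete, here is a minimal sketch of how the two numbers relate. It assumes that lines sharing an identifier collapse into a single indexed record. The function name `index_summary` and the sample identifiers are hypothetical; this is not the pipelines code.

```python
from collections import Counter

def index_summary(identifiers):
    """Compare the number of lines read with the number of distinct identifiers."""
    counts = Counter(identifiers)
    lines = sum(counts.values())   # every line in the file
    distinct = len(counts)         # assumption: one indexed record per distinct identifier
    return {"lines": lines, "distinct": distinct, "surplus": lines - distinct}

# Hypothetical sample: five lines, three distinct occurrenceIDs.
sample = ["occ-1", "occ-2", "occ-2", "occ-3", "occ-3"]
print(index_summary(sample))  # {'lines': 5, 'distinct': 3, 'surplus': 2}
```

Applied to the eBird figures, 741,037,572 - 705,008,469 = 36,029,103 surplus lines.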
Thanks for the fast answer @muttcg. We agree with the warning, that will be useful.
But we are a bit confused about the topic of duplicated identifiers. As far as we know, it is not possible to publish a dataset with duplicated identifiers, at least not through the IPT (version 2.5.4), because it will not let you. Maybe there is a way to do it directly, but that is not something we can easily do.
We think the best option is for the validator to show that GBIF will not index a dataset with duplicated identifiers, because the current result can be confusing for publishers who go through the process alone.
For example, the validator says the dataset can be indexed, but the publisher will not be able to publish it through the IPT because some occurrenceIDs are duplicated; the validator neither says so nor shows the offending occurrenceIDs.
We look forward to your answer,
> For example, the validator says the dataset can be indexed, but the publisher will not be able to publish it through the IPT because some occurrenceIDs are duplicated; the validator neither says so nor shows the offending occurrenceIDs.
Agree! There seems to be a discrepancy between what GBIF allows and what the IPT allows, which is confusing. Since the IPT is the most common way for data to be shared with GBIF, the validator should match the requirements of the IPT, from my perspective.
@albenson-usgs @EstebanMH-SiB
I 100% agree. I need to check the IPT warnings to understand what is missing. The current validator is new (it was released 3 weeks ago) and it uses a different approach: it simply runs the production interpretation/indexing and then collects metrics. As a first step I can add a warning to the validator; then, when we add the data preview tab, it will be possible to show the values of all duplicated identifiers.
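A non-blocking warning of the kind discussed above could look roughly like this. This is only a sketch: the function name, the `"WARNING"` severity label and the shape of the returned dictionary are assumptions for illustration, not the validator's actual API.

```python
from collections import Counter

def duplicate_identifier_warning(identifiers, field="occurrenceID", max_examples=10):
    """Return a warning describing duplicated identifiers, or None if all are unique."""
    counts = Counter(identifiers)
    duplicated = {value: n for value, n in counts.items() if n > 1}
    if not duplicated:
        return None
    return {
        "severity": "WARNING",                      # hypothetical label; does not reject the data
        "field": field,
        "duplicated_values": len(duplicated),       # how many identifiers are repeated
        "affected_lines": sum(duplicated.values()), # how many lines carry a repeated identifier
        "examples": sorted(duplicated)[:max_examples],
    }

# Hypothetical identifiers: "b" appears twice, "c" three times.
print(duplicate_identifier_warning(["a", "b", "b", "c", "c", "c"]))
```

Capping the examples keeps the message readable for large files; a full list of duplicated values would fit better in a data preview.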
Validator says file can be indexed although eventIDs aren't unique
The dataset cannot be indexed in UAT. I assume that this is because the eventIDs aren't unique (and possibly because there are issues with referential integrity). But the validator doesn't show these issues (presumably because the coreID column contains unique values).
File validated: edi.916.1.zip
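The two suspected problems, non-unique eventIDs in the core and extension rows pointing at eventIDs missing from the core, can be checked independently of the coreID column. The sketch below assumes rows have already been read into dictionaries (for example with `csv.DictReader`); the function name and sample rows are hypothetical and are not taken from edi.916.1.zip.

```python
from collections import Counter

def check_event_ids(core_rows, extension_rows, key="eventID"):
    """Report duplicated core eventIDs and extension eventIDs absent from the core."""
    core_counts = Counter(row[key] for row in core_rows)
    duplicated = sorted(v for v, n in core_counts.items() if n > 1)
    # Referential integrity: every extension eventID should exist in the core.
    orphaned = sorted({row[key] for row in extension_rows} - set(core_counts))
    return {"duplicated": duplicated, "orphaned": orphaned}

# Hypothetical rows: "ev-1" is repeated in the core, "ev-9" is not in the core at all.
core = [{"eventID": "ev-1"}, {"eventID": "ev-1"}, {"eventID": "ev-2"}]
extension = [{"eventID": "ev-2"}, {"eventID": "ev-9"}]
print(check_event_ids(core, extension))  # {'duplicated': ['ev-1'], 'orphaned': ['ev-9']}
```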
Github user: @ManonGros System: Safari 14.1.1 / Mac OS X 10.15.7 Referer: https://www.gbif.org/tools/data-validator/1635188739201 Window size: width 1371 - height 797 System health at time of feedback: OPERATIONAL