gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Validator says file can be indexed although eventIDs aren't unique #3766

Open gbif-portal opened 2 years ago

gbif-portal commented 2 years ago

Validator says file can be indexed although eventIDs aren't unique

The dataset cannot be indexed in UAT. I assume that this is because the eventIDs aren't unique (and possibly because there are issues with referential integrity). But the validator doesn't show these issues (presumably because the coreID column contains unique values).

File validated: edi.916.1.zip
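
One way for a publisher to spot these problems before running the validator is to check the archive locally. A minimal sketch, assuming the archive has been unzipped into `event.txt` and `occurrence.txt` with an `eventID` column (the real file and column names depend on the archive's meta.xml mapping):

```python
# Pre-submission check for duplicate eventIDs and referential integrity
# in an unzipped Darwin Core Archive. File and column names are
# assumptions; adjust them to match the archive's meta.xml mapping.
import pandas as pd

events = pd.read_csv("event.txt", sep="\t", dtype=str)
occurrences = pd.read_csv("occurrence.txt", sep="\t", dtype=str)

# Duplicate eventIDs in the event core
dupes = events[events["eventID"].duplicated(keep=False)]
print(f"{dupes['eventID'].nunique()} eventID values are duplicated")

# Occurrence rows whose eventID does not exist in the event core
orphans = occurrences[~occurrences["eventID"].isin(events["eventID"])]
print(f"{len(orphans)} occurrence rows reference a missing eventID")
```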


Github user: @ManonGros
User: See in registry
System: Safari 14.1.1 / Mac OS X 10.15.7
Referer: https://www.gbif.org/tools/data-validator/1635188739201
Window size: width 1371 - height 797
System health at time of feedback: OPERATIONAL

EstebanMH-SiB commented 2 years ago

I do not know if this has been reported before, so sorry if you are already aware of it. I just want to note that the same problem happens with occurrence records: the validator does not flag anything when there are duplicate occurrenceIDs. Here is the link to the validation https://www.gbif.org/tools/data-validator/dbd36711-0d0f-4259-af3f-84b196811135 and the data file UNAL_UNABx3_20220223_Revision.xlsx
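
The same kind of local check works for occurrence datasets before uploading them. A minimal sketch, assuming the spreadsheet named in this thread has an `occurrenceID` column (pandas with openpyxl installed):

```python
# Quick check for duplicated occurrenceIDs in a spreadsheet before
# uploading it to the validator. The file name comes from this thread;
# the "occurrenceID" column is an assumption.
import pandas as pd

df = pd.read_excel("UNAL_UNABx3_20220223_Revision.xlsx", dtype=str)
dupes = df[df["occurrenceID"].duplicated(keep=False)]
print(f"{dupes['occurrenceID'].nunique()} occurrenceID values appear more than once")
print(dupes.sort_values("occurrenceID")[["occurrenceID"]])
```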

muttcg commented 2 years ago

@EstebanMH-SiB I think it is possible to highlight duplicates as a warning, but we don't reject data just because it has some duplicated identifiers.

You can see the numbers for that validation:

Number of lines: 1,340
Number of records indexed: 60

Here "Number of lines" is the actual number of lines in the file, and "Number of records indexed" is the number of records you will see on the portal.
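
In other words, rows that share the same identifier collapse into a single indexed record. A toy illustration of that counting logic (not the actual pipeline code):

```python
# Toy illustration of why "records indexed" can be lower than "lines":
# rows sharing the same identifier collapse into one indexed record.
# This is only the counting logic implied above, not GBIF pipeline code.
rows = ["occ-1", "occ-2", "occ-2", "occ-3", "occ-3", "occ-3"]

lines_in_file = len(rows)          # 6
records_indexed = len(set(rows))   # 3 unique identifiers
duplicates_dropped = lines_in_file - records_indexed

print(lines_in_file, records_indexed, duplicates_dropped)  # 6 3 3
```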

I created an issue for this here: https://github.com/gbif/pipelines/issues/679

muttcg commented 2 years ago

As a real example, eBird for 2020 had:

Records in the file: 741,037,572
Records in the index: 705,008,469

So roughly 36 million records in the file were duplicates.

eBird indexing history

EstebanMH-SiB commented 2 years ago

Thanks for the quick answer @muttcg. We agree that the warning would be useful.

But we are a bit confused about the topic of duplicated identifiers. As far as we know, it is not possible to publish a dataset with duplicated identifiers, at least not through the IPT (version 2.5.4), because it will not let you do it. Maybe there is a way to do it directly, but that is not something we can do easily.

We think the best option is for the validator to show that GBIF will not index a dataset with duplicated identifiers, because the current behaviour can be confusing for publishers who go through the process on their own.

For example, the validator says the dataset can be indexed, but the publisher will not be able to publish it in the IPT because some occurrenceIDs are duplicated, and the validator neither says so nor shows the problematic occurrenceIDs.

We look forward to your answer,

albenson-usgs commented 2 years ago

For example, the validator says the dataset can be indexed, but the publisher will not be able to publish it in the IPT because some occurrenceIDs are duplicated, and the validator neither says so nor shows the problematic occurrenceIDs.

Agree! It seems like there is a discrepancy between what GBIF allows and what the IPT allows, which is confusing. Since the IPT is the most common way for data to be shared with GBIF, the validator should match the requirements of the IPT, from my perspective.

muttcg commented 2 years ago

@albenson-usgs @EstebanMH-SiB

I 100% agree. I need to check the IPT warnings to understand what is missing. The current validator is new, it was released 3 weeks ago, and it uses a different approach: it simply runs the production interpretation/indexing and then collects metrics. As a first step I can add a warning to the validator; later, when we add a data preview tab, it will be possible to list the values of all duplicated identifiers.
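
A rough sketch of what such a warning could look like in the metrics step (purely illustrative; the function name and report structure are assumptions, not the actual gbif/pipelines code):

```python
# Purely illustrative sketch of a duplicate-identifier warning in the
# metrics step; function name and report structure are assumptions,
# not the actual gbif/pipelines code.
from collections import Counter

def duplicate_id_warning(identifiers):
    counts = Counter(identifiers)
    duplicated = {id_: n for id_, n in counts.items() if n > 1}
    if duplicated:
        return {
            "level": "WARNING",
            "message": f"{len(duplicated)} identifiers occur more than once; "
                       "duplicated rows will not create extra indexed records",
            "examples": list(duplicated)[:10],  # first few offending IDs
        }
    return None

print(duplicate_id_warning(["occ-1", "occ-2", "occ-2", "occ-3"]))
```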