gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Check for references to core IDs that do not exist #1246

Open kbraak opened 8 years ago

kbraak commented 8 years ago

The IPT should validate that every core ID used in the extension(s) references a core ID that actually exists in the core file.

Please note, this check can currently be performed by http://tools.gbif.org/dwca-validator/
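
To make the requested check concrete, here is a minimal sketch in Python (not IPT code). It assumes tab-delimited files with a header row, a core identifier column named `id`, and an extension column named `coreid`. In a real archive the column positions come from meta.xml, so these column names and the function `find_orphan_core_ids` are illustrative assumptions only.

```python
import csv
import io


def find_orphan_core_ids(core_file, extension_file,
                         core_id_column="id",
                         ext_core_id_column="coreid",
                         delimiter="\t"):
    """Return (line_number, coreid) pairs for extension rows whose
    coreid does not match any record in the core file."""
    # Collect every identifier that exists in the core file.
    core_ids = {row[core_id_column]
                for row in csv.DictReader(core_file, delimiter=delimiter)}

    orphans = []
    # The header is line 1, so the first data row is line 2.
    reader = csv.DictReader(extension_file, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=2):
        if row[ext_core_id_column] not in core_ids:
            orphans.append((line_number, row[ext_core_id_column]))
    return orphans


# Tiny in-memory example: the second extension row points at a
# core record ("occ-3") that does not exist.
core = io.StringIO(
    "id\tscientificName\n"
    "occ-1\tPuma concolor\n"
    "occ-2\tLynx lynx\n"
)
extension = io.StringIO(
    "coreid\tidentifier\n"
    "occ-1\thttp://example.org/a.jpg\n"
    "occ-3\thttp://example.org/b.jpg\n"
)
orphans = find_orphan_core_ids(core, extension)
print(orphans)  # [(3, 'occ-3')]
```

For the sample data this reports the extension record on line 3, whose `coreid` value `occ-3` has no matching core record.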

kbraak commented 8 years ago

Related issue in DwC-A validator: http://dev.gbif.org/issues/browse/TOOL-7

kbraak commented 7 years ago

@cgendreau I anticipate the IPT will do referential integrity checks on DwC-As by making external calls to the GBIF Data Validator API.

For large datasets, the data validator may take hours to finish. Instead of making the user wait for the results, could they be sent directly to the user's email? The user would of course have to provide an email address in the API request for this to work. Thanks.

cgendreau commented 7 years ago

There is no plan to send a response by e-mail at the moment. If we were to do that, we would very likely use the GBIF login instead.

Running the validation on a large dataset won't take hours if we do not interpret all records.

CecSve commented 1 year ago

Relevant issue on portal feedback.

mike-podolskiy90 commented 1 year ago

@CecSve Thank you for the comment. That's going to be a pretty expensive check. We can probably consider validating references for relatively small datasets.
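
As a rough illustration of the "small datasets only" idea, the sketch below skips the check once the core exceeds a size limit. The cost it tries to bound is the in-memory set of core IDs, which grows with the number of core records. The cutoff `MAX_CORE_RECORDS` and the function `check_references_if_small` are hypothetical; nothing here reflects an actual IPT setting.

```python
from typing import Iterable, Optional, Set

# Hypothetical cutoff; a real limit would need to be chosen by measurement.
MAX_CORE_RECORDS = 100_000


def check_references_if_small(core_ids: Iterable[str],
                              extension_core_ids: Iterable[str],
                              max_core_records: int = MAX_CORE_RECORDS
                              ) -> Optional[Set[str]]:
    """Return the set of extension coreid values missing from the core,
    or None if the core is larger than the limit and the check is skipped."""
    known = set()
    for count, core_id in enumerate(core_ids, start=1):
        if count > max_core_records:
            # Too large: give up before holding every ID in memory.
            return None
        known.add(core_id)
    # Single pass over the extension; only dangling values are kept.
    return {ref for ref in extension_core_ids if ref not in known}
```

Returning `None` rather than an empty set keeps "check skipped" distinguishable from "no problems found", so the publisher could be told which of the two happened.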

CecSve commented 1 year ago

@mike-podolskiy90 it seems to be within the scope of the new data model, though (point 5)?

https://github.com/gbif/ipt/issues/1736#issue-1123931961

mike-podolskiy90 commented 1 year ago

Yes, but that is a frictionless data package, and those checks will be performed by the frictionless library itself.

CecSve commented 1 year ago

> Yes, but that is a frictionless data package, and those checks will be performed by the frictionless library itself.

Would that mean that the publisher would not get any notification similar to the messages they receive when publishing currently?

mike-podolskiy90 commented 1 year ago

No. The data package would not be generated, and the validation errors would be displayed.

CecSve commented 1 year ago

> No. The data package would not be generated, and the validation errors would be displayed.

Would the checks and validation errors only be for publishers using the frictionless packages? Or is it planned to also have such checks for regular DwC archives?

mike-podolskiy90 commented 1 year ago

It is not planned.

CecSve commented 1 year ago

OK. I will not open a new issue, since the original issue here already captures what I would suggest.

Ideally, the IPT should validate the referential integrity of DwC-As to catch mismappings, and potentially stop the generation of an archive if the publisher does not fix the issues. Relevant issues for the inclusion of referential integrity checks are:

https://github.com/gbif/portal-feedback/issues/4522
https://github.com/gbif/portal-feedback/issues/4491
https://github.com/gbif/portal-feedback/issues/3766

@ManonGros please add to this if I am missing something
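
The blocking behaviour suggested in this comment could look roughly like the sketch below: generation is refused, with a message naming the dangling values, until the references are fixed. `ReferentialIntegrityError`, `generate_archive`, and the `write_archive` callable are hypothetical stand-ins and do not correspond to existing IPT classes or methods.

```python
class ReferentialIntegrityError(Exception):
    """Raised when extension records point at core IDs that do not exist."""


def generate_archive(core_ids, extension_core_ids, write_archive):
    """Run the referential integrity check, then build the archive.

    write_archive is a hypothetical callable standing in for the real
    archive generation step; it is only invoked if the check passes.
    """
    known = set(core_ids)
    dangling = sorted({ref for ref in extension_core_ids if ref not in known})
    if dangling:
        # Show at most ten examples so the message stays readable.
        examples = ", ".join(dangling[:10])
        raise ReferentialIntegrityError(
            f"{len(dangling)} coreid value(s) have no matching core record: {examples}"
        )
    return write_archive()
```

Raising before `write_archive` is called means a mismapped resource never produces a partial archive, and the publisher sees which values to correct.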

ManonGros commented 2 months ago

Another issue related to referential integrity: https://github.com/gbif/portal-feedback/issues/5359#issuecomment-2176844133