gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓
Can't handle DwC-As with measurementOrFact files #5313

Closed gbif-portal closed 6 months ago

gbif-portal commented 6 months ago

We've noticed that when we attempt to publish a Darwin Core Archive containing a measurementOrFact extension, the whole dataset is rejected by your system (e.g., https://logs.gbif.org/app/discover#/?_g=(filters:!(),refreshInterval:(display:On,pause:!f,value:0),time:(from:now-1y,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'439da4d0-290a-11ed-8155-a37cb1ead50e',key:level,negate:!f,params:(query:ERROR),type:phrase),query:(match_phrase:(level:ERROR)))),index:'439da4d0-290a-11ed-8155-a37cb1ead50e',interval:auto,query:(language:lucene,query:'datasetKey.keyword:%22b0515413-6d32-490a-83a0-f8c08f002c70%22%20AND%20attempt:%22167%22'),sort:!('@timestamp',desc))).

Could the system just ignore this file instead of rejecting the whole archive? Several of our portals now have this extension.


Github user: @themerekat
System: Chrome 124.0.0 / Windows 10.0.0
Referer: https://www.gbif.org/dataset/b0515413-6d32-490a-83a0-f8c08f002c70
Window size: width 1536 - height 703
System health at time of feedback: INFO

CecSve commented 6 months ago

Could it be a simple misspelling of the extension perhaps?

DwC-A data file »measurementOrFact.csv« does not exist at

It should be measurementOrFacts.csv

themerekat commented 6 months ago

Hm...wouldn't it need to be "measurementsOrFacts" if that was the case?

themerekat commented 6 months ago

Looks like Darwin Core kind of has it both ways 😅 [image]

CecSve commented 6 months ago

> Looks like Darwin Core kind of has it both ways 😅

Jeez. Well, it looks like you had it right to begin with: https://github.com/gbif/rs.gbif.org/blob/master/extension/measurements_or_facts_2024-02-19.xml and https://dwc.tdwg.org/terms/#measurementorfact.

Ok, back to the drawing board then. I will investigate.

CecSve commented 6 months ago

I downloaded the archive from the most recent endpoint and cannot see the file in it; however, you have added the extendedMeasurementOrFact extension to your meta.xml file. We use the meta.xml file to validate the content of the archive (see the message "Exception caught during metasyncing DwC-A [b0515413-6d32-490a-83a0-f8c08f002c70]" from the service crawler-dwca-metasync), and that validation throws the error shown in stack_trace: org.gbif.dwc.UnsupportedArchiveException: DwC-A data file »measurementOrFact.csv« does not exist.

If the meta.xml of the different archives keeps referring to such a file, the file should be included in the archive, whether or not it has any content. Does that make sense?
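
For anyone wanting to catch this before publishing, here is a minimal sketch of the same kind of check done locally. It is not the code GBIF's crawler runs; it only assumes that meta.xml sits at the root of the zip and uses the standard Darwin Core text namespace. The function name `missing_data_files` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of meta.xml elements per the Darwin Core text guidelines.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def missing_data_files(archive):
    """Return (rowType, location) pairs declared in meta.xml but absent from the zip.

    `archive` is a path or file-like object pointing at a zipped DwC-A.
    Assumption: meta.xml is stored at the root of the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read("meta.xml"))

    missing = []
    for element in root:
        # Only <core> and <extension> elements declare data files.
        if element.tag not in (DWC_TEXT_NS + "core", DWC_TEXT_NS + "extension"):
            continue
        for location in element.iter(DWC_TEXT_NS + "location"):
            path = (location.text or "").strip()
            if path and path not in names:
                missing.append((element.get("rowType"), path))
    return missing
```

Calling `missing_data_files("my-archive.zip")` (a placeholder path) on an archive whose meta.xml declares measurementOrFact.csv without shipping the file would return a non-empty list, which a batch export could treat as a reason not to publish.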

themerekat commented 6 months ago

Hm, so we have "extendedMeasurementOrFact" in the meta, but only "measurementOrFact" in the actual archive?

CecSve commented 6 months ago

> Hm, so we have "extendedMeasurementOrFact" in the meta, but only "measurementOrFact" in the actual archive?

I do not see either extension file in the archive, only the extendedMeasurementOrFact information in meta.

themerekat commented 6 months ago

> > Hm, so we have "extendedMeasurementOrFact" in the meta, but only "measurementOrFact" in the actual archive?
>
> I do not see either extension file in the archive, only the extendedMeasurementOrFact information in meta.

Aha, thanks for the clarification! It does appear that our batch process is not including the measurementOrFact file, but our single-publishing process is. I'll look into this! Feel free to close this issue.