cioos-siooc / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites.
http://ckan.org/

clean up and formalize the datastream harvester #161

Closed fostermh closed 1 year ago

fostermh commented 1 year ago

DataStream Metadata Comparison

Issues to resolve:

fostermh commented 1 year ago

We will support automated translation during harvest. We need to figure out how to do this without overloading the translation service.
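One way to keep the load down is to cache translations and enforce a minimum delay between calls to the service. The following is a minimal sketch under stated assumptions, not the harvester's actual code: `ThrottledTranslator`, `min_interval`, and the wrapped `translate` callable are all hypothetical names, and the real translation client would be passed in by the caller.

```python
import time
from typing import Callable, Dict, Optional


class ThrottledTranslator:
    """Wrap a translation callable with a cache and a minimum delay between calls.

    Hypothetical helper: `translate` stands in for whatever client actually
    talks to the translation service.
    """

    def __init__(
        self,
        translate: Callable[[str], str],
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._translate = translate
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_call: Optional[float] = None

    def __call__(self, text: str) -> str:
        # Repeated strings (licences, organization names) never hit the service twice.
        if text in self._cache:
            return self._cache[text]

        # Wait until at least `min_interval` seconds have passed since the last request.
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)

        result = self._translate(text)
        self._last_call = self._clock()
        self._cache[text] = result
        return result
```

The cache only lives for the lifetime of the object, so a real harvest job would probably want to persist it between runs; that part is left out here.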

fostermh commented 1 year ago

The automatic translation appears to be working. DataStream XML still fails validation, mostly because of the order of fields in the XML. Attached are an original XML file from DataStream and a 'fixed' version that passes validation in Oxygen.

Original (failing ISO validation)

c6e71281-b732-47e7-8c24-f9430d7edf26.iso19115.xml.txt

Corrected (passing ISO validation)

c6e71281-b732-47e7-8c24-f9430d7edf26.iso19115.FIX.xml.txt
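To illustrate why field order matters: XML schemas that declare child elements in a sequence reject documents whose children appear in a different order, even when every element is present. The sketch below reorders an element's children to follow an expected order. It is an illustration only, not a validator. `EXPECTED_ORDER` is an assumed, partial list of element names for the metadata root; the authoritative order must be taken from the schema that the XML actually declares.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: illustrative, partial child order for the metadata root element.
# Take the real order from the schema referenced by the document.
EXPECTED_ORDER = [
    "fileIdentifier",
    "language",
    "contact",
    "dateStamp",
    "identificationInfo",
]


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def reorder_children(element: ET.Element, expected_order: list) -> ET.Element:
    """Sort direct children of `element` into `expected_order`, in place.

    Children whose names are not listed are moved to the end and keep their
    relative order, because list.sort is stable.
    """
    rank = {name: i for i, name in enumerate(expected_order)}
    children = list(element)
    children.sort(key=lambda child: rank.get(local_name(child.tag), len(rank)))
    element[:] = children
    return element
```

A real fix would need to apply the same idea recursively, since nested elements have their own required order.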

willfarrell commented 1 year ago

DataStream v3: We're actively working on the next version of DataStream to address a few items from your list. This includes allowing metadata to be entered in English and French and changing organization from a simple text input to a proper object (to remove duplicates, attach ROR/URL and separate out individuals), along with many more improvements.

1-4: Will be supported in v3.
5: What does EOV mean?
6: Currently all download links are signed URLs for security. There is a backlog item to allow direct download; currently a request to the API is required.
7: We do collect this, but it may not be in the ISO 19115 template. I've created an issue to add this in.
8: We do generate this, but it may not be in the ISO 19115 template. I've created an issue to add this in.
10: More organization metadata will be available in v3.
11: Individuals will be separated in v3.
12: Data steward email is a required field and exists for all datasets.
13: I've created an issue to address this.
14: I've created an issue to address this. I didn't know order in the XML was important to the standard. Any other validation issues you've seen? Is there an online tool to test this? I'll use your example to apply suggested changes.
15: I've created an issue to address this.
16: I've created an issue to address this.
17: I've created an issue to address this. We'll reach out to groups that are affected by this.
18: I've created an issue to fix what I can now; the rest will have to be addressed in v3.
19: I've created an issue to fix what I can now; the rest will have to be addressed in v3.

I should be able to address the ones I've created an issue for this week.

fostermh commented 1 year ago

I recommend Oxygen as an XML validator/editor. You can get a free 30-day trial by providing an email address. It will read the namespaces in the XML header and run the appropriate validation for you, so there is no need to download XSDs and such.

EOV stands for 'Essential Ocean Variables'. Basically, it's how we group datasets in CIOOS. You can see them listed on the main page of the national catalogue. We likely need to add more options to capture freshwater sampling. The EOVs are populated by matching keywords.
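Roughly, the keyword matching works along these lines. This is a simplified sketch, not the actual harvester code: the `EOV_KEYWORDS` table, its EOV identifiers, and its keyword terms are hypothetical placeholders, and the real list of EOVs and matching rules may differ.

```python
# HYPOTHETICAL mapping from EOV identifier to keywords that imply it.
EOV_KEYWORDS = {
    "seaSurfaceTemperature": {"sea surface temperature", "sst"},
    "oxygen": {"dissolved oxygen", "oxygen"},
}


def match_eovs(keywords, eov_keywords=EOV_KEYWORDS):
    """Return the sorted EOV identifiers whose terms appear in `keywords`.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    normalized = {keyword.strip().lower() for keyword in keywords}
    return sorted(
        eov for eov, terms in eov_keywords.items() if normalized & terms
    )
```

A dataset whose keywords match nothing in the table simply gets no EOVs, which is why freshwater datasets would need additional entries.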

willfarrell commented 1 year ago

Update: A bunch of fixes were deployed last week. All ISO 19115 files were rebuilt over the weekend. I just need to run a few through the validator and make any necessary tweaks.

willfarrell commented 1 year ago
  1. Should be good to go
  2. Should be good to go
  3. Should be good to go, spot tested a few - let me know if you find any more
  4. Fixed
  5. Fixed
  6. Still waiting to hear back from data steward on a fix for their license, rest fixed
  7. Should be good to go
  8. Some fixes applied, rest may have to wait for v3

Let me know if there is anything else that I can address in the near term.

fostermh commented 1 year ago

Great! 271 datasets now pass validation and are harvested into my local test environment successfully. I have attached the most recent harvest report. 8 of the sitemap URLs appear to be broken, which leaves only 12 datasets with outstanding problems. Note that I thought I could split keywords on semicolons during harvest, but it's not working out, so keywords will need to be formatted as one keyword per XML keyword tag in v3, if possible.

datastream_harvest_job_report.json.txt
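For context, splitting on semicolons would amount to something like the sketch below. It is a generic illustration of the idea, not the harvester's code; `split_keywords` is a hypothetical helper. It cannot tell a delimiter from a semicolon that is legitimately part of a keyword, which is one reason one keyword per tag at the source is the more robust option.

```python
def split_keywords(raw_keywords):
    """Split semicolon-delimited keyword strings into individual keywords.

    Whitespace is trimmed, empty fragments are dropped, and duplicates are
    removed case-insensitively while keeping the first spelling seen.
    """
    seen = set()
    result = []
    for raw in raw_keywords:
        for part in raw.split(";"):
            keyword = part.strip()
            if keyword and keyword.lower() not in seen:
                seen.add(keyword.lower())
                result.append(keyword)
    return result
```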

willfarrell commented 1 year ago

Thanks, I'll take a look through the list. At first glance it looks like some are due to cache, but I do see some new ones.

willfarrell commented 1 year ago

More fixes pushed; hopefully that should do it. I have a fix for the sitemap too, but it can't be deployed just yet. The JSON report was very helpful.

fostermh commented 1 year ago

I think, for now, everything is working as expected. The harvester has been deployed to the development servers and is awaiting review. I'm going to close this issue for now. We can reopen it if new problems are found.