edgi-govdata-archiving / overview


Automate provenance or metadata generation for downloading #33

Closed dcwalk closed 7 years ago

dcwalk commented 7 years ago

From feedback in #30 — primarily a concern for people working at in-person events and using the workflow document

Chihacks:

Encourage the tools people or the baggers to create a video showing how to run a web crawler and record provenance info
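As a rough illustration of "recording provenance info" alongside a crawl, here is a minimal sketch of a sidecar record a download wrapper might emit per resource. The field names and the `provenance_record` helper are assumptions for illustration, not an agreed-on schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(url, content, tool="example-crawler 0.1"):
    """Build a provenance sidecar record for one downloaded resource.

    Field names here are illustrative, not a fixed standard.
    """
    return {
        "source_url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "tool": tool,
    }

# Example: record provenance for one (fake) downloaded file
record = provenance_record("https://example.gov/data.csv", b"a,b\n1,2\n")
print(json.dumps(record, indent=2))
```

A crawler wrapper could write one such JSON file next to each downloaded file, so the checksum and retrieval time travel with the data.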

dcwalk commented 7 years ago

@b5 thoughts on this?

dcwalk commented 7 years ago

Copying in @emilymae 's comments

When you're talking about provenance, what sort of metadata / provenance info do you mean exactly? Are you talking about crawlers that generate WARCs or not? I know the DataONE folks have developed the ProvONE data model extending PROV for scientific workflows (see here: http://vcvcomputing.com/provone/provone.html), but I'm not sure how you would implement something like that for crawlers.
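To make the PROV idea concrete for crawlers: a crawl can be described with PROV's core entity/activity/agent terms, e.g. "this WARC was generated by this crawl activity, which was run by this crawler". The sketch below loosely follows the PROV-JSON serialization; the prefixes, identifiers, and timestamps are made up for illustration:

```python
import json

# A PROV-style description of one crawl: the WARC file (entity) was
# generated by a crawl activity, which was associated with an agent
# (the crawler). Structure loosely follows PROV-JSON; ids are invented.
prov = {
    "prefix": {"ex": "http://example.org/archiving#"},
    "entity": {
        "ex:warc-2017-01-20": {"ex:sourceUrl": "https://example.gov/dataset"},
    },
    "activity": {
        "ex:crawl-42": {
            "prov:startTime": "2017-01-20T15:00:00Z",
            "prov:endTime": "2017-01-20T15:07:31Z",
        },
    },
    "agent": {"ex:crawler": {"ex:version": "0.1"}},
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:warc-2017-01-20",
                 "prov:activity": "ex:crawl-42"},
    },
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:crawl-42", "prov:agent": "ex:crawler"},
    },
}
print(json.dumps(prov, indent=2))
```

Even this flat record answers the basic provenance questions (what was captured, when, by what tool) without committing to the full ProvONE workflow model.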

birdage commented 7 years ago

I'd suggest looking at ISO compliance standards for data ingest and data download. I can say from experience that having the correct metadata is key.

- https://www.ncddc.noaa.gov/metadata-standards/
- https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Data_Quality
- http://cfconventions.org/
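For a sense of what "correct metadata" at ingest might look like, here is a sketch of a record with fields loosely modeled on ISO 19115 core metadata elements (title, abstract, date, responsible party, lineage). The values and field names are illustrative assumptions, not a conformant ISO record:

```python
import json

# Fields loosely inspired by ISO 19115 core metadata elements; a sketch
# of what an ingest pipeline might capture, not a conformant record.
metadata = {
    "title": "Example downloaded dataset",
    "abstract": "Snapshot of a government dataset taken at an archiving event.",
    "date": "2017-01-20",
    "responsible_party": "event volunteer (hypothetical)",
    "lineage": "Downloaded with a web crawler; checksum recorded at ingest.",
    "distribution_format": "CSV",
}
print(json.dumps(metadata, indent=2))
```

The lineage field in particular overlaps with the provenance discussion above: it records how the copy came to exist, which is what downstream users of an archived dataset need to trust it.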

dcwalk commented 7 years ago

This has to some degree been addressed in the Archiver app: when using harvester tools, metadata from the previous phases is already created. However, a larger discussion around collaborating across metadata standards is emerging (https://github.com/edgi-govdata-archiving/dataset-registries)