programmatic metadata ingest

NCAR / rda-image-archive-dev

Obsolete development code for the RDA image archive. Replaced by `rda-image-archive`.

https://github.com/NCAR/rda-image-archive

MIT License


programmatic metadata ingest #8

Open coltongrainger opened 5 years ago

coltongrainger commented 5 years ago

Importing logbooks could look like:

1. download
2. copy to (assign UUID, verify file integrity)
3. harvest metadata from original archive
4. update dataset
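
A minimal sketch of the copy step above (assign a UUID, verify file integrity), assuming SHA-256 as the checksum and a flat destination directory. The function names `sha256sum` and `ingest_file`, and the keys of the returned record, are hypothetical and not taken from this repository:

```python
import hashlib
import shutil
import uuid
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_file(src: Path, archive_dir: Path) -> dict:
    """Copy src into archive_dir under a fresh UUID and verify the copy."""
    file_uuid = str(uuid.uuid4())
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the archived file is named by its UUID, keeping the extension.
    dest = archive_dir / (file_uuid + src.suffix)
    shutil.copy2(src, dest)

    src_hash, dest_hash = sha256sum(src), sha256sum(dest)
    if src_hash != dest_hash:
        dest.unlink()  # do not leave a corrupt copy behind
        raise IOError(f"checksum mismatch while copying {src}")

    # Hypothetical record layout; harvesting and dataset updates would build on it.
    return {
        "uuid": file_uuid,
        "original_path": str(src),
        "archive_path": str(dest),
        "sha256": dest_hash,
    }
```

Something like `ingest_file(Path("download/page.jpg"), Path("archive"))` would then return one record per file, which the later harvest and update steps could consume.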

Eventually, I'd like updates to this prototype to be compatible with methods Zaihua Ji describes here: https://sea.ucar.edu/conference/2012/operational-dataset-update-RDA.

coltongrainger commented 5 years ago

programmatic metadata ingest

coltongrainger commented 5 years ago

I'd like to incorporate updates to metadata from transcription or OCR as part of this issue.

coltongrainger commented 5 years ago

Talked with Philip today. See https://unix.stackexchange.com/questions/2161/rsync-filter-copying-one-pattern-only/2503#2503 for an rsync include / exclude filter format.
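
For reference, a sketch of how that kind of include / exclude filter could be assembled from Python. The rsync flags (`-a`, `--prune-empty-dirs`, `--include`, `--exclude`) are standard; the helper names `build_rsync_command` and `sync` and the example pattern are assumptions for illustration:

```python
import subprocess
from typing import List


def build_rsync_command(src: str, dest: str, pattern: str) -> List[str]:
    """Build an rsync call that copies only files matching one pattern."""
    return [
        "rsync",
        "-a",                    # archive mode: recurse, keep times and permissions
        "--prune-empty-dirs",    # skip directories left with no matching files
        "--include=*/",          # descend into every directory
        f"--include={pattern}",  # keep files matching the pattern
        "--exclude=*",           # drop everything else
        src,
        dest,
    ]


def sync(src: str, dest: str, pattern: str) -> None:
    """Run the rsync command; raises CalledProcessError on failure."""
    subprocess.run(build_rsync_command(src, dest, pattern), check=True)
```

The order matters: rsync applies the first matching rule, so the two includes have to come before the catch-all exclude.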

coltongrainger commented 5 years ago

I'd like to be prepared to extract metadata from

- unnormalized CSV files,
- normalized JSON files, and
- normalized directory hierarchies.

I'm looking for a data exchange format specification now. (Ideally JSON, before inserting into SQL?)
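
As a rough illustration of the JSON idea, here is a sketch that folds an unnormalized CSV (one row per image, with platform and document fields repeated) into nested JSON. The column names `platform_name`, `document_title`, and `image_file` and the output layout are assumptions, not an agreed exchange format:

```python
import csv
import io
import json


def csv_to_exchange_json(csv_text: str) -> str:
    """Group flat CSV rows into platform -> documents -> images."""
    platforms = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Repeated platform and document values collapse into one entry each.
        platform = platforms.setdefault(
            row["platform_name"],
            {"name": row["platform_name"], "documents": {}},
        )
        document = platform["documents"].setdefault(
            row["document_title"],
            {"title": row["document_title"], "images": []},
        )
        document["images"].append({"file": row["image_file"]})

    # Convert the lookup dicts to lists for a plain JSON structure.
    result = [
        {"name": p["name"], "documents": list(p["documents"].values())}
        for p in platforms.values()
    ]
    return json.dumps({"platforms": result}, indent=2)
```

A nested document like this could be validated on its own and then loaded into SQL tables, one table per nesting level.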

coltongrainger commented 5 years ago

I wrote out functions for end users to recursively assign UUIDs and create CSV templates here: scripts/2019-06-26-data-exchange-formatting.py.
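
The general idea, as a sketch only (this is not the contents of that script): walk a directory tree, give each file a UUID, and write a CSV whose remaining columns are left blank for the user to fill in. The function name `write_csv_template` and the column names are hypothetical:

```python
import csv
import os
import uuid
from pathlib import Path

# Hypothetical template columns; only the first two are filled automatically.
TEMPLATE_FIELDS = ["uuid", "relative_path", "platform", "document", "notes"]


def write_csv_template(root: Path, out_csv: Path) -> int:
    """Write one row per file under root and return the number of rows."""
    count = 0
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=TEMPLATE_FIELDS)
        writer.writeheader()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                path = Path(dirpath) / name
                if path.resolve() == out_csv.resolve():
                    continue  # skip the template itself if it lives under root
                # Missing columns are written as empty strings by DictWriter.
                writer.writerow({
                    "uuid": str(uuid.uuid4()),
                    "relative_path": path.relative_to(root).as_posix(),
                })
                count += 1
    return count
```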

coltongrainger commented 5 years ago

Sam mentioned that harvesting metadata from a directory structure, e.g.,

platform
 |
document
 |
image (files)

could be achieved if the data from the archive is unnormalized and placed redundantly throughout the directories, or if there's a separate archive.csv file against which to make foreign key references.
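
A sketch of the second option, assuming a `platform/document/image` layout exactly three levels deep and an archive.csv with a `platform` column that matches the top-level directory names. The function name `harvest` and the record layout are hypothetical:

```python
import csv
from pathlib import Path


def harvest(root: Path, archive_csv: Path) -> list:
    """Join image paths under root against platform rows in archive_csv."""
    with archive_csv.open(newline="") as f:
        # Index archive.csv by its platform column (the foreign key target).
        platforms = {row["platform"]: dict(row) for row in csv.DictReader(f)}

    records = []
    for image in sorted(root.glob("*/*/*")):
        if not image.is_file():
            continue
        # The directory names themselves carry the platform and document keys.
        platform, document = image.relative_to(root).parts[:2]
        records.append({
            "platform": platform,
            "document": document,
            "image": image.name,
            # None when archive.csv has no row for this platform directory.
            "platform_metadata": platforms.get(platform),
        })
    return records
```

With this approach the directories stay free of redundant metadata files; the trade-off is that directory names must be kept in sync with the keys in archive.csv.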