NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

Allow metadig-engine to access data and metadata directly from the file system #432

Open jeanetteclark opened 6 months ago

jeanetteclark commented 6 months ago

It would be much more efficient to access the data and metadata directly from the file system where possible, especially for data quality checks.

The engine should use the hashstore library to access files directly and pass them to checks.

This change should remain compatible with the existing method of getting data/metadata, since the engine will not always run on the same machine the data are stored on (e.g., ESS-DIVE).

jeanetteclark commented 4 months ago

breaking this down into some steps with implementation options:

  1. get a list of data identifiers for an incoming metadata pid
    • [x] solr query (to be implemented now)
    • [ ] parsing annotations in hashstore (to be added as an alternate implementation later, when this feature exists in hashstore)
  2. pass those pids to the dispatcher where they will be handled
    • the best place to do this is probably runSuite where we can detect if it's a data suite and only make the call to get the pids if needed
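
A rough sketch of step 1 (the Solr option), in Python for illustration only; the engine itself would do this in its own code. It assumes a DataONE-style Solr index where data objects carry an `isDocumentedBy` field pointing at their metadata pid. The endpoint URL and both function names are hypothetical placeholders, not part of metadig-engine.

```python
from urllib.parse import urlencode

# Assumed endpoint: replace with the Solr query service of the node being assessed.
SOLR_ENDPOINT = "https://cn.dataone.org/cn/v2/query/solr/"


def build_data_pid_query(metadata_pid, rows=1000):
    """Build a Solr query URL asking for the ids of objects documented by metadata_pid."""
    # Escape backslashes and double quotes so the pid is safe inside a quoted Solr term.
    escaped = metadata_pid.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'isDocumentedBy:"{escaped}"',  # assumed index field name
        "fl": "id",       # only the identifier is needed
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_ENDPOINT}?{urlencode(params)}"


def extract_data_pids(solr_response):
    """Pull the list of pids out of a parsed Solr JSON response (a dict)."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc["id"] for doc in docs if "id" in doc]
```

Keeping the query builder separate from the response parser means the lookup can be skipped entirely for non-data suites, as suggested for runSuite above, and the parser can be tested without a live index.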

jeanetteclark commented 1 month ago

updating this issue - this task got passed off to the check code to handle. metadig-engine will get the list of data pids, and pass them to the check, where the check will use python hashstore to access the data. The only change required of metadig-engine here is that I added a store configuration to metadig.properties, which also gets passed to the check code.
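
For illustration, a minimal sketch of how a check might use the pids it receives, preferring the local store and falling back to the existing remote retrieval when the object is not available on disk. It assumes the store object exposes a `retrieve_object(pid)` method returning a readable binary stream, which is my understanding of the Python hashstore interface and should be verified against that library. `open_data_object` and `fetch_remote` are hypothetical names, not existing metadig code.

```python
import io


def open_data_object(pid, store=None, fetch_remote=None):
    """Return a readable binary stream for pid, preferring direct file-system access.

    store        -- object with retrieve_object(pid) (assumed hashstore-like), or None
    fetch_remote -- hypothetical callable pid -> bytes, standing in for the existing
                    network-based retrieval; used when the store cannot serve the pid
    """
    if store is not None:
        try:
            return store.retrieve_object(pid)
        except Exception:
            # Not in the local store (or store not readable): fall back below.
            pass
    if fetch_remote is None:
        raise LookupError(f"no local copy of {pid} and no remote fallback configured")
    return io.BytesIO(fetch_remote(pid))
```

The fallback keeps the check usable when the engine runs on a machine without access to the storage file system, which was a requirement stated at the top of this issue.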

see: