MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Script based harvesting #163

Open ghukill opened 6 years ago

ghukill commented 6 years ago

Looking at Supplejack's features, noticed a blurb about scripting the ingest of materials. Would be interesting to see how this might be possible with Combine.

Ingest (== harvesting) is currently only OAI and static files, but what if it supported raw python code, where it must just return the string of the Record metadata and, optionally, a record_id?

It would probably be difficult, if not outright a bad idea, to have this work with Spark. But it could run as a background task.
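A rough sketch of what that could look like, outside of Spark. Everything here is hypothetical and not part of Combine: the script is assumed to be a generator yielding `(metadata_string, record_id)` pairs, with `record_id` allowed to be `None`, and the "background task" is just a standard-library thread pool standing in for whatever task runner would actually be used.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def example_user_script():
    """Hypothetical user-written harvest script.

    Yields (metadata_string, record_id) pairs; record_id may be None.
    """
    yield "<record><title>First</title></record>", "rec-1"
    yield "<record><title>Second</title></record>", None


def run_script_harvest(script):
    """Run a harvest script and collect its records.

    When the script supplies no record_id, one is minted from an MD5 hash
    of the metadata string (an assumption made for this sketch).
    """
    records = []
    for metadata, record_id in script():
        if record_id is None:
            record_id = hashlib.md5(metadata.encode("utf-8")).hexdigest()
        records.append({"record_id": record_id, "document": metadata})
    return records


def submit_harvest(executor, script):
    """Queue the harvest on a worker thread and return a Future."""
    return executor.submit(run_script_harvest, script)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = submit_harvest(executor, example_user_script)
        for record in future.result():
            print(record["record_id"], record["document"])
```

The caller gets a Future back immediately, so the harvest does not block whatever kicked it off.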

ghukill commented 6 years ago

Thinking more on this, can see where harvesting from SQL or Mongo might be a helpful feature, and it would be relatively painful to provide parameters and handle this in Spark.

Script-based harvesting is still tricky. It would be nice to provide a place where users could script unusual or unique harvests (e.g. from an API), but it would be hard to run this efficiently in Spark.

An obvious approach would be to have users write pyspark code. These could be saved as files, and uploaded, then called by some boilerplate harvesting code like other Jobs:

from user_generated_file import custom_harvest_class
results = custom_harvest_class(job_id, etc, etc).harvest()

where custom_harvest_class is based on a known structure for custom harvesting.
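One way the "known structure" could be enforced, sketched in plain Python rather than pyspark. The loader name, the duck-typing check, and the `(record_id, metadata_string)` return shape are all assumptions for illustration, not anything Combine defines; only the `user_generated_file` and `custom_harvest_class` names come from the snippet above.

```python
import importlib.util


def load_harvest_class(path, class_name="custom_harvest_class"):
    """Load a user-uploaded Python file and return its harvest class.

    The "known structure" assumed here is minimal: a class whose instances
    expose a callable harvest() method.
    """
    spec = importlib.util.spec_from_file_location("user_generated_file", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the uploaded file

    harvest_class = getattr(module, class_name, None)
    if not isinstance(harvest_class, type):
        raise TypeError(f"{class_name} not found or not a class in {path}")
    if not callable(getattr(harvest_class, "harvest", None)):
        raise TypeError(f"{class_name} must define a harvest() method")
    return harvest_class


def run_custom_harvest(path, job_id, **params):
    """Boilerplate harvest call: instantiate the user's class and run it."""
    harvest_class = load_harvest_class(path)
    return list(harvest_class(job_id, **params).harvest())
```

An uploaded file would then only need something like a `custom_harvest_class` with an `__init__(self, job_id, **params)` and a `harvest()` that returns or yields records. Note that loading the file executes arbitrary user code, so this would only make sense for trusted users.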