MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

add S3 harvesting #372

Closed ghukill closed 5 years ago

ghukill commented 5 years ago

Now that S3 exporting is becoming a reality, would be nice to have the inverse ability to harvest from S3 as well.

Most straightforward would be accessing a Spark RDD at s3a://bucket/key. This RDD might include columns such as:

But, no reason this couldn't accommodate all the fields from the Record model? Or, give the user the opportunity to map the fields to expect in the RDD onto the fields it will need?
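A minimal sketch of what such a user-supplied mapping could look like, in plain Python so it runs without Spark. The names `REQUIRED_FIELDS`, `map_row`, and `field_map` are hypothetical, and the required field list is an assumption for illustration, not taken from the Record model:

```python
# Assumption: the harvester needs at least these fields. This list is
# illustrative only and is not read from Combine's Record model.
REQUIRED_FIELDS = ("record_id", "document")


def map_row(row, field_map):
    """Rename the keys of one harvested row according to field_map.

    field_map maps a source column name to a target field name.
    Columns that are not in field_map keep their original name.
    """
    mapped = {field_map.get(col, col): value for col, value in row.items()}
    missing = [field for field in REQUIRED_FIELDS if field not in mapped]
    if missing:
        raise ValueError(f"row is missing required fields: {missing}")
    return mapped


# Example: the source uses "id" and "xml" as column names
field_map = {"id": "record_id", "xml": "document"}
mapped = map_row({"id": "abc", "xml": "<record/>"}, field_map)
```

With PySpark, the same function could presumably be applied per row with something like `rdd.map(lambda r: map_row(r.asDict(), field_map))`, assuming the RDD holds `Row` objects.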

This also gets precariously close to static harvests, and suggests a possible improvement there that would make static harvests from S3 easier. Namely, make the first step of a static harvest of XML archives the creation of an RDD with document at the bare minimum. Then, develop record_id and other fields as needed. The handoff to the harvester would then be identical to that of an RDD loaded from S3.
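Roughly, the two steps could look like the sketch below, again in plain Python with lists of dicts standing in for the RDD. The function names are hypothetical, and deriving `record_id` from a SHA-256 hash of the document is only an assumption to make the example concrete:

```python
import hashlib


def documents_to_rows(documents):
    """Step 1: wrap each raw XML string in a row holding only 'document'."""
    return [{"document": doc} for doc in documents]


def add_record_id(row):
    """Step 2: fill in record_id when the row does not already carry one.

    Assumption: a SHA-256 hex digest of the document is used as the id.
    Any other derivation (XPath, filename, etc.) would slot in here.
    """
    if "record_id" in row:
        return row
    digest = hashlib.sha256(row["document"].encode("utf-8")).hexdigest()
    return {**row, "record_id": digest}


# Rows from an XML archive start with 'document' only, then gain record_id
rows = [add_record_id(r) for r in documents_to_rows(["<a/>", "<b/>"])]
```

Rows loaded from S3 that already include `record_id` pass through `add_record_id` unchanged, so both sources end up in the same shape before the handoff.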