Now that S3 exporting is becoming a reality, it would be nice to have the inverse ability: harvesting from S3 as well.
The most straightforward approach would be accessing a Spark RDD at `s3a://bucket/key`. This RDD might include columns such as:

- `document`: XML content
- `record_id`: if present, a pre-made identifier

But there's no reason this couldn't accommodate all the fields from the `Record` model. Or, we could give the user the opportunity to map the fields they expect in the RDD to the fields the harvester will need.
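As a rough sketch of what that user-supplied mapping might look like, the function below translates a raw RDD row into the fields the harvester expects, minting a `record_id` when none was supplied. All names here (`map_row`, `field_map`, the column names) are hypothetical, not Combine's actual API:

```python
import hashlib

def map_row(row, field_map):
    """Translate a raw row dict into the harvester's expected fields.

    field_map maps harvester field names (e.g. 'document', 'record_id')
    to the column names actually present in the user's RDD.
    """
    mapped = {target: row.get(source) for target, source in field_map.items()}
    # If no pre-made identifier was supplied, mint one from the document
    if not mapped.get("record_id"):
        mapped["record_id"] = hashlib.md5(
            mapped["document"].encode("utf-8")
        ).hexdigest()
    return mapped
```

In Spark terms, this would presumably be applied per-row (e.g. via `rdd.map`) after loading from the `s3a://` path; the plain-Python version just illustrates the shape of the mapping.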
This also gets precariously close to static harvests, and suggests a possible improvement there that would make static harvests from S3 easier. Namely, the first step for static harvests of XML archives would be the creation of an RDD with `document` at the bare minimum; then, derive `record_id` and other fields as needed. The handoff to the harvester would then be identical to an S3-loaded RDD.
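The two-stage idea above (bare `document` rows first, derived fields second) might look something like this, with plain Python standing in for the Spark stage; the SHA-1-of-document identifier is just one possible derivation, not a settled choice:

```python
import hashlib

def build_rows(documents):
    """First pass: rows with document only; second pass: derive record_id."""
    rows = [{"document": doc} for doc in documents]  # bare minimum
    for row in rows:
        # Derived identifier, so the handoff matches an S3-loaded RDD
        row["record_id"] = hashlib.sha1(
            row["document"].encode("utf-8")
        ).hexdigest()
    return rows
```

Whatever produced these rows — an unpacked XML archive or an S3 read — the harvester would see the same structure either way.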