Now that S3 exporting is becoming a reality, it would be nice to have the inverse ability: harvesting from S3 as well.
The most straightforward approach would be accessing a Spark RDD at `s3a://bucket/key`. This RDD might include columns such as:

- `document`: XML content
- `record_id`: if present, a pre-made identifier

But there's no reason this couldn't accommodate all the fields from the `Record` model. Or, we could give the user the opportunity to map the fields they expect in the RDD to the fields the harvester will need.
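As a rough sketch of what that user-supplied mapping might look like, the function below translates a raw RDD row into the fields the harvester expects, minting a `record_id` when none was supplied. All names here (`map_row`, `field_map`, the column names) are hypothetical, not Combine's actual API:

```python
import hashlib

def map_row(row, field_map):
    """Translate a raw row dict into the harvester's expected fields.

    field_map maps harvester field names (e.g. 'document', 'record_id')
    to the column names actually present in the user's RDD.
    """
    mapped = {target: row.get(source) for target, source in field_map.items()}
    # If no pre-made identifier was supplied, mint one from the document
    if not mapped.get("record_id"):
        mapped["record_id"] = hashlib.md5(
            mapped["document"].encode("utf-8")
        ).hexdigest()
    return mapped
```

In Spark terms, this would presumably be applied per-row (e.g. via `rdd.map`) after loading from the `s3a://` path; the plain-Python version just illustrates the shape of the mapping.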
This also gets precariously close to static harvests, and suggests a possible improvement there that would make static harvests from S3 easier. Namely, the first step for static harvests of XML archives would be the creation of an RDD with `document` at the bare minimum; then, derive `record_id` and other fields as needed. The handoff to the harvester would then be identical to an S3-loaded RDD.
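The two-stage idea above (bare `document` rows first, derived fields second) might look something like this, with plain Python standing in for the Spark stage; the SHA-1-of-document identifier is just one possible derivation, not a settled choice:

```python
import hashlib

def build_rows(documents):
    """First pass: rows with document only; second pass: derive record_id."""
    rows = [{"document": doc} for doc in documents]  # bare minimum
    for row in rows:
        # Derived identifier, so the handoff matches an S3-loaded RDD
        row["record_id"] = hashlib.sha1(
            row["document"].encode("utf-8")
        ).hexdigest()
    return rows
```

Whatever produced these rows — an unpacked XML archive or an S3 read — the harvester would see the same structure either way.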