buda-base / drs-deposit

Harvard DRS Deposit base

triggering json export to s3 #41

Closed eroux closed 6 years ago

eroux commented 6 years ago

I think it could be a nice workflow to execute the tojsondimensions.py script on the RS3 server directly, as it would minimize the overhead of sending the METS data to me. It uses the aws credentials of the user (stored in ~/.aws/). Could that be envisioned?
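For context, a minimal sketch of the credential lookup described here: boto3 picks up the keys from ~/.aws/ automatically when none are passed in code, so nothing sensitive has to live in the script itself.

```python
import boto3

# boto3 resolves credentials on its own: environment variables first,
# then ~/.aws/credentials, then instance/role credentials. Nothing
# needs to be hard-coded in the script.
session = boto3.Session()
creds = session.get_credentials()
if creds is None:
    raise SystemExit("no AWS credentials found (is ~/.aws/ configured?)")
print("using access key:", creds.access_key[:4] + "...")
```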

jimk-bdrc commented 6 years ago

Thanks Élie,

This is something that we could run on the machines that are making the batches, as a background task after the batch has been built.

(There are other, less intrusive architectures as well; see point 5 below.)

Some comments:

  1. YES to Python 3. This is what I’m using on Windhorse, the current DRS dev environment.
  2. It seems as if the S3 credentials are buried somewhere – in the boto3 library, or does boto3 look in ~/.aws?
  3. Which raises the question of dependencies:
     a. Will the running machine need a virtualenv?
     b. I (or the running background process) need to be in your cool AWS membership group.
  4. It would help if you could add command-line parameters (including the file name!); that will help me write a caller script. A usage() function would be nice.
  5. Can you advise how fast or slow this is? That will help us evaluate whether it’s no big deal to do it inline, or so expensive that we’d want to do it in the background. Background processing could be done by having the batch builder copy the descriptor.xml to a NAS folder, where a file-watching daemon ingests it (a sketch of such a daemon follows this list). That way, the original file is free to be relocated at will, and the whole system is less dependent on file paths.
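A minimal sketch of the daemon idea in point 5, assuming the third-party watchdog library; the watch path and the way the export script is invoked are hypothetical, not part of the actual batch tooling:

```python
import subprocess
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCH_DIR = "/mnt/nas/drs-inbox"  # hypothetical NAS drop folder

class DescriptorHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly dropped METS descriptors.
        if not event.is_directory and event.src_path.endswith(".xml"):
            # Hand the descriptor to the export script
            # (the script's real arguments may differ).
            subprocess.run(["python3", "tojsondimensions.py", event.src_path])

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(DescriptorHandler(), WATCH_DIR, recursive=False)
    observer.start()
    observer.join()  # run until interrupted
```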

Jim Katz

Buddhist Digital Resource Center

jimk@tbrc.org

+1 781.254.7537


eroux commented 6 years ago
  1. Cool for Python 3!
  2. Yes, boto3 magically looks at ~/.aws/.
  3. a. I've never used a virtualenv, so I'm not sure when it's needed... so I would think we don't need it.
     b. I'll send you some AWS credentials.
  4. Sure, I can do that; Python has a very civilized library for that purpose, certainly the best I've seen so far. (A sketch of what the command line might look like follows below.)
  5. It's very fast; I think it will mostly be limited by I/O, as the XML files are huge and there is some back-and-forth with remote S3... so CPU shouldn't be the performance bottleneck.
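Presumably the library meant in point 4 is argparse from the standard library, which also generates the usage text Jim asked for (via -h). A sketch of what the command line might look like; the option names are guesses, not the script's actual interface:

```python
import argparse

def main():
    # Hypothetical CLI; the real option names may differ.
    parser = argparse.ArgumentParser(
        description="Extract image dimensions from a METS file "
                    "and upload the result as JSON to S3."
    )
    parser.add_argument("mets_file", help="path to the METS descriptor XML")
    parser.add_argument("-b", "--bucket", default="example-bdrc-bucket",
                        help="destination S3 bucket (hypothetical default)")
    args = parser.parse_args()
    print(f"would process {args.mets_file} -> s3://{args.bucket}")

if __name__ == "__main__":
    main()
```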
xristy commented 6 years ago

Out of curiosity, why the need to access S3? The same image set is available locally to Windhorse.

eroux commented 6 years ago

Oh, S3 is where the output .json file will be uploaded, not where the images would be read from.
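For concreteness, the upload step with boto3 would look roughly like this; the file, bucket, and key names are made up for illustration:

```python
import boto3

# Credentials come from ~/.aws/, as discussed above.
s3 = boto3.client("s3")

# Hypothetical names: push the generated dimensions file to S3.
s3.upload_file("W12345-dimensions.json",
               "example-bdrc-bucket",
               "dimensions/W12345-dimensions.json")
```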

xristy commented 6 years ago

Ah. I forget these little things.

jimk-bdrc commented 6 years ago

I inserted a call to Elie's contrib/tojsondimensions.py into make-drs-batch.sh, which automatically creates the JSON files and uploads them to AWS.