OregonDigital / OD2

Next generation of Oregon Digital ( https://oregondigital.org ) digital collections platform, built on Samvera Hyrax ( https://github.com/samvera/hyrax/ )

WC5 - Investigate using Lambda for derivative generation #524

Closed: decimalator closed this 5 years ago

jechols commented 5 years ago

I think this ticket needs a lot more info from somebody. Are we willing to throw away / override the entire Hyrax core derivative logic to make this happen? (Personally, I am, but do I speak for the group?) How important is it to get this figured out in WC5? How important is it to make this happen even if it doesn't make it for WC5? (Should the issue topic specify the work cycle?)

Here's what I do know:

On ingest, the code assumes that derivative generation is a background job that's run from the Hyrax stack. There's definitely some research we'd need to figure out how to throw out those jobs and just fire off a lambda command. And of course we'd need to be able to view the results of whatever lambda things happen - in case derivatives fail or get stuck or whatever.

It seems like there's no direct information stored on an asset to tell it about its derivatives - it's just determined by code. This is good news in that we can generate derivatives however we want so long as we store items in S3 in a way that's easy to find using just an asset's id.
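A minimal sketch of what "easy to find using just an asset's id" could mean in practice. The key layout, the derivative kinds, and the `derivative_key` helper are all hypothetical and not taken from Hyrax or the OD2 codebase; the point is only that the S3 key is a pure function of the id, so nothing has to be stored on the asset itself.

```python
# Hypothetical key layout: none of these names come from Hyrax or OD2.
DERIVATIVE_EXTENSIONS = {
    "thumbnail": "jpg",
    "zoomable": "jp2",
    "text": "txt",
}


def derivative_key(asset_id: str, kind: str) -> str:
    """Build the S3 object key for a derivative from the asset id alone."""
    if not asset_id:
        raise ValueError("asset_id must not be empty")
    try:
        ext = DERIVATIVE_EXTENSIONS[kind]
    except KeyError:
        raise ValueError(f"unknown derivative kind: {kind!r}") from None
    # Same inputs always produce the same key, so lookups need no database.
    return f"derivatives/{asset_id}/{kind}.{ext}"
```

Whatever generates the derivative and whatever serves it would both call the same function, which is what keeps the two sides in agreement without extra metadata.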

Generating derivatives, fortunately, is trivial: I can produce derivatives for images and PDFs with a very simple bash script. But image derivatives need the graphicsmagick and openjpeg tools, and PDF derivatives have even more dependencies. I have done zero investigation into what can be installed on a Lambda server, so I'm really not sure about our options.
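For illustration only, here is roughly what the image half of such a script might boil down to, written as a Python function that builds the command lines without running them. This is a guess at the shape, not the actual script: the function name, the file paths, and the 1024-pixel bound are assumptions. It uses the GraphicsMagick `gm convert` command and OpenJPEG's `opj_compress`, both of which would have to be present in whatever environment runs the job.

```python
from typing import List


def image_derivative_commands(
    source: str, thumbnail: str, jp2: str, max_px: int = 1024
) -> List[List[str]]:
    """Return argv lists for a thumbnail and a JPEG 2000 derivative.

    Nothing is executed here; a caller could pass each list to
    subprocess.run(cmd, check=True) where the tools are installed.
    """
    return [
        # GraphicsMagick: scale to fit within max_px x max_px, keeping aspect ratio.
        ["gm", "convert", source, "-resize", f"{max_px}x{max_px}", thumbnail],
        # OpenJPEG: encode the source image as JPEG 2000.
        ["opj_compress", "-i", source, "-o", jp2],
    ]
```

The commands themselves are short; the open question is the dependency footprint, since both binaries and their libraries have to be packaged for the runtime.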

jechols commented 5 years ago

I have done a lot of digging and tinkering around to see if there's some way we can make this work, but I don't think it's possible. For small images, this wouldn't be an issue, but as images get larger, the RAM and disk needs exceed what Lambda can offer. And for video derivatives, I don't see any way we could hope to live within Lambda's constraints.
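A back-of-the-envelope way to see where large images run out of room. The limits below are assumptions (roughly 3008 MB of memory and 512 MB of scratch disk per function, which is what I understand Lambda offered around the time of this thread), and the `working_copies` factor is a made-up stand-in for the extra buffers an image tool holds while converting. Treat it as a rough estimate, not a measurement.

```python
# Assumed Lambda limits at the time of this thread; check current AWS docs.
LAMBDA_MAX_MEMORY_MB = 3008
LAMBDA_TMP_DISK_MB = 512

BYTES_PER_MB = 1024 * 1024


def decoded_size_mb(width: int, height: int, channels: int = 3,
                    bytes_per_channel: int = 1) -> float:
    """Approximate RAM needed to hold one uncompressed copy of an image."""
    return width * height * channels * bytes_per_channel / BYTES_PER_MB


def fits_in_lambda(width: int, height: int, working_copies: int = 2) -> bool:
    """Rough check: do `working_copies` decoded copies fit in memory?

    working_copies is a hypothetical fudge factor for source plus output
    buffers; real tools may need more or less.
    """
    needed_mb = decoded_size_mb(width, height) * working_copies
    return needed_mb <= LAMBDA_MAX_MEMORY_MB
```

Under these assumptions a 4000 x 3000 scan is comfortable, while a 40000 x 30000 one needs several gigabytes for a single decoded copy and does not fit. The disk limit matters separately, because the source file and its derivatives also have to sit in scratch space while the job runs.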

Investigating this has me a bit worried about the cost of derivatives more generally. In OD's current infrastructure, we have sixteen gigs of RAM dedicated to derivative workers. The vast majority of the time, this RAM goes unused, but when there are audio and video derivative jobs, I'm betting we max out RAM pretty fast. We don't want our new infrastructure to have to have 16 gigs just sitting unused at all times, nor do we want the workers to share resources with the web server. AWS Lambda seemed like a perfect way to mitigate this problem, but it seems infeasible. So....

jechols commented 5 years ago

@decimalator assigning this to you: I think we need somebody more on the ops side to verify that lambda isn't a good fit for this. I have literally never investigated AWS Lambda prior to Tuesday.

straleyb commented 5 years ago

@decimalator Can we close this since we aren't using AWS?