Consolidate S3/offline file storage

trevorgerhardt commented 6 years ago

Initial implementation of consolidating our S3/file storage usage. Please give feedback on naming and implementation. I wanted to do a first pass that changed the least amount of code possible that could be updated later. Ideally we'd have an implementation we could initialize and pass between our of our libraries so that it's used the same way.

A few notes:

Ideally this would be initialized in the AnalysisServer and passed to what needs it, but our current style is having a lot of "static" instances so I went with that.
SeamlessCensus will still need to be separate. In OFFLINE mode we might want to just turn it off. Also may want to refactor it into a completely separate service that can be called with bounds, an AWS key, and a bucket/folder to dump into.
We should switch to a single bucket for everything. The only opposing reason I can think of is if we are creating automated AWS processes that get initiated when things are uploaded to a bucket (which would be counter to a true OFFLINE mode)

trevorgerhardt commented 6 years ago

@ansoncfit definitely went light on the comments expecting things to change drastically.

Ideally we'd share an instance of this type of class across all libraries so that files are stored and retrieved in the same way, using the same S3 client. So eliminate and replace what is currently in R5.
Are we monitoring buckets well enough right now for that part to matter? A single bucket setup would have top level directories within it separating data like opportunities and networks, so I'm not sure if that matters as much. It would be easier to abstract the persistence if we used a single bucket.

abyrd commented 6 years ago

Replying to @trevorgerhardt's response to @ansoncfit's review:

Agreed that we should put this storage mechanism somewhere that would allow a single instance to be used. It hasn't been around long enough to merit maintaining another separate project. R5 needs to use it when it's running as a worker, separately from analysis-backend. So I don't think this can be in analysis-backend, it needs to be in R5.
I talked to Anson about this the other day. We thought it could be useful to have a single configurable bucket as long as all resources are split into several sub-buckets with predefined names. You mention that "a single bucket setup would have top level directories" so it sounds like you intended the same thing. Seems like we all agree on this but that wasn't yet stated. The main problem is migration - it's bound to cause at least some short term mess to change the bucket hierarchy.

abyrd commented 6 years ago

One more observation: When switching to a single top-level bucket, we could use a CATEGORY enum to force everything to be in standard locations (rather than open-ended sub-bucket strings).

abyrd commented 6 years ago

I'm thinking we can't resolve all the issues here in a single pull request. We should probably copy any relevant observations in the comments and PR reviews over to #85 to track long term progress on this issue.

abyrd commented 6 years ago

Was just noticing that R5 also uses S3 storage (fetching scenarios and networks) and would ideally share components to handle this. We may want to (eventually) move these components into R5 since analysis-backend imports R5 as a dependency.

conveyal / analysis-backend

Consolidate S3/offline file storage #162