documentcloud / cloud-crowd

Parallel Processing for the Rest of Us
https://github.com/documentcloud/cloud-crowd/wiki
MIT License

Authenticated inputs #2

Closed delagoya closed 14 years ago

delagoya commented 15 years ago

Right now CloudCrowd::Action assumes that the input resource is either a local file or accessible via a simple unauthenticated get request. There is a good case for needing to grab files from private buckets.

A stop-gap is to provide pre-authenticated URLs for the input, but it would be better to "do it right," especially since the save() call already authenticates with S3 credentials...
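
As a minimal sketch of that stop-gap: the application mints a time-limited pre-signed URL for the private object and passes it to CloudCrowd as an ordinary input URL. This uses the modern aws-sdk-s3 gem (not anything CloudCrowd ships with), and the bucket and key names are made up:

```ruby
require 'aws-sdk-s3'

# Generate a GET URL for a private object that expires in one hour.
presigner = Aws::S3::Presigner.new
input_url = presigner.presigned_url(
  :get_object,
  bucket:     'my-private-bucket',   # hypothetical bucket
  key:        'inputs/document.pdf', # hypothetical key
  expires_in: 3600
)

# input_url can now be fetched with a plain, unauthenticated GET
# by any CloudCrowd worker until it expires.
```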

jashkenas commented 15 years ago

I'm not so sure that we want to assume that your Application and your CloudCrowd installation share S3 credentials. Your application might want to serve authenticated URLs directly, it might want to point (potentially) anywhere on the web... Pre-authenticated URLs seem like the way to go -- and you can pass them through an HTTPS request to CloudCrowd for real security. Each side, the application server and CloudCrowd, is responsible for controlling access to its own content, and delivering accessible URLs to the other.

However, if there's a really clean way to make direct S3 happen, I'm all for it.

delagoya commented 15 years ago

Good points. I thought of using a config option (:use_asset_store_for_input => true) and/or an "s3://bucket/file" URL that would look in the S3 AssetStore for the input files, but I don't think either of these is a stellar option.

Closing the issue, as I think you are right that this stretches the bounds of the application.

jashkenas commented 15 years ago

Ok, but an "s3://" protocol sounds pretty cool. It would probably be against all specs and sensibility to add a top-level protocol like that, but semantically it makes sense: The file:// protocol only works when you're on the same filesystem, and breaks otherwise. The s3:// protocol could only work if you're sharing access credentials, and break otherwise. Something to consider adding.

delagoya commented 15 years ago

OK, I'll take a stab at implementing it tomorrow.

jashkenas commented 15 years ago

Here's some precedent for the notion:

http://p.eligrey.com/

delagoya commented 15 years ago

Do you want to abstract this out to any type of data store? E.g. have the protocol be

store://

and the AssetStore classes implement a get() method:

get(url, local_path) 
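
Something like the following, as a sketch: each backend implements get(url, local_path), and "store://" URLs resolve against whichever backend is configured. The class names, storage path, and fetch_from_s3 helper are hypothetical:

```ruby
require 'uri'
require 'fileutils'

module CloudCrowd
  # Sketch of an S3-backed store; fetch_from_s3 is a hypothetical helper
  # that reads an object using the shared credentials.
  class S3Store
    def get(url, local_path)
      key = url.sub(%r{\Astore://}, '')
      File.open(local_path, 'wb') { |f| f.write(fetch_from_s3(key)) }
    end
  end

  # Sketch of a filesystem-backed store, where get() is just a copy.
  class FilesystemStore
    STORAGE_PATH = '/tmp/cloud_crowd_storage' # assumed default

    def get(url, local_path)
      relative = url.sub(%r{\Astore://}, '')
      FileUtils.cp(File.join(STORAGE_PATH, relative), local_path)
    end
  end
end
```
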
jashkenas commented 15 years ago

That looks really, really nice. It would be totally seamless, and actions could return "store://" URLs as intermediate results for further processing. We need to think about what would happen when you call save(): should it return an authenticated URL or a store:// URL? How do you tell?

The other thing to think about is the protocol prefix. "store://" is nice and short, but I'm not sure I'd know what it meant if I wasn't familiar with it (it might look like a shortcut to Amazon). Maybe we should do "cloudcrowd://", if you need an AssetStore implementation to handle it, or maybe we should just YAGNI and go with "s3://" until we have another backend that needs custom protocols. I'm torn.

(Sorry it took a little while to post this, with GitHub down.)

delagoya commented 15 years ago

Also torn.

Question: would you still use file:// as an input URL even when the AssetStore is FileSystem? E.g., is it ever the case that you want to pull from the worker's local file system as well as push files to S3? If the answer is "yes," then let's YAGNI and just use s3://

Last question: should an exception be thrown (or the job marked as failed) if no S3 credentials are supplied in the config when a worker sees s3:// inputs?

jashkenas commented 15 years ago

I think that the answer is yes if you're using some sort of distributed filesystem backend (like a shared EBS under NFS). That seems like it would be a popular option, being arguably faster than S3. In that case we'd need to make LOCAL_STORAGE_PATH configurable (you know what -- I'll just add that in a minute), and the existing FilesystemStore would do the trick. So, let's go with s3://regular/public/url...

For the last question, I'd throw a custom exception (add it to exceptions.rb), something like S3NotConfigured, which will in turn mark the work unit (and the job) as failed, all by itself.
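
As a sketch of that exception, following the custom-error pattern (the base Error class name is assumed):

```ruby
module CloudCrowd
  class Error < RuntimeError; end # assumed base class in exceptions.rb

  # Raised when a worker encounters an s3:// input but no S3 credentials
  # are present in the configuration. Letting it propagate marks the
  # work unit, and therefore the job, as failed.
  class S3NotConfigured < Error; end
end
```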

delagoya commented 15 years ago

OK, the repo seems to be a fast-moving target, so I am going to create a branch "s3_inputs" to implement this and send you the merge request later today. Should only take me a few minutes to do.

Will add a wiki page for defining inputs which you can keep separate or merge into the job_api page as you like.

delagoya commented 14 years ago

Closing this issue. I think pre-authenticated URLs are the way to go.