DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

AWSJobStore: Retry if version of key is missing #121

Closed hannes-ucsc closed 9 years ago

hannes-ucsc commented 9 years ago

I am a little puzzled as to why we haven't seen this yet. Due to the eventual consistency of S3, there is the chance that a new version of a key is reported as missing on one connection even though another connection has just created it.

Write a test program that creates new versions of a key through one connection and immediately attempts to download that version through another connection. At some point we should get a "No such version" error.
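A probe along those lines might look like this (a sketch, assuming boto3 and a pre-existing versioning-enabled bucket; the bucket name and the `is_no_such_version` helper are illustrative, not Toil code):

```python
import uuid

BUCKET = "toil-consistency-probe"  # assumption: a versioning-enabled bucket we own


def is_no_such_version(error_response):
    """Return True if a botocore error response dict reports a missing version."""
    return error_response.get("Error", {}).get("Code") == "NoSuchVersion"


def probe(iterations=1000):
    import boto3
    from botocore.exceptions import ClientError

    # Two independent connections, as described above.
    writer = boto3.client("s3")
    reader = boto3.client("s3")
    key = "probe/" + uuid.uuid4().hex
    for i in range(iterations):
        # Each PUT to a versioned bucket creates a new version and returns its ID.
        version_id = writer.put_object(
            Bucket=BUCKET, Key=key, Body=str(i).encode())["VersionId"]
        try:
            # Immediately read back that exact version through the other connection.
            reader.get_object(Bucket=BUCKET, Key=key, VersionId=version_id)
        except ClientError as e:
            if is_no_such_version(e.response):
                print("Reproduced after %d iterations" % (i + 1))
                return
            raise
    print("No consistency error in %d iterations" % iterations)


if __name__ == "__main__":
    probe()
```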

If we can reproduce this behavior, we need to add code to AWSJobStore that handles it.

cket commented 9 years ago

In addition to eventual consistency, S3 guarantees "read-after-write" consistency in some regions, including ours. This means any newly created object/file is immediately readable, but changes such as overwriting an existing file are still only eventually consistent.

cket commented 9 years ago

It looks like all of our S3 methods might fall under this category. I don't see any that edit a file in place on S3; even the update methods create a new key and delete the old one. My tests in S3 tentatively support this: I haven't yet been able to provoke any kind of consistency error.

hannes-ucsc commented 9 years ago

Could you cite some of your sources?

hannes-ucsc commented 9 years ago

And I don't think we delete keys or create new ones in the update methods. We simply PUT a new version. I'd really love to know how exactly the consistency constraints relate to versioning.

cket commented 9 years ago

Source, with links to Amazon announcements: http://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html

It's also covered in the S3 FAQ: http://aws.amazon.com/s3/faqs/

The update method uses the upload method, which explicitly creates a new key if the file is smaller than our s3_part_size; that explains why my small tests didn't force consistency errors. For larger files it calls initiate_multipart_upload with the key name, which may also return a new key. I will investigate further.
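The size-based branch described above could be sketched roughly like this (an illustration, not Toil's actual upload code; it assumes boto3's client API and that VersionId is present in responses because the bucket is versioned):

```python
def upload(s3, bucket, key, data, part_size=50 * 1024 * 1024):
    """Sketch of a size-based upload branch (illustrative, not Toil's code).

    Files smaller than part_size go through a single PUT, which materialises
    the key (or a new version of it) in one request; larger files go through
    the multipart API, which also ends by materialising a new key/version.
    """
    if len(data) < part_size:
        # Single PUT: one request creates the new key/version.
        return s3.put_object(Bucket=bucket, Key=key, Body=data)["VersionId"]
    # Multipart: upload part_size chunks, then complete the upload.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    for n, off in enumerate(range(0, len(data), part_size), start=1):
        etag = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=n,
                              Body=data[off:off + part_size])["ETag"]
        parts.append({"PartNumber": n, "ETag": etag})
    return s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts})["VersionId"]
```

Either path ends in a brand-new key/version rather than an in-place edit, which is why read-after-write consistency would be the relevant guarantee for both.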

As for versioning, I haven't found anything explicit about its consistency guarantees. However, with versioning enabled, old objects aren't actually overwritten (so that they can be recovered if necessary), which may mean that 'updates' to versioned buckets are treated as new objects and thus covered by read-after-write. It seems like a long shot, but I'll look into it.

hannes-ucsc commented 9 years ago

Having read the material you cited here, I'm inclined to punt on this one. I am certain that we will not read stale versions, since we are using versioning. At worst we will get a "version does not exist" error, which will fail the jobTree. But since we don't know what the exact error codes are, we can't write code to handle them.
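For the record, if we ever do confirm the error code, the fix in AWSJobStore would have the shape of a bounded retry around the versioned read, something like this sketch (the "NoSuchVersion" match and the backoff parameters are assumptions, not confirmed behaviour):

```python
import time


def retry_missing_version(fn, attempts=5, delay=1.0, backoff=2.0,
                          is_transient=lambda e: "NoSuchVersion" in str(e)):
    """Call fn(), retrying with exponential backoff while is_transient(e)
    says the failure looks like an eventually-consistent 'missing version'.

    Re-raises immediately on non-transient errors, and re-raises the last
    error once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == attempts - 1 or not is_transient(e):
                raise
            time.sleep(delay)
            delay *= backoff
```

Usage would be to wrap the versioned GET in AWSJobStore, e.g. `retry_missing_version(lambda: read_version(key, version_id))` for whatever read helper we end up with; until we know the exact error code, `is_transient` is guesswork, which is why punting is reasonable.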