I followed the flow of logic in s3fs.cpp and could see that the libcurl call made by s3fs_mknod returns status 200, i.e. the object is created, but when the error occurs it's because the subsequent libcurl call made by s3fs_getattr (checking the existence of the object) returns status 404.
Note these two operations (and more) are triggered by a single dd command.
Sure enough, running "ls /mnt/s3/tw-msg/0/0/28" moments later works fine. The file exists, but is empty.
I suspect a race condition caused by S3 propagation delay, where the S3 server handling s3fs_getattr's HEAD request is (occasionally) not yet aware that the object was created by s3fs_mknod's PUT request to a different server just a few milliseconds earlier.
Here's a brief discussion of S3 propagation delay:
http://developer.amazonwebservices.com/connect/message.jspa?messageID=106354
Most discussion of S3 propagation delay concerns modifications made to existing objects. I'm only speculating that this issue also applies to the creation of new objects.
Original comment by pgarp...@gmail.com
on 22 Jul 2009 at 7:03
I have just now seen the "of=$DN/$M" variant also report the "No such file or directory" error. I have not analyzed the s3fs call sequence for this variant; it may even be the same as the ">$DN/$M" used in the attached script. The different error messages may just reflect a 404 being reported for the newly created object in the various libcurl requests following s3fs_mknod's PUT. The error I scrutinized and described above was a 404 in the first s3fs_getattr.
I made a naive attempt to tolerate the error by adding a loop to s3fs_mknod calling s3fs_getattr to wait for the object to exist (roughly the loop sketched below). Even so, the very next call to s3fs_getattr (in the normal sequence of creating a file) occasionally produces a 404. This is another reason I suspect S3 internal propagation delay.
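For reference, here is roughly what that naive loop looks like. This is a sketch only, not the actual s3fs code or my actual patch: the signatures follow the usual FUSE conventions, and put_new_object() plus the retry count and delay are placeholders.

    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <errno.h>

    // Hypothetical sketch -- put_new_object() stands in for the PUT logic
    // that s3fs_mknod already performs.
    static int s3fs_getattr(const char *path, struct stat *stbuf);
    static int put_new_object(const char *path, mode_t mode);

    static int s3fs_mknod(const char *path, mode_t mode, dev_t rdev) {
      int result = put_new_object(path, mode);
      if (result != 0)
        return result;

      // Poll with HEAD (via s3fs_getattr) until the new object is visible,
      // giving up after a fixed number of attempts.
      struct stat st;
      for (int attempt = 0; attempt < 10; attempt++) {
        if (s3fs_getattr(path, &st) == 0)
          return 0;                    // object is now visible
        usleep(100 * 1000);            // wait 100 ms before the next HEAD
      }
      return -EIO;                     // still 404 after all retries
    }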
Possibly related issues:
http://code.google.com/p/s3fs/issues/detail?id=44
http://code.google.com/p/s3fs/issues/detail?id=47
mitchell.penrod: "The issue seems to be with s3fs returning 0 as the st_mode when the file has no mode set via the amz-meta headers and when the content-type is blank."
Based on mitchell.penrod's statement I added a ".jpg" extension to the filenames. That run created over 24000 files before failing with the "No such file or directory" error. Perhaps just a coincidence.
Original comment by pgarp...@gmail.com
on 22 Jul 2009 at 6:11
Hi - indeed, Amazon S3 eventual consistency is undoubtedly what you're running into here; other users and I have seen this before.
Original comment by rri...@gmail.com
on 22 Jul 2009 at 7:08
I see.
http://www.google.com/search?q=site%3Adeveloper.amazonwebservices.com+s3+eventual+consistency
returns this:
http://developer.amazonwebservices.com/connect/message.jspa?messageID=38373
Colin asks: "1. Assume an object X does not exist. If I PUT X and then GET X, am I guaranteed to get X back instead of a 404 error?"
Ami@AWS answers: "no".
Also this:
http://developer.amazonwebservices.com/connect/click.jspa?searchID=-1&messageID=104149
"endpoints that have incorrectly reported a 404 (over ~5m after the PUT)"
Yikes. I can work around this by retrying create/write in the application layer until there is no error, and retrying at read time will also be necessary, because even if the PUT and a few GETs succeed there is no telling when S3 will eventually be consistent for all subsequent GETs.
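A minimal sketch of that application-layer retry, for files on the s3fs mount. The helper name, attempt count, and delay are made up for illustration; the real application would tune these.

    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    // Hypothetical application-level helper: keep trying to open a just-created
    // (or about-to-be-read) file until it becomes visible or we run out of tries.
    int open_with_retry(const char *path, int flags, int max_attempts) {
      for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = open(path, flags);
        if (fd >= 0)
          return fd;               // the object is visible; hand it to the caller
        if (errno != ENOENT)
          return -1;               // unrelated error: don't mask it with retries
        usleep(200 * 1000);        // "No such file or directory": wait and retry
      }
      errno = ENOENT;
      return -1;                   // still not visible after all attempts
    }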
This S3 characteristic largely defeats the purpose of using FUSE to make it look like an ordinary file system. I'm afraid most programmers are not aware of this issue, or assume s3fs deals with it somehow. Not to criticize you, because I don't see any way s3fs could deal with it, but perhaps you could mention this at http://code.google.com/p/s3fs/wiki/FuseOverAmazon under Limitations. Something like:
Due to S3's "eventual consistency" limitations, file creation can and will occasionally fail. Even after a successful create, subsequent reads can fail for an indeterminate time, even after one or more successful reads. Create and read enough files and you will eventually encounter this failure. This is not a flaw in s3fs, and it is not something a FUSE wrapper like s3fs can work around. The retries option does not address this issue. Your application must either tolerate or compensate for these failures, for example by retrying creates or reads. For details see http://code.google.com/p/s3fs/issues/detail?id=61
Just a suggestion. Thanks Randy.
Paul Gardner
Original comment by pgarp...@gmail.com
on 22 Jul 2009 at 9:43
Hi Paul,
I do see this with the script you provided. It would be nice to find a way to mitigate this, even if it's a hit on performance -- personally, I'm more interested in a reliable system vs. one that is fast.
What about implementing a semaphoring system within the application? For example, upon creation of the object, lock access to it until it actually appears.
...so after the mknod, loop on a read until it returns success (or times out).
Dan
Original comment by dmoore4...@gmail.com
on 21 Dec 2010 at 9:43
"loop on a read until it returns success"
The problem is that when you talk to S3 you're talking to a distributed system.
You can make a read request that gets routed to a particular server, which
returns success, and then at some indeterminate amount of time (dt) in the
future make another request, which invisibly to you gets routed to a different
server, which has not yet heard of your new object and returns failure.
There is no limit on dt. In practice 99% of your objects may be fully
propagated and consistently readable 100msec later, 99.9% after 1sec, 99.99%
after 10sec, and so on. There is no way to know definitively when S3 has
finally got all it's servers into a consistent state.
What would solve the problem is a call that reports whether a given object is
fully propagated. You could then loop on that after creating an object.
Original comment by pgarp...@gmail.com
on 22 Dec 2010 at 1:43
Gotcha, that clears things up. Virtually nothing we can do other than put in a (relatively) long wait.
I agree with your recommendation; let users be aware....
Original comment by dmoore4...@gmail.com
on 22 Dec 2010 at 2:18
This article is intriguing:
http://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html
It refers to a new AWS S3 feature (Dec 9, 2010) of "read-after-write consistency" for new objects.
Original comment by dmoore4...@gmail.com
on 22 Dec 2010 at 11:44
Swidler writes: "Read-after-write consistency for AWS S3 is only available in the US-west and EU regions, not the US-Standard region."
If so, then the script I provided above should not fail using s3fs as-is, as long as the bucket being used was created in the US-west or EU region.
Original comment by pgarp...@gmail.com
on 23 Dec 2010 at 12:24
We'll see. I just created a US-west bucket and am running your script.
% date
Wed Dec 22 18:19:41 MST 2010
To note, the eventual consistency issue is not totally mitigated. A "read-after-delete" is not guaranteed to return a "not found" message, as it should.
Original comment by dmoore4...@gmail.com
on 23 Dec 2010 at 1:24
s3test.sh ended up erroring out after writing 24,500+ files, though not due to a "file not found" but rather to too many retries upon a network timeout.
I'm pretty much convinced. The US-west bucket doesn't have the "read-after-write" issue.
Created a wiki page and included info on the main page addressing this issue.
Original comment by dmoore4...@gmail.com
on 24 Dec 2010 at 12:22
I'm sure this has been seen, but the official FAQ lists read-after-write consistency as only available in certain regions, and not in all US regions.
From:
http://aws.amazon.com/s3/faqs/#What_data_consistency_model_does_Amazon_S3_employ
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo) Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the US Standard Region provide eventual consistency.
Original comment by digital...@gmail.com
on 5 May 2011 at 2:05
Perhaps it would be good for s3fs to offer a caching mechanism, so that when a file is added to a mounted directory it is cached locally and returned from that cache for a short period of time, until S3 can be expected to be in a consistent state.
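One possible shape for that idea, purely illustrative and not actual s3fs code (the names and TTL below are assumptions): remember recently created paths and their attributes for a short window, and let getattr answer from that table before asking S3.

    #include <ctime>
    #include <map>
    #include <string>
    #include <sys/stat.h>

    // Hypothetical sketch: a small table of recently created objects so that
    // the getattr issued right after a create can be answered locally instead
    // of with a HEAD request that may still return 404.
    struct CachedEntry {
      struct stat st;
      time_t      created;
    };

    static std::map<std::string, CachedEntry> recent_creates;
    static const time_t CACHE_TTL_SECONDS = 30;   // assumed consistency window

    // Record a path right after its PUT succeeds.
    void remember_new_object(const std::string &path, const struct stat &st) {
      CachedEntry entry = { st, std::time(NULL) };
      recent_creates[path] = entry;
    }

    // Consult the table before issuing a HEAD; returns true if answered locally.
    bool lookup_recent_object(const std::string &path, struct stat *out) {
      std::map<std::string, CachedEntry>::iterator it = recent_creates.find(path);
      if (it == recent_creates.end())
        return false;
      if (std::time(NULL) - it->second.created > CACHE_TTL_SECONDS) {
        recent_creates.erase(it);                 // stale entry: fall back to S3
        return false;
      }
      *out = it->second.st;
      return true;
    }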
Original comment by fisher1...@gmail.com
on 20 Mar 2012 at 5:42
Original issue reported on code.google.com by pgarp...@gmail.com on 22 Jul 2009 at 6:37
Attachments: