fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Moving file atomic/single operation way? #104

Closed fsck-mount closed 7 years ago

fsck-mount commented 7 years ago

I'm using s3fs and fastparquet to write parquet files to s3. I've configured presto to read from s3 using hive external table.

The problem is that presto will read the file while fastparquet is still writing it, so it fails with an "invalid parquet file" error. To work around this, I write to a temporary path first. Let's say I'm supposed to write to:

import fastparquet
import s3fs

fs = s3fs.S3FileSystem()
opener = fs.open
filename = 'bucket_name/account_type/yr=2017/mn=10/dt=8/19/de86d8ed-7447-420f-9f25-799412e377adparquet.json'
# let's write to a temp file first, then move it into place (df is an existing DataFrame)
tmp_file = filename.replace('account_type', 'tmp-account_type')
fastparquet.write(tmp_file, df, open_with=opener)
fs.mv(tmp_file, filename)

But even with this approach, presto occasionally still reads an incomplete file. How is this possible? How can we make this atomic/isolated with s3fs?

martindurant commented 7 years ago

Although named after the posix mv command, for S3 this is actually copy-and-delete. How this is implemented internally within S3, I don't know, but it does not in general promise immediate consistency, so I am not totally surprised that either presto is reading the file before it is all available, or possibly that mv is copying a file which is not yet available. I am open to suggestions, but I can't suggest more than adding sleep statements into your workflow.

As a side note, what does presto offer you that you are not able to accomplish in python, since the data already passes through the memory of your machine(s) as pandas dataframes?

fsck-mount commented 7 years ago

@martindurant thanks for the suggestion. But I couldn't understand how a sleep would be used here.

Coming to the presto part: we have clickstream data flowing into our system. We write it to JSON files, then convert them to parquet and store them in s3. So we use presto heavily to run multiple queries on previous data along with the currently incoming data.

When we query existing data, there is no problem. But if we query the current hour's data, it can cause a problem, and it happens rather rarely. Predicting this is almost impossible.

martindurant commented 7 years ago

Just in case calling mv too soon after writing has an effect, I would put a time.sleep() between the two calls. The probability of this fixing things isn't enormous, but it's worth a go as the simplest thing. Are you certain that the files are indeed complete and valid? We have seen with GCS cases where some files were truncated at times of heavy usage.
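Concretely, something like this (a rough sketch reusing the names from your snippet above; the pause lengths are arbitrary guesses):

import time
import fastparquet
import s3fs

fs = s3fs.S3FileSystem()
# tmp_file, filename and df as in your snippet above
fastparquet.write(tmp_file, df, open_with=fs.open)
time.sleep(5)              # arbitrary pause, in case the new key is not yet visible
fs.mv(tmp_file, filename)
time.sleep(5)              # and another before presto is pointed at the final key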

fsck-mount commented 7 years ago

Maybe, I guess so. I'll try adding a sleep and check.

As per my understanding and a simple test, the files are complete. As soon as presto threw the error, I checked with fastparquet, which was able to read the file. So I guess presto is failing exactly at the time the file is being copied.

martindurant commented 7 years ago

How big are the files? The REST documentation suggests behaviour may be different above 5GB than below: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html

fsck-mount commented 7 years ago

No, the files are hardly around 500 MB (JSON file). To be precise, it never crosses 490 MB on disk, and 50 MB in s3 after converting to a parquet file.

fsck-mount commented 7 years ago

@martindurant I believe we are doing copy_object only when we use s3fs mv. For reference: presto-groups

martindurant commented 7 years ago

Yes, we are calling the server's copy object command, not downloading and rewriting the data - that would be very expensive!
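Roughly, fs.mv(tmp_file, filename) amounts to the following at the raw API level (a boto3 sketch with hypothetical key names, not the actual s3fs internals):

import boto3

s3 = boto3.client('s3')

# server-side copy: the bytes never leave S3
s3.copy_object(
    Bucket='bucket_name',
    Key='account_type/part.parquet',                        # hypothetical destination key
    CopySource={'Bucket': 'bucket_name',
                'Key': 'tmp-account_type/part.parquet'},    # hypothetical source key
)
s3.delete_object(Bucket='bucket_name', Key='tmp-account_type/part.parquet')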

fsck-mount commented 7 years ago

@martindurant I'm a little confused about block_size.

It looks like our file uploads are all multipart uploads. If that is the case, the error seems quite natural, as at some point in time there is a chance for partial data to be present.

Correct me, if I'm wrong.

martindurant commented 7 years ago

Correct that they are multi-part uploads, but the final key should not be created until the multi-part upload is finalised.
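For context, the multipart lifecycle looks roughly like this (a boto3 sketch with hypothetical names; s3fs wraps the same calls). The key only appears once the final complete call succeeds:

import boto3

s3 = boto3.client('s3')
bucket, key = 'bucket_name', 'account_type/part.parquet'   # hypothetical names

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for num, chunk in enumerate([b'x' * 5 * 2**20, b'y' * 1024], start=1):  # parts must be >=5MB, except the last
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=num,
                          UploadId=mpu['UploadId'], Body=chunk)
    parts.append({'PartNumber': num, 'ETag': resp['ETag']})

# until this call returns, a GET or LIST on `key` does not see any (partial) object
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})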

fsck-mount commented 7 years ago

Yes, you are right. I just ran a simple test: the file is not created until the upload is done. Not sure what is causing this issue.

fsck-mount commented 7 years ago

@martindurant what do you mean by "the upload is finalised"?

Because, if you remember, I raised an issue in fastparquet. I am just trying to understand what "finalised" means, and how the invalid parquet file (s3 key) was created when the process was killed (signal 9) due to a memory issue. I am trying to figure out whether there is any relation between that invalid parquet file, final key creation, and the current issue I have.

fsck-mount commented 7 years ago

@martindurant

Thanks for your time and patience. I'm overwriting the existing files in s3 when running a one-minute cron job. This overwriting is causing the issue when reading from presto. I think we can close this issue. Would you like to share your input on avoiding s3 consistency issues when overwriting?

martindurant commented 7 years ago

Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Amazon claims never to give partial or corrupted data: you either get the old version or the new. That could be enough to break presto, if the versions are not compatible. Another failure mode would be one action getting the file list and the next downloading data, with the new file's size differing from the old one's. If you write your own code, you can check the generation of a key to make sure it hasn't changed, or download a specific generation (old data may still be available), or be sure to match each file of a batch by time-stamp. I cannot, however, give any advice on how you might implement any of this for presto, sorry.
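As a rough illustration of the "check the generation of a key" idea in your own code (a boto3 sketch with hypothetical names; presto itself would need something equivalent):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
bucket, key = 'bucket_name', 'account_type/part.parquet'   # hypothetical names

etag = s3.head_object(Bucket=bucket, Key=key)['ETag']
try:
    # only succeeds if the key still carries the same ETag, i.e. the same generation
    body = s3.get_object(Bucket=bucket, Key=key, IfMatch=etag)['Body'].read()
except ClientError:
    # the object was overwritten between the HEAD and the GET; retry or skip this file
    body = None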