fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Extra bytes on files uploaded using s3fs.open(..., 'wb') #334

Open · Liam3851 opened 4 years ago

Liam3851 commented 4 years ago

I upload about 80 GB of files nightly (about 60-130 MB each) to S3 using s3fs. This week, I've found that about 1% of these files end up with some amount of extra data, between roughly 8 KB and 15 KB, based on the size reported by S3 compared to the size reported by the FTP server. The files are gzips and are thus corrupted and unreadable. When this occurs, rerunning the upload for the same file almost always succeeds and yields valid data.
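
For reference, the kind of integrity check that flags these files looks roughly like this (an illustrative helper only; it assumes the object has been downloaded locally first):

import gzip

def gzip_is_valid(path):
    # Stream through the whole member; truncated or garbage-padded gzips raise here.
    try:
        with gzip.open(path, 'rb') as f:
            while f.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError):
        return False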

This is odd because my client code has not changed in 5 months and this issue just started to occur this week. I'm unfortunately not sure whether it's S3FS itself, S3 behaving oddly, some interaction between the two, or something else entirely. Any help would be appreciated, and I'm happy to do some experiments to narrow it down if you can recommend any to try.

As a complication the data I'm moving to S3 is ultimately coming from an FTP server and so I'm moving the data with the following code:

with fs.open(s3_path, 'wb') as fh:
    # each FTP data block received is written straight into the s3fs file buffer
    ftp_conn.retrbinary(f'RETR {remote_path}', fh.write)

After some experimentation, I do know some things that fix the problem (in particular, taking s3fs out of the loop entirely, as noted below).

I tested two package configurations and got the same behavior:

Config 1: s3fs 0.4.0, boto3 1.10.39, botocore 1.13.39, s3transfer 0.2.1

Config 2: s3fs 0.4.2, boto3 1.14.12, botocore 1.17.12, s3transfer 0.3.3

Again, I find this puzzling; I'm not even really sure it's an issue with s3fs, since this code has been working for so long. But the fact that cutting s3fs out of the loop eliminates the issue leads me to think it may be related. I'd really appreciate any suggestions for how I can help investigate.

Environment:

martindurant commented 4 years ago

Have you tried ftputil with s3fs by any chance? In theory, you should be able to assert the length or md5 of the uploaded file with S3FileSystem(..., consistency=); if that check passes, it would suggest that s3fs was (probably) given the extra bytes to write rather than adding them itself.
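
Something along these lines, for instance (a minimal sketch: the host, paths, and the value passed to consistency= are placeholders/assumptions, so check the docs for what your s3fs version accepts):

import ftplib
import s3fs

remote_path = 'data/example.gz'   # hypothetical paths for illustration
s3_path = 'my-bucket/example.gz'

# Ask s3fs to verify each completed upload (the value here is an assumption).
fs = s3fs.S3FileSystem(consistency='size')

ftp_conn = ftplib.FTP('ftp.example.com')
ftp_conn.login()

with fs.open(s3_path, 'wb') as fh:
    ftp_conn.retrbinary(f'RETR {remote_path}', fh.write)

# Independent check: what the FTP server reports vs. what S3 actually stored.
expected = ftp_conn.size(remote_path)   # may require binary (TYPE I) mode on some servers
actual = fs.info(s3_path)['size']
assert expected == actual, f'size mismatch: ftp={expected}, s3={actual}'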

Liam3851 commented 4 years ago

Thanks Martin. At your suggestion I tried ftputil + s3fs using the following code. It gave the same results as ftplib + s3fs (some files too big):

import shutil
import ftputil

with ftputil.FTPHost(...) as host:
    with host.open(remote_path, 'rb') as rfh:
        with fs.open(s3_path, 'wb') as wfh:
            shutil.copyfileobj(remote_path, s3_path)

I haven't yet tried the consistency= kwarg. I'll give that a try and see what comes up.

martindurant commented 4 years ago

Shouldn't it be shutil.copyfileobj(rfh, wfh)?

Also, instead of shutil.copyfileobj, you might want to be low level:

while True:
    chunk = rfh.read(10*2**20)
    if not chunk:
        break
    wfh.write(chunk)

Liam3851 commented 4 years ago

Sorry, yes, that was a typo when I copied the code over.

I'll try the low-level loop instead of shutil. IIRC shutil's chunk size is 16 KB and ftplib's default block size is just 8 KB, and I see you're reading 10 MiB chunks, so I imagine this will behave somewhat differently (perhaps more efficiently).
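
Concretely, I'm thinking of something like the following, putting ftputil together with that chunked loop (a sketch only, with placeholder host, credentials, and paths):

import ftputil
import s3fs

remote_path = 'data/example.gz'   # placeholders for illustration
s3_path = 'my-bucket/example.gz'

fs = s3fs.S3FileSystem()

with ftputil.FTPHost('ftp.example.com', 'user', 'password') as host:
    with host.open(remote_path, 'rb') as rfh:
        with fs.open(s3_path, 'wb') as wfh:
            # copy in 10 MiB chunks instead of shutil.copyfileobj's much smaller default buffer
            while True:
                chunk = rfh.read(10 * 2**20)
                if not chunk:
                    break
                wfh.write(chunk)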

martindurant commented 3 years ago

Did you manage to dig anything up?