fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

error when trying to copy large files which were written to GCS, s3fs multi-part upload to blame? #628

hqm commented 2 years ago

Version of s3fs being used (s3fs --version): s3fs==2021.7.0

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)

Kernel information (uname -r): 5.8.0-44-generic

GNU/Linux Distribution, if applicable (cat /etc/os-release)

    NAME="Ubuntu"
    VERSION="20.04.1 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.1 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal

Details about issue

My app uses s3fs in Python to write moderately large files (12-25 MB): it opens them via s3.open() on an S3FileSystem client, calls write(), and then closes the file.

The files are written to GCS, but if I try to copy them to another bucket, or within the same bucket, using 'gsutil cp', I get this error

BadRequestException: 400 Rewriting objects created via Multipart Upload is not implemented yet. As a workaround, you can use compose to overwrite the object (by specifying leela-yoyodyne-dev/cameras/M2/A1/2022-06-09/21/2022-06-09T21:39:35.062Z.obj.jsonl as both the source and output of compose) prior to rewrite.

Is there some way to avoid creating multipart files?

I create the client like this:

    self.s3 = S3FileSystem(
        anon=False,
        key=access_key_id,
        secret=secret_access_key,
        client_kwargs={
            'endpoint_url': endpoint
        })

Are there extra options when creating the S3FileSystem that would make it avoid multipart uploads and prevent this issue? Or is there some other workaround to get non-composite objects created?
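One possible workaround, not from this thread and assuming s3fs only starts a multipart upload once the buffered data exceeds the file's block size (falling back to a single PUT otherwise): open the file with a block_size larger than the biggest file you write. The variables below reuse the names from the snippet above; the 200 MB figure is an arbitrary illustration.

    from s3fs import S3FileSystem

    # Same construction as above; key, secret and endpoint are placeholders.
    fs = S3FileSystem(
        anon=False,
        key=access_key_id,
        secret=secret_access_key,
        client_kwargs={'endpoint_url': endpoint},
    )

    # block_size well above the ~25 MB files being written, so the whole file
    # should fit in the write buffer and be flushed as one plain PUT on close
    # (assumed behaviour; smaller block sizes would trigger multipart parts).
    with fs.open(path, 'wb', block_size=200 * 1024 * 1024) as f:
        f.write(my_data)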

martindurant commented 2 years ago

Does cp with gcsfs work for these files?

hqm commented 2 years ago

> Does cp with gcsfs work for these files?

How would I test that?

martindurant commented 2 years ago

Sorry, I'm not actually sure whether this is an issue in gcsfs or s3fs. If you are opening with FUSE, then you should be able to use your system's copy (CLI or file browser) within one mount point or multiple. Can you list out exactly the steps you took before getting this error?

hqm commented 2 years ago

Yes here are the steps:

1) We ran our application which wrote out a ~20 MByte file, using our s3 client

        self.fs = S3FileSystem(
            anon=False,
            key=access_key_id,
            secret=secret_access_key,
            client_kwargs={
                'endpoint_url': endpoint
            })

        with self.fs.open(path, mode) as f:
            f.write(my_data)

2) We then tried to use the "gsutil" command line tool supplied by Google Cloud SDK to copy the file from one bucket to another in the cloud

    gsutil cp gs://leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl gs://leela-manhattan-dev/tmp/foo.jsonl
    Copying gs://leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl...
    BadRequestException: 400 Rewriting objects created via Multipart Upload is not implemented yet. As a workaround, you can use compose to overwrite the object (by specifying leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl as both the source and output of compose) prior to rewrite.

Note, however, that the gsutil command has no problem copying the file down to my local disk; it is just the cloud-to-cloud copy case that gives this error.
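The 400 error itself suggests an interim fix: compose the object onto itself so GCS stores it as a regular, non-multipart object before copying. A minimal sketch using the google-cloud-storage client (not part of this thread; bucket and object names are taken from the error output above):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket('leela-yoyodyne-dev')
    blob = bucket.blob('cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl')

    # Compose the object from itself, as the error message recommends, so the
    # result can subsequently be rewritten/copied by gsutil cp.
    blob.compose([blob])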

martindurant commented 2 years ago

I am surprised that you can use / are using s3fs to write to GCS, though I suppose it's possible. The sister project gcsfs was designed specifically for GCS and should be more fully featured. In either case, s3fs and gcsfs both support an fs.cp method (copy between remote locations, within or between buckets) and an fs.get method (copy from remote to local). That should be all you need. I would have a hard time figuring out why gsutil isn't happy :|
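A rough sketch of that suggestion, assuming gcsfs is installed and that the default Google credentials have access to both buckets (paths reuse the gsutil example above):

    import gcsfs

    fs = gcsfs.GCSFileSystem()  # picks up default Google credentials

    src = 'leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl'

    # Copy between locations on the remote (here, bucket to bucket)
    fs.cp(src, 'leela-manhattan-dev/tmp/foo.jsonl')

    # Copy from the remote down to local disk
    fs.get(src, 'foo.jsonl')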

hqm commented 2 years ago

Ah, I see. We will try using gcsfs instead. Do you know if its API is very compatible with the s3fs API?

martindurant commented 2 years ago

Absolutely yes, that's the design. There are some small differences, particularly that credentials are supplied in Google JSON format rather than as a key/secret pair.
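For example, a sketch assuming a service-account JSON key file (the key-file path and object path below are made up):

    import gcsfs

    # gcsfs takes Google-style credentials via `token` (here a service-account
    # JSON key file) instead of s3fs's key/secret pair.
    fs = gcsfs.GCSFileSystem(token='/path/to/service-account.json')

    with fs.open('leela-yoyodyne-dev/tmp/example.jsonl', 'wb') as f:
        f.write(b'{"hello": "world"}\n')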