hqm opened this issue 2 years ago (Open)
Does `cp` with gcsfs work for these files?
How would I test that?
Sorry, I'm not actually sure whether this is an issue in gcsfs or s3fs. If you are opening with FUSE, then you should be able to use your system's copy (CLI or file browser) within one mount point or multiple. Can you list out exactly the steps you took before getting this error?
Yes, here are the steps:

1) We ran our application, which wrote out a ~20 MByte file using our s3 client:

```python
from s3fs import S3FileSystem

# Client pointed at a GCS endpoint through its S3-compatible API
self.fs = S3FileSystem(
    anon=False,
    key=access_key_id,
    secret=secret_access_key,
    client_kwargs={
        'endpoint_url': endpoint
    })

with self.fs.open(path, mode) as f:
    f.write(my_data)
```
2) We then tried to use the `gsutil` command-line tool supplied by the Google Cloud SDK to copy the file from one bucket to another in the cloud:

```
gsutil cp gs://leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl gs://leela-manhattan-dev/tmp/foo.jsonl
Copying gs://leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl...
BadRequestException: 400 Rewriting objects created via Multipart Upload is not implemented yet. As a workaround, you can use compose to overwrite the object (by specifying leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl as both the source and output of compose) prior to rewrite.
```
Note, however, that the gsutil command has no problem copying the file down to my local disk; it is only the cloud-to-cloud copy case that gives this error.
I am surprised that you can use/are using s3fs to write to GCS. I suppose it's possible. The sister project gcsfs was designed specifically for GCS and should be more fully featured. In either case, s3fs and gcsfs both support an `fs.cp` method (copy between locations on the remote store, within or between buckets) and an `fs.get` method (copy from remote to local). That should be all you need. I would have a hard time figuring out why gsutil isn't happy :|
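For reference, a minimal sketch of what those calls look like with the filesystem object from your steps above (the source object is the one from your gsutil command; the destination paths are just examples):

```python
from s3fs import S3FileSystem

# Same construction as in the steps above
fs = S3FileSystem(
    anon=False,
    key=access_key_id,
    secret=secret_access_key,
    client_kwargs={'endpoint_url': endpoint})

src = "leela-yoyodyne-dev/cameras/M2/A1/2022-06-10/18/2022-06-10T18:50:35.122Z.pose.jsonl"

# Copy between locations on the remote store (within or between buckets)
fs.cp(src, "leela-manhattan-dev/tmp/foo.jsonl")

# Copy from the remote store down to local disk
fs.get(src, "/tmp/foo.jsonl")
```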
Ah, I see. We will try using gcsfs instead. Do you know if its API is very compatible with the s3fs API?
Absolutely yes, that's the design. There are some small differences, particularly that credentials are supplied in Google JSON format rather than as a key/secret pair.
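A minimal sketch of the gcsfs equivalent, assuming a service-account JSON key file; the project name, key-file path, and object paths below are placeholders rather than values from this thread:

```python
import gcsfs

# token may be a path to a service-account JSON key file, or values like
# 'google_default' / 'cloud' to pick up ambient credentials.
fs = gcsfs.GCSFileSystem(
    project="my-project",                    # placeholder
    token="/path/to/service-account.json")   # placeholder

# Write an object (hypothetical path), then copy it between buckets,
# using the same call shape as the s3fs example above.
with fs.open("leela-yoyodyne-dev/tmp/example.jsonl", "wb") as f:
    f.write(b'{"hello": "world"}\n')

fs.cp("leela-yoyodyne-dev/tmp/example.jsonl",
      "leela-manhattan-dev/tmp/example.jsonl")
```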
Version of s3fs being used (`s3fs --version`): s3fs==2021.7.0
Version of fuse being used (`pkg-config --modversion fuse`, `rpm -qi fuse`, `dpkg -s fuse`): not given
Kernel information (`uname -r`): 5.8.0-44-generic
GNU/Linux Distribution, if applicable (`cat /etc/os-release`): Ubuntu 20.04.1 LTS (Focal Fossa)
Details about issue
My app uses s3fs in Python to write moderately large files (12-25 MBytes), using the S3FileSystem client with calls to s3.open(), write(), and then close() on the file.
The files are written to GCS, but if I try to copy them to another bucket, or within the same bucket, using `gsutil cp`, I get this error:

```
BadRequestException: 400 Rewriting objects created via Multipart Upload is not implemented yet. As a workaround, you can use compose to overwrite the object (by specifying leela-yoyodyne-dev/cameras/M2/A1/2022-06-09/21/2022-06-09T21:39:35.062Z.obj.jsonl as both the source and output of compose) prior to rewrite.
```
Is there some way to avoid creating multipart files?
I create a client like this:

```python
self.s3 = S3FileSystem(
    anon=False,
    key=access_key_id,
    secret=secret_access_key,
    client_kwargs={
        'endpoint_url': endpoint
    })
```

Are there some extra options when creating the S3FileSystem so that it would not use multipart uploads, to prevent this issue? Or some other workaround to get non-composite objects created?
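Not a confirmed fix, but one thing we may try based on how s3fs buffers writes: if the `block_size` passed to `open()` is larger than the whole file, s3fs should keep everything in memory and upload it with a single PUT on close instead of starting a multipart upload. A minimal sketch, with the 64 MB value chosen arbitrarily to exceed our 12-25 MByte files:

```python
from s3fs import S3FileSystem

fs = S3FileSystem(
    anon=False,
    key=access_key_id,
    secret=secret_access_key,
    client_kwargs={'endpoint_url': endpoint})

# Assumption: if block_size exceeds the file size, no multipart upload is
# ever initiated and the buffered data is written with one put_object call.
with fs.open(path, "wb", block_size=64 * 2**20) as f:
    f.write(my_data)
```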