casangi / xradio

Xarray Radio Astronomy Data IO

Update processing set to handle S3 bucket URIs #146

Closed: amcnicho closed this 2 months ago

amcnicho commented 4 months ago

Changes to the visibility read and load functions to address #140.

CLAassistant commented 4 months ago

CLA assistant check
All committers have signed the CLA.

amcnicho commented 3 months ago

As long as AWS credentials are configured (documented here and here), the following already works after the refactoring in #149:

```python
import s3fs

from xradio._utils.zarr.common import _open_dataset

s3 = s3fs.S3FileSystem(anon=False, requester_pays=False)

for processing_set in s3.listdir("s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr/"):
    for msv4 in s3.listdir(processing_set["name"]):
        # Prepending the "s3://" scheme to the store path is necessary to
        # avoid a FileNotFoundError; without it the keys are assumed to be
        # stored locally on disk.
        test_output = _open_dataset("s3://" + msv4["name"])
```

This works because zarr can interpret S3 bucket URLs via its FSStore wrapper around the general MutableMapping interface exposed by fsspec, which should be accepted as a dependency.
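As a minimal sketch of that dispatch (assuming zarr 2.x with s3fs installed; the anon=True option assumes the public-read bucket setup discussed below):

```python
import zarr

# zarr routes "s3://" URLs through zarr.storage.FSStore, which wraps the
# MutableMapping interface exposed by fsspec; storage_options are
# forwarded to s3fs.S3FileSystem.
group = zarr.open_group(
    "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
    mode="r",
    storage_options={"anon": True},
)
print(list(group.keys()))
```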

amcnicho commented 3 months ago

A new test was added for the S3 read case, and it is what failed in the checks on this branch. I thought I had set up the test dataset bucket to be completely public, but clearly that configuration (or the initialization of s3fs) is insufficient to yield a passing test. We'll need to consider how to securely provide AWS credentials to the service account running the tests, and then generalize that into a coherent policy for supporting S3 bucket access.
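For orientation, the new test exercises roughly the following shape; the import path for read_processing_set is an assumption (the thread only names the function and the test), and the assertion is illustrative:

```python
# Hypothetical sketch of the new S3 read test; module path assumed.
from xradio.vis.read_processing_set import read_processing_set

def test_s3_read_processing_set():
    ps = read_processing_set(
        "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr"
    )
    assert len(ps) > 0
```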

Jan-Willem commented 3 months ago

1.) On my jwsteeb AWS account I can see public buckets, but when I try https://us-east-1.console.aws.amazon.com/s3/buckets/viper-test-data?region=us-east-1&bucketType=general&tab=objects it says "Insufficient permissions to list objects".

2.) In read_processing_set I removed the try/except PermissionError fallback, and s3 = s3fs.S3FileSystem(anon=False, requester_pays=False) works. This also suggests that the bucket is not public. (A sketch of that fallback pattern appears after this list.)

3.) In test_s3_read_processing_set, why does "s3://viper-test-data/Antennae_North.cal.lsrk.split.ms" use .ms instead of .vis.zarr?
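For context on point 2, a rough reconstruction of the fallback pattern in read_processing_set; the exact code may differ, this is just the shape of a try/except PermissionError downgrade to anonymous access:

```python
import s3fs

def _get_s3_filesystem():
    # Try an authenticated, non-requester-pays client first; fall back
    # to anonymous access if the credentialed listing is rejected.
    try:
        s3 = s3fs.S3FileSystem(anon=False, requester_pays=False)
        s3.ls("viper-test-data")  # probe the bucket to surface auth errors
    except PermissionError:
        s3 = s3fs.S3FileSystem(anon=True)
    return s3
```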

Jan-Willem commented 3 months ago

I have also merged the latest version of main into this branch. This has no impact on the code of this pull request.

amcnicho commented 3 months ago

Although the objects in the bucket containing the test dataset had been set to fully public, it was also necessary to add a policy granting list access to the "Everyone" category (this can also be done via an ACL, if ACLs are enabled). We've done that, which addresses your points 1 and 2. Point 3 was a typo that snuck in from local testing, which I have corrected. At this point the new test still doesn't pass when run by the pytest service account in the test action container.
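For reference, a sketch of the kind of public-list policy described above, applied with boto3; only the bucket name comes from this thread, the Sid and statement layout are illustrative:

```python
import json

import boto3

# Grant anonymous ListBucket on the test-data bucket; the console's
# "Everyone" category corresponds to Principal "*".
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicList",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::viper-test-data",
        }
    ],
}
boto3.client("s3").put_bucket_policy(
    Bucket="viper-test-data", Policy=json.dumps(policy)
)
```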

The main issue to resolve is that botocore.exceptions.NoCredentialsError: Unable to locate credentials is raised when the S3FileSystem is created with anon=True. I wouldn't expect boto to require credentials at all in this case, but even when I provide them (via keyword argument, environment variable, or by passing in an existing session to bypass client configuration in the s3fs library) it still complains.
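In code, the attempts described above look roughly like this (placeholder credentials; the session-passing variant is omitted because its exact form depends on the s3fs version):

```python
import os

import s3fs

# Attempt 1: anonymous access, which unexpectedly raises
# botocore.exceptions.NoCredentialsError.
s3 = s3fs.S3FileSystem(anon=True)

# Attempt 2: explicit keyword credentials (placeholders).
s3 = s3fs.S3FileSystem(key="AKIA...", secret="...")

# Attempt 3: the standard environment variables, which botocore reads
# when no explicit credentials are supplied.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
s3 = s3fs.S3FileSystem()
```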

amcnicho commented 3 months ago

Adding a GetObject policy to all objects in the zarr dataset bucket was also necessary in order to avoid PermissionError and NoCredentialsError from s3fs/boto. After additional refactoring to force the use of signature_version=UNSIGNED (for the anon=True case), the new test of anonymous public reads passes in CI, and local testing with AWS credentials available still works as expected. Unless there are other tests desired, this seems ready :shipit:
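For reference, a minimal sketch of the unsigned-request configuration described above; config_kwargs is s3fs's pass-through to botocore's client Config:

```python
import s3fs
from botocore import UNSIGNED

# Anonymous reads must disable request signing, otherwise botocore
# insists on locating credentials even for public objects.
s3 = s3fs.S3FileSystem(
    anon=True,
    config_kwargs={"signature_version": UNSIGNED},
)
print(s3.ls("viper-test-data"))
```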