euroargodev / argopy

A python library for Argo data beginners and experts
https://argopy.readthedocs.io
European Union Public License 1.2
176 stars 38 forks

S3 support #326

Closed gmaze closed 1 month ago

gmaze commented 7 months ago

The Argo ADMT is experimenting with Amazon S3 in order to move the GDAC infrastructure into the cloud. To prepare argopy for this, and to be able to access and test the AWS prototype server, we need to develop support for S3. This would require:

A new data fetcher shall be developed in another PR

gmaze commented 4 months ago

@tcarval is there any reason for not having the gz index files on S3? https://argo-gdac-sandbox.s3.eu-west-3.amazonaws.com/pub/index.html#pub/idx/

tcarval commented 3 months ago

> @tcarval is there any reason for not having the gz index files on S3? https://argo-gdac-sandbox.s3.eu-west-3.amazonaws.com/pub/index.html#pub/idx/

I am adding the gz indexes (the synchronization gdac - aws is underway)

gmaze commented 3 months ago

New IndexStore ready to work with AWS S3 core index file

from argopy import ArgoIndex
idx = ArgoIndex(host='s3://argo-gdac-sandbox/pub/idx').load()
idx.search_wmo_cyc(6903091, 1)

poke @tcarval

gmaze commented 1 month ago

Problem

On GitHub Actions, when unit testing the new S3 store, we fall back on anonymous requests with the following client:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

fs = boto3.client('s3', config=Config(signature_version=UNSIGNED))

but the tests fail with the error:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the SelectObjectContent operation: Access Denied

I don't get why we can't run select_object_content anonymously on the bucket.

poke: @quai20 @tcarval

MORE:

fs = boto3.client('s3', config=Config(signature_version=UNSIGNED))
fs._request_signer._credentials is None

returns True, as it should
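
A minimal sketch of the fallback described above, assuming credentials reach the CI job through the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper names `has_aws_credentials` and `make_s3_client` are hypothetical and not part of argopy:

```python
import os


def has_aws_credentials(env=None):
    """Return True when both standard AWS key variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))


def make_s3_client():
    """Build a signed client when credentials exist, an anonymous one otherwise."""
    # boto3 is imported lazily so the credential check stays usable without it
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    if has_aws_credentials():
        # Signed requests: boto3 picks the keys up from the environment
        return boto3.client("s3")
    # Anonymous requests, same client as in the comment above
    return boto3.client("s3", config=Config(signature_version=UNSIGNED))
```

The `env` argument only exists so the decision can be checked against a plain dict in a unit test, without touching the real environment.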

object_list = fs.list_objects_v2(Bucket='argo-gdac-sandbox', Prefix="pub/idx/argo_synthetic-profile_index.txt.gz")
object_list

returns:

{'ResponseMetadata': {'RequestId': 'PMVAY0JH3KRP8J3Y',
  'HostId': 'xqACBxsLPkqHm1VEPccv0zsceMm7s3cn5i5mey6Wd0yIHdTED8UbGA+ZGe0pLxiPnJLWaT3goIo=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'xqACBxsLPkqHm1VEPccv0zsceMm7s3cn5i5mey6Wd0yIHdTED8UbGA+ZGe0pLxiPnJLWaT3goIo=',
   'x-amz-request-id': 'PMVAY0JH3KRP8J3Y',
   'date': 'Tue, 09 Jul 2024 20:20:54 GMT',
   'x-amz-bucket-region': 'eu-west-3',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'IsTruncated': False,
 'Contents': [{'Key': 'pub/idx/argo_synthetic-profile_index.txt.gz',
   'LastModified': datetime.datetime(2024, 7, 9, 3, 0, 5, tzinfo=tzutc()),
   'ETag': '"cc0d89c9dbda566cb9a29085b55d3a5a"',
   'Size': 6232628,
   'StorageClass': 'STANDARD'}],
 'Name': 'argo-gdac-sandbox',
 'Prefix': 'pub/idx/argo_synthetic-profile_index.txt.gz',
 'MaxKeys': 1000,
 'EncodingType': 'url',
 'KeyCount': 1}
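
For a quicker check in a test, the useful part of such a response can be reduced to key and size pairs. This is only an illustrative sketch; `summarize_listing` is a hypothetical helper name:

```python
def summarize_listing(response):
    """Return (key, size_in_bytes) pairs from a list_objects_v2 response dict."""
    # 'Contents' is absent when no object matches the prefix
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]


# Trimmed copy of the response shown above
response = {
    "KeyCount": 1,
    "Contents": [
        {"Key": "pub/idx/argo_synthetic-profile_index.txt.gz", "Size": 6232628}
    ],
}
print(summarize_listing(response))
```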

Update

Because describing your problem is always already partly solving it!

The bucket can be read anonymously, but it is the SelectObjectContent method that requires credentials!

https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/S3/MS3SelectObjectContentSelectObjectContentRequest.html

Permissions: You must have the s3:GetObject permission for this operation. Amazon S3 Select does not support anonymous access. For more information about permissions, see Specifying Permissions in a Policy in the Amazon S3 User Guide.
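
Since the bucket itself can be read anonymously, one possible workaround for unauthenticated runs is to download the whole gzipped index with a plain GetObject and filter it client-side instead of using S3 Select. The sketch below is an illustration under assumptions, not argopy's implementation: it assumes the index is a comma-separated text file whose first column is a path such as `coriolis/6903091/profiles/SR6903091_001.nc`, with `#`-prefixed comment lines, and the function names are hypothetical.

```python
import gzip
import io
import re


def fetch_index_bytes(bucket="argo-gdac-sandbox",
                      key="pub/idx/argo_synthetic-profile_index.txt.gz"):
    """Download the gzipped index anonymously with a plain GetObject."""
    # Lazy import so the local filter below works without boto3 installed
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    fs = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    return fs.get_object(Bucket=bucket, Key=key)["Body"].read()


def search_wmo_cyc(gz_bytes, wmo, cyc):
    """Return index rows whose file path matches the given float WMO and cycle."""
    # Assumed path layout: <dac>/<wmo>/profiles/<prefix><wmo>_<cycle>[D].nc
    pattern = re.compile(r"^[^/]+/%d/profiles/[A-Z]+%d_%03dD?\.nc," % (wmo, wmo, cyc))
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        # Comment and header lines never match the pattern, so they are skipped
        return [line.rstrip("\n") for line in lines if pattern.match(line)]
```

Usage would look like `search_wmo_cyc(fetch_index_bytes(), 6903091, 1)`. The trade-off is that the full file (about 6 MB according to the listing above) is transferred on every call, where S3 Select would return only the matching rows.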

gmaze commented 1 month ago

Looking for a solution on the test repo here: https://github.com/gmaze/ga_aws_access