2i2c-org / features

Temporary location for feature requests sent to 2i2c
BSD 3-Clause "New" or "Revised" License

Access to buckets on AWS and GCP from local computers #22

Open consideRatio opened 1 year ago

consideRatio commented 1 year ago

We provide storage buckets for users to write to and set up credentials for them within the user servers started on the hubs. But what if users want to upload something to those buckets from their local computer or similar - how do they acquire permissions to do so?

@scottyhq has developed scottyhq/jupyter-cloud-scoped-creds, but it currently supports AWS S3 buckets and not GCP buckets.

Work items

User requests

Maybe related

Related

jbusecke commented 1 year ago

Would this work for someone on an HPC system as well? If so, that might be a solution to the ticket I opened today (not sure how to link those TBH).

consideRatio commented 1 year ago

I lack experience with HPC systems, but is the difference between "your computer" and "an HPC system" that you only have terminal access, as compared to the ability to open a browser etc.?

Then yes, I think that is the answer you seek. You can still extract temporary cloud credentials from a hub at 2i2c; this ought to be independent of where you extract them to. These can then be used from a terminal on an HPC system using the aws or gcloud CLI, or for example Google's Python cloud storage client.

jbusecke commented 1 year ago

That sounds good. Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.

consideRatio commented 1 year ago

Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.

If you can verify this workflow @jbusecke, it would be helpful!

  1. Install gcloud (google-cloud-sdk) in the user image you use, if it's not already installed
  2. Start a user server, enter a terminal, and run: gcloud auth print-access-token
  3. Use the generated token as described in https://github.com/scottyhq/jupyter-cloud-scoped-creds/issues/2#issuecomment-1407476199 from an HPC terminal, with gcloud storage cp or similar.

Note that the token lasts for one hour, and that if you re-run the print-access-token command, I think it will rely on a previously cached token - so validity is one hour from initial generation, unless you clear the cache somewhere in the home folder.
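
For reference, one hedged way to check how much lifetime a given token has left is Google's public OAuth2 tokeninfo endpoint (assuming its expires_in field behaves as documented; requests is only used here for convenience):

```python
import requests

# Read a token previously produced by `gcloud auth print-access-token`
with open("token.txt") as f:
    access_token = f.read().strip()

# Ask Google's OAuth2 tokeninfo endpoint about the token; `expires_in`
# is the remaining lifetime in seconds (this does not extend the token).
resp = requests.get(
    "https://oauth2.googleapis.com/tokeninfo",
    params={"access_token": access_token},
)
resp.raise_for_status()
print("seconds until expiry:", resp.json()["expires_in"])
```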

jbusecke commented 1 year ago

Just to confirm:

Install gcloud (google-cloud-sdk) in the user image you use, if it's not already installed

This would be on a running server on the hub? And installation is via these instructions?

yuvipanda commented 1 year ago

@jbusecke yep! But I think on the pangeo-data images, they are probably already installed. They're also available from conda if you prefer https://anaconda.org/conda-forge/google-cloud-sdk

jbusecke commented 1 year ago

But I think on the pangeo-data images, they are probably already installed.

I just tested gcloud --help and got bash: gcloud: command not found. I believe this means I have to install it? I'll try the conda route.

jbusecke commented 1 year ago

Ok here are the steps I took:

  1. Installed google-cloud-sdk on my running server with mamba install google-cloud-sdk
  2. Generated token with gcloud auth print-access-token
  3. Copied that token into a local text file on my laptop token.txt
  4. On my laptop I ran

```python
from google.cloud import storage
from google.oauth2.credentials import Credentials

# import an access token
# - option 1: read an access token from a file
with open("token.txt") as f:
    access_token = f.read().strip()

# setup a storage client using credentials
credentials = Credentials(access_token)
storage_client = storage.Client(credentials=credentials)
```

and got this warning:

```
/Users/juliusbusecke/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/auth/_default.py:83: UserWarning:
Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might
receive a "quota exceeded" or "API not enabled" error. We recommend you rerun gcloud auth application-default login and
make sure a quota project is added. Or you can use service accounts instead. For more information about service
accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
```

5. I then tried to ls my leap scratch bucket:

```python
# test the storage client by trying to list content in a google storage bucket
bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
blobs = list(storage_client.list_blobs(bucket_name))
print(len(blobs))
```

which got me a 404 error:

```
---------------------------------------------------------------------------
NotFound                                  Traceback (most recent call last)
Cell In[3], line 3
      1 # test the storage client by trying to list content in a google storage bucket
      2 bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
----> 3 blobs = list(storage_client.list_blobs(bucket_name))
      4 print(len(blobs))
(... internal frames from google.api_core and google.cloud.storage elided ...)
NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/leap-scratch/jbusecke/o?projection=noAcl&prettyPrint=false: Not Found
```

Am I using the url path wrong here?

yuvipanda commented 1 year ago

@jbusecke try the bucket name as just leap-scratch?
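
For context, in the GCS client API the bucket name is just leap-scratch and anything after the first slash is treated as an object prefix, so a minimal sketch of the listing (reusing storage_client from above) would be:

```python
# List objects under the jbusecke/ prefix of the leap-scratch bucket;
# bucket name and prefix are separate arguments in the GCS API.
blobs = list(storage_client.list_blobs("leap-scratch", prefix="jbusecke/"))
print(len(blobs))
```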

yuvipanda commented 1 year ago

I think you can also use the environment variable CLOUDSDK_AUTH_ACCESS_TOKEN, and then use regular gsutil commands to access storage.

jbusecke commented 1 year ago

@jbusecke try the bucket name as just leap-scratch?

Yay! That worked.

use the environment variable CLOUDSDK_AUTH_ACCESS_TOKEN,

As in exporting that on my local machine?

I suppose that for many of the workflows we would want a notebook/script on the HPC cluster which creates an xarray object from e.g. many netCDF files and then writes a zarr store directly to the bucket (unless this is not a recommended workflow). Is there a way to use this token with gcsfs? I just tried naively:

```python
fs = gcsfs.GCSFileSystem(token=access_token)
```

which errors with

```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 fs = gcsfs.GCSFileSystem(token=access_token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/fsspec/spec.py:76, in _Cached.__call__(cls, *args, **kwargs)
     74     return cls._cache[token]
     75 else:
---> 76     obj = super().__call__(*args, **kwargs)
     77     # Setting _fs_token here causes some static linters to complain.
     78     obj._fs_token_ = token

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/core.py:305, in GCSFileSystem.__init__(self, project, access, token, block_size, consistency, cache_timeout, secure_serialize, check_connection, requests_timeout, requester_pays, asynchronous, session_kwargs, loop, timeout, endpoint_url, default_location, version_aware, **kwargs)
    299 if check_connection:
    300     warnings.warn(
    301         "The `check_connection` argument is deprecated and will be removed in a future release.",
    302         DeprecationWarning,
    303     )
--> 305 self.credentials = GoogleCredentials(project, access, token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:50, in GoogleCredentials.__init__(self, project, access, token, check_credentials)
     48 self.lock = threading.Lock()
     49 self.token = token
---> 50 self.connect(method=token)
     52 if check_credentials:
     53     warnings.warn(
     54         "The `check_credentials` argument is deprecated and will be removed in a future release.",
     55         DeprecationWarning,
     56     )

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:226, in GoogleCredentials.connect(self, method)
    207 """
    208 Establish session token. A new token will be requested if the current
    209 one is within 100s of expiry.
   (...)
    215     If None, will try sequence of methods.
    216 """
    217 if method not in [
    218     "google_default",
    219     "cache",
   (...)
    224     None,
    225 ]:
--> 226     self._connect_token(method)
    227 elif method is None:
    228     for meth in ["google_default", "cache", "cloud", "anon"]:

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:147, in GoogleCredentials._connect_token(self, token)
    145 if isinstance(token, str):
    146     if not os.path.exists(token):
--> 147         raise FileNotFoundError(token)
    148     try:
    149         # is this a "service" token?
    150         self._connect_service(token)

FileNotFoundError:
```

and then prints the token 😱, which is not ideal

yuvipanda commented 1 year ago

Looking at https://gcsfs.readthedocs.io/en/latest/#credentials, it looks like you can pass the Credentials object with the token in it rather than the string.

jbusecke commented 1 year ago

Amazing. To wrap up what I did: Steps 1-4 as above.

Then

```python
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token=credentials)
ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr')
ds.to_zarr(mapper)
```

and I confirmed that the zarr array was written:

image
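
A hedged way to do the same check from code rather than the console screenshot (reusing fs from the snippet above; the store path is the one written there):

```python
import xarray as xr

# Re-open the store that was just written and print it;
# succeeding here confirms the upload round-trips.
ds_check = xr.open_zarr(fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr'))
print(ds_check)
```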
jbusecke commented 1 year ago

This is awesome! Thanks.

I will try this tomorrow with a collaborator. One last question. The collaborator should extract the token from their account, correct?

jbusecke commented 1 year ago

I anticipate the 1-hour limit will become a bottleneck for larger datasets. If that could be relaxed somehow in the future, I believe that would be very useful.

consideRatio commented 1 year ago

I anticipate the 1-hour limit will become a bottleneck for larger datasets. If that could be relaxed somehow in the future, I believe that would be very useful.

I'm not confident you get shut down when the token expires - the token can't be checked at every byte sent, so when is it checked? Is it checked between each object uploaded by, for example, gsutil, or between each request made?

@jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation! I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

jbusecke commented 1 year ago

Also, is it better not to assign the token to a variable, for security reasons?

E.g.

```python
with open("token.txt") as f:
    # setup a storage client using credentials
    credentials = Credentials(f.read().strip())
```

Then again this only lives for 1 hour, so the risk is not particularly high I guess.

Another comment re security: I noticed that with this credential I can also delete files. I did fs.rm('leap-scratch/jbusecke/test_offsite_upload.zarr', recursive=True). Wondering if there is a way to get write- or read-only permissions, to avoid mishaps for new users.

jbusecke commented 1 year ago

I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.

@jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation!

I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with leap linking there from our docs?)

consideRatio commented 1 year ago

@jbusecke I'd like this issue to stay scoped to how to extract short-lived credentials matching those provided to you as a user on the user server provided by the hub.

Separately, one can consider whether it's feasible for 2i2c to help provide read-only credentials to some users and read/write to others, but that is an additional, unrelated customization of the credentials provided to the user server in the first place.


I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with leap linking there from our docs?)

I'd like these docs to live in scottyhq/jupyter-cloud-scoped-creds as a project, without assumptions of coupling to 2i2c or similar. I've also proposed that it's a project we help get into the jupyterhub GitHub org in the long run.

Do you think this is also valid [...]

No clue!

jbusecke commented 1 year ago

Sounds good @consideRatio. I'll report back how our testing goes tomorrow.

jbusecke commented 1 year ago

Hey everyone, @jerrylin96 and I have successfully uploaded a test dataset from HPC to the persistent bucket according to the steps outlined above. 🎉

But my suspicion about the short validity of the token

I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.

turns out to be a problem here. @jerrylin96 got an Invalid Credentials 401 error after about ~1 hr of uploading.

image

I suspect every chunk written requires valid authentication, and thus most of the datasets we are (and will be) using would require an access token that is valid for longer.

@consideRatio is it possible to configure the time the token is valid?
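
In the meantime, a hedged workaround sketch we could try (untested; assumes the token file on the HPC side is replaced out of band with a fresh gcloud auth print-access-token value before each expiry, and that the dataset has a time dimension to append along):

```python
import gcsfs
import xarray as xr
from google.oauth2.credentials import Credentials

def fresh_fs(token_path="token.txt"):
    # Re-read a token file that is refreshed out of band, e.g. by re-running
    # `gcloud auth print-access-token` on the hub and copying the new value
    # over before the previous token expires.
    with open(token_path) as f:
        creds = Credentials(f.read().strip())
    # skip_instance_cache forces a new filesystem instead of a cached one
    return gcsfs.GCSFileSystem(token=creds, skip_instance_cache=True)

ds = xr.open_mfdataset("data_*.nc")  # hypothetical input files on the HPC system
store = "leap-scratch/jbusecke/big_upload.zarr"  # hypothetical target store

# Write one time step at a time, re-authenticating before each write,
# so no single write has to outlive a one-hour token.
fs = fresh_fs()
ds.isel(time=slice(0, 1)).to_zarr(fs.get_mapper(store), mode="w")
for i in range(1, ds.sizes["time"]):
    fs = fresh_fs()
    ds.isel(time=slice(i, i + 1)).to_zarr(fs.get_mapper(store), append_dim="time")
```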

yuvipanda commented 1 year ago

I don't really think there's a way to make that token have a longer duration.

I think instead, we should make a separate service account and try to securely provide credentials for that.
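
For illustration, a hedged sketch of what using such a service account could look like on the HPC side (the key file path is a placeholder; how to distribute the key securely is exactly the open question):

```python
import gcsfs

# A service account key file does not expire after an hour the way the
# gcloud access token does; gcsfs accepts a path to the key JSON as token.
fs = gcsfs.GCSFileSystem(token="leap-uploader-key.json")  # placeholder path
print(fs.ls("leap-scratch"))
```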

consideRatio commented 1 year ago

:+1: on adding a specific service account to address this short term at least.

I remember reading about ways of getting longer durations for the tokens for AWS and/or GCP, but it required cloud configuration to allow for it, combined with explicit configuration in the request.

jbusecke commented 1 year ago

Thanks for the update!

Getting this figured out would enable a bunch of people here at LEAP to upload and share datasets with the project (this will definitely also accelerate science), and is thus very high on my internal priorities list.

If there is any way I can help with this, please let me know.