consideRatio opened 1 year ago
Would this work for someone on an HPC system as well? If so, that might be a solution to the ticket I opened today (not sure how to link those, TBH).
I lack experience with HPC systems, but is the difference between "your computer" and "an HPC system" that you just have terminal access, as compared to the ability to open a browser etc.?

Then, yes is the answer you seek, I think. You can still extract temporary cloud credentials from a hub at 2i2c; this ought to be independent of where you extract them to. These can then be used from a terminal on an HPC system using the `aws` or `gcloud` CLI, or, for Google at least, the Python cloud storage client.
That sounds good. Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.

> Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.
If you can verify this workflow @jbusecke, it would be helpful!
1. Install `gcloud` (google-cloud-sdk) in the user image you use, if it's not already installed.
2. Run `gcloud auth print-access-token` to extract a temporary token.
3. Use that token from the HPC system with `gcloud storage cp` or similar.

Note that the token lasts for one hour, and that if you re-run the `print-access-token` command, it will rely on a previous cache I think, so it will be one hour since initial generation unless you clear the cache from somewhere in the home folder.
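For reference, a minimal sketch of step 2 as run from a notebook on the hub (assuming `gcloud` is on the PATH there):

```python
import subprocess

# A sketch: run on the hub to capture a short-lived token for copy-paste
# to the HPC system. The token is valid for roughly one hour.
token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(token)
```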
Just to confirm:
> Install gcloud (google-cloud-sdk) in the user image you use if it's not already installed
This would be on a running server on the hub? And installation is via these instructions?
@jbusecke yep! But I think on the pangeo-data images, they are probably already installed. They're also available from conda if you prefer https://anaconda.org/conda-forge/google-cloud-sdk
> But I think on the pangeo-data images, they are probably already installed.

I just tested `gcloud --help` and got `bash: gcloud: command not found`. I believe this means I have to install it? I'll try the conda route.
OK, here are the steps I took:

1. Installed the SDK with `mamba install google-cloud-sdk`.
2. Ran `gcloud auth print-access-token` on the hub.
3. Saved the token to `token.txt` on the machine I am uploading from.
4. Set up a storage client using the token:

```python
from google.cloud import storage
from google.oauth2.credentials import Credentials

# read the token extracted on the hub
with open("token.txt") as f:
    access_token = f.read().strip()

credentials = Credentials(access_token)
storage_client = storage.Client(credentials=credentials)
```
and got this warning:

```
/Users/juliusbusecke/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/auth/_default.py:83: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
```
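One way to quiet that warning, a sketch assuming you know the hub's Google Cloud project id (`"your-project-id"` is a placeholder): pass the project explicitly when building the client, which `storage.Client` accepts.

```python
from google.cloud import storage
from google.oauth2.credentials import Credentials

with open("token.txt") as f:
    credentials = Credentials(f.read().strip())

# Passing a project explicitly avoids the "without a quota project" warning;
# "your-project-id" is a placeholder for the hub's actual project id.
storage_client = storage.Client(project="your-project-id", credentials=credentials)
```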
5. I then tried to `ls` my LEAP scratch bucket:

```python
# test the storage client by trying to list content in a google storage bucket
bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
blobs = list(storage_client.list_blobs(bucket_name))
print(len(blobs))
```

which got me a 404 error. Am I using the URL path wrong here?
@jbusecke try the bucket name as just `leap-scratch`?

I think you can also use the environment variable `CLOUDSDK_AUTH_ACCESS_TOKEN`, and then use regular `gsutil` commands to access storage.
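A sketch of that route, reusing `token.txt` from the steps above (the local file name and `gsutil` destination are illustrative):

```python
import os
import subprocess

# Export the hub-extracted token so gcloud/gsutil pick it up
with open("token.txt") as f:
    os.environ["CLOUDSDK_AUTH_ACCESS_TOKEN"] = f.read().strip()

# Then regular gsutil commands work; source and destination are illustrative
subprocess.run(
    ["gsutil", "cp", "some_local_file.nc", "gs://leap-scratch/jbusecke/"],
    check=True,
)
```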
> @jbusecke try the bucket name as just `leap-scratch`?
Yay! That worked.
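For reference, a sketch of the working listing call: the bucket name and the object prefix are separate arguments (`list_blobs` accepts a `prefix` keyword).

```python
# Bucket name only: no "gs://" and no subdirectory in the bucket name
blobs = list(storage_client.list_blobs("leap-scratch", prefix="jbusecke/"))
print(len(blobs))
```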
> use the environment variable `CLOUDSDK_AUTH_ACCESS_TOKEN`

As in exporting that on my local machine?
I suppose that for many of the workflows we would want to have a notebook/script on the HPC cluster which creates an xarray object from e.g. many netCDFs and then writes a zarr store directly to the bucket (unless this is not a recommended workflow). Is there a way to use this token with gcsfs? I just tried naively:

```python
fs = gcsfs.GCSFileSystem(token=access_token)
```

which errors with
```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 fs = gcsfs.GCSFileSystem(token=access_token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/fsspec/spec.py:76, in _Cached.__call__(cls, *args, **kwargs)
     74     return cls._cache[token]
     75 else:
---> 76     obj = super().__call__(*args, **kwargs)
     77 # Setting _fs_token here causes some static linters to complain.
     78 obj._fs_token_ = token

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/core.py:305, in GCSFileSystem.__init__(self, project, access, token, block_size, consistency, cache_timeout, secure_serialize, check_connection, requests_timeout, requester_pays, asynchronous, session_kwargs, loop, timeout, endpoint_url, default_location, version_aware, **kwargs)
    299 if check_connection:
    300     warnings.warn(
    301         "The `check_connection` argument is deprecated and will be removed in a future release.",
    302         DeprecationWarning,
    303     )
--> 305 self.credentials = GoogleCredentials(project, access, token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:50, in GoogleCredentials.__init__(self, project, access, token, check_credentials)
     48 self.lock = threading.Lock()
     49 self.token = token
---> 50 self.connect(method=token)
     52 if check_credentials:
     53     warnings.warn(
     54         "The `check_credentials` argument is deprecated and will be removed in a future release.",
     55         DeprecationWarning,
     56     )

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:226, in GoogleCredentials.connect(self, method)
    207 """
    208 Establish session token. A new token will be requested if the current
    209 one is within 100s of expiry.
    (...)
    215 If None, will try sequence of methods.
    216 """
    217 if method not in [
    218     "google_default",
    219     "cache",
    (...)
    224     None,
    225 ]:
--> 226     self._connect_token(method)
    227 elif method is None:
    228     for meth in ["google_default", "cache", "cloud", "anon"]:

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:147, in GoogleCredentials._connect_token(self, token)
    145 if isinstance(token, str):
    146     if not os.path.exists(token):
--> 147         raise FileNotFoundError(token)
    148 try:
    149     # is this a "service" token?
    150     self._connect_service(token)

FileNotFoundError:
```
and then prints the token 😱, which is not ideal.
Looking at https://gcsfs.readthedocs.io/en/latest/#credentials, it looks like you can pass the `Credentials` object with the token in it, rather than the string.
Amazing. To wrap up what I did: steps 1-4 as above, then:

```python
import gcsfs
import xarray as xr

# Pass the Credentials object (not the raw token string) to gcsfs
fs = gcsfs.GCSFileSystem(token=credentials)

ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr')
ds.to_zarr(mapper)
```

and I confirmed that the zarr store was written.
This is awesome! Thanks.
I will try this tomorrow with a collaborator. One last question. The collaborator should extract the token from their account, correct?
I anticipate the 1 hour limit will become a bottleneck for larger datasets in the future. If that could be relaxed somehow, I believe that would be very useful.

> I anticipate the 1 hour limit will become a bottleneck for larger datasets in the future. If that could be relaxed somehow, I believe that would be very useful.
I'm not confident you get shut down if the token expires; the token can't be checked at every byte sent etc., so when is it checked? Is it checked in between each object uploaded by for example `gsutil`, or between each request made?

@jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation! I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.
Also, is it better to not assign the token to a variable, for security reasons? E.g.

```python
with open("token.txt") as f:
    # set up a storage client using credentials, without keeping the raw token around
    credentials = Credentials(f.read().strip())
```
Then again this only lives for 1 hour, so the risk is not particularly high I guess.
Another comment re security: I noticed that with this credential I can also delete files. I did `fs.rm('leap-scratch/jbusecke/test_offsite_upload.zarr', recursive=True)`. Wondering if there is a way to get write-only/read-only permissions, to avoid mishaps for novel users.
> I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.
Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.
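One possible mitigation to test, a sketch assuming the token in `token.txt` is manually re-extracted on the hub and copied over whenever it nears expiry: write the store's metadata once, then write the data in slabs, rebuilding the filesystem (and thus re-reading the token) for each slab. The dataset and slab size here are hypothetical.

```python
import gcsfs
import xarray as xr
from google.oauth2.credentials import Credentials

STORE = "leap-scratch/jbusecke/test_offsite_upload.zarr"

def fresh_fs(token_path="token.txt"):
    # Re-read the token file so a token refreshed on the hub is picked up
    with open(token_path) as f:
        return gcsfs.GCSFileSystem(token=Credentials(f.read().strip()))

ds = ...  # hypothetical dataset assembled from many netCDFs, dask-chunked along "time"

# Write the metadata (and any non-dask variables) without computing the data
ds.to_zarr(fresh_fs().get_mapper(STORE), compute=False)

# Write slabs with a freshly built client each time; note that variables
# without a "time" dimension would need to be excluded from region writes
step = 100
for start in range(0, ds.sizes["time"], step):
    region = {"time": slice(start, min(start + step, ds.sizes["time"]))}
    ds.isel(region).to_zarr(fresh_fs().get_mapper(STORE), mode="a", region=region)
```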
> @jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation!

I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with LEAP linking there from our docs)?
@jbusecke I'd like this issue to stay scoped to how to extract short-lived credentials matching those provided to you as a user on the user server provided by the hub.

As a separate matter, one can consider if it's feasible for 2i2c to help provide read-only credentials to a few users and read/write to others, but that is an additional, unrelated customization of the credentials provided to the user server in the first place.
> I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with LEAP linking there from our docs)?

I'd like these docs to live in scottyhq/jupyter-cloud-scoped-creds as a project, without assumptions of coupling to 2i2c or similar. I've proposed it's a project that we help get into the jupyterhub GitHub org in the long run as well.
> Do you think this is also valid [...]
No clue!
Sounds good @consideRatio. I'll report back how our testing goes tomorrow.
Hey everyone, @jerrylin96 and I have successfully uploaded a test dataset from HPC to the persistent bucket according to the steps outlined above. 🎉
But I think that my suspicion about the short validity of the token

> > I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.
>
> Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.

turns out to be a problem here. @jerrylin96 got an `Invalid Credentials 401` error after about ~1 hr of uploading.
I suspect every chunk written requires valid authentication, and thus most of the datasets we are (and will be) using would require an access token that is valid for longer.
@consideRatio is it possible to configure the time the token is valid?
I don't really think there's a way to make that token have a longer duration.
I think instead, we should make a separate service account and try to securely provide credentials for that.
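If such a service account is set up, its key file could be used directly with gcsfs, which accepts a path to a service account JSON file as the `token` argument (a sketch; the key file name is a placeholder for whatever 2i2c would provide):

```python
import gcsfs

# "sa-key.json" is a placeholder for the service account key file
fs = gcsfs.GCSFileSystem(token="sa-key.json")
fs.ls("leap-scratch")
```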
:+1: on adding a specific service account to address this, short term at least.

I remember reading about ways of getting longer durations for the tokens, for AWS and/or GCP, but it required cloud configuration to allow for it, combined with explicit configuration in the request.
Thanks for the update!
Getting this figured out would unlock a bunch of people here at LEAP to upload and share datasets with the project (this will definitely also accelerate science) and is thus very high on my internal priorities list.
If there is any way I can help with this, please let me know.
We provide storage buckets for users to write to, and set up credentials for them within the user servers started on the hubs. But what if they want to upload something to those buckets from their local computer or similar: how do they acquire permissions to do so?

@scottyhq has developed scottyhq/jupyter-cloud-scoped-creds, but it currently supports AWS S3 buckets and not GCP buckets.
Work items
User requests
Maybe related
Related