Closed ateucher closed 2 months ago
/cc @jnywong and @jmunroe as well!
This is great @ateucher, thank you! I will wait for the technical comments from the others and can then review for flow from a teaching perspective.
Looks great. I think it would be good practice to note who can see what on the buckets -- i.e. can users see the contents of each other's folders or only those under their own user name? Also, probably just a technical footnote, but somewhere it should be documented what the scope of the AWS credentials is (i.e. these AWS credentials work only from the openscapes hub).
just dropping a rough R version of this demo, based on co-working:
library(stars)
earthdatalogin::edl_netrc()
url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"
# Cloud native read
r = read_stars(paste0("/vsicurl/", url))
# Write to s3 bucket
write_stars(r, "/vsis3/shared-biodiversity/test2.zarr")
> i.e. can users see contents of each others folders or only those under their user name?
yes they can!
Thanks so much @jnywong for running through it and catching that typo. Also thank you for that 2i2c link - that's super helpful!
Thanks @cboettig and @yuvipanda for the point about accessing each other's directories - that is important to know
I've pushed a few changes based on feedback here and our coworking session this morning.
@mfisher87 suggested using pathlib instead of os.path.join to construct paths, but it converts s3://bucketname to s3:/bucketname, so I think it doesn't work for URIs. I defaulted to just concatenating strings. Let me know if you think I should go back to using os.
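A minimal sketch of the behaviour described above, using only the standard library. The bucket name is the placeholder from the comment, and the folder and file names are made up for illustration:

```python
from pathlib import PurePosixPath

bucket = "s3://bucketname"  # placeholder bucket URI, not a real bucket

# pathlib treats the URI as a filesystem path and collapses the double slash.
as_path = str(PurePosixPath(bucket) / "folder" / "data.nc")

# Plain string formatting leaves the scheme untouched.
as_string = f"{bucket}/folder/data.nc"

print(as_path)    # s3:/bucketname/folder/data.nc
print(as_string)  # s3://bucketname/folder/data.nc
```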
@betolink let me know if you are successful in writing without tempfiles.
We'll want to get this merged before the cohort call on May 1; can we aim to do that after our dry run on the 30th?
Lovely to see the progress on this!
Another thing worth noting is that if users choose to store objects in the $PERSISTENT_BUCKET, then it is the responsibility of the hub admin and/or hub users to delete objects when no longer needed to minimize cloud billing costs.
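One possible way to handle that cleanup, sketched under assumptions: the helper name `remove_prefix` is hypothetical, and it expects an fsspec-style filesystem object (for example `s3fs.S3FileSystem()`) that provides `find` and `rm`. It defaults to a dry run so nothing is deleted by accident.

```python
def remove_prefix(fs, prefix, dry_run=True):
    """List objects under `prefix`; delete them only when dry_run is False.

    `fs` is assumed to be an fsspec-style filesystem (e.g. s3fs.S3FileSystem).
    """
    keys = fs.find(prefix)  # recursive listing of everything under the prefix
    if not dry_run:
        fs.rm(prefix, recursive=True)
    return keys


# Possible usage on the hub (assumes s3fs is installed and that
# $PERSISTENT_BUCKET holds an s3:// URI; "old-outputs" is a made-up folder):
#
#   import os, s3fs
#   fs = s3fs.S3FileSystem()
#   print(remove_prefix(fs, os.environ["PERSISTENT_BUCKET"] + "/old-outputs"))
#   remove_prefix(fs, os.environ["PERSISTENT_BUCKET"] + "/old-outputs", dry_run=False)
```

Reviewing the returned listing before re-running with `dry_run=False` is a cheap safeguard, since deletions in a bucket are generally not recoverable.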
@ateucher apologies, I must not have been paying close enough attention! I think string concat is fine. I would not go back to os, as on Windows systems \ will be used to join :) Not a problem on the hub, but it could be confusing if a user attempted to take that technique to their personal machine.
There's also urljoin, but I don't feel strongly about this vs string interpolation/concatenation.
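To illustrate the Windows point without needing a Windows machine, here is a small sketch using `ntpath` and `posixpath`, the platform-specific implementations behind `os.path`. The bucket, folder, and file names are placeholders:

```python
import ntpath      # Windows flavour of os.path, importable on any platform
import posixpath   # POSIX flavour of os.path, which is what the hub uses

bucket = "s3://bucketname"  # placeholder bucket URI

# What os.path.join would produce on a Windows machine.
windows_style = ntpath.join(bucket, "folder", "data.nc")

# What os.path.join produces on the hub (Linux).
posix_style = posixpath.join(bucket, "folder", "data.nc")

# Joining with an explicit "/" gives the same result everywhere.
portable = "/".join([bucket, "folder", "data.nc"])

print(windows_style)  # s3://bucketname\folder\data.nc
print(posix_style)    # s3://bucketname/folder/data.nc
print(portable)       # s3://bucketname/folder/data.nc
```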
Great job @ateucher et al! Thanks for merging and excited for you to teach today :)
This adds a draft how-to for saving and using data in the S3 buckets $SCRATCH_BUCKET and $PERSISTENT_BUCKET, using Python. It uses a workflow from the clinic to get data, saves a subset to the scratch bucket, and moves it to the persistent bucket.
@betolink I would love to have your eyes on this. I am not great at Python so I'm sure my process could be refined, and at the very end I've got a section that doesn't work: writing from an xarray object with no local source to an S3 bucket without first writing locally. I'm not sure it can be done but would love to be proven wrong.
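As a rough sketch of how object paths for the two buckets might be built in Python: the helper `bucket_uri` is hypothetical (it is not taken from the how-to itself), and it assumes the hub exposes each bucket as an s3:// URI in the environment variables named above.

```python
import os


def bucket_uri(*parts, env_var="SCRATCH_BUCKET"):
    """Join path parts onto the bucket URI stored in an environment variable."""
    base = os.environ[env_var].rstrip("/")
    # Plain "/" joining keeps the "s3://" scheme intact.
    return "/".join([base, *(p.strip("/") for p in parts)])


# Possible usage on the hub (file name is illustrative; assumes s3fs is installed):
#
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   src = bucket_uri("subset.nc")                                # scratch copy
#   dst = bucket_uri("subset.nc", env_var="PERSISTENT_BUCKET")   # persistent copy
#   fs.mv(src, dst)
```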
Note: policies around using this storage haven't been written yet.