NASA-Openscapes / earthdata-cloud-cookbook

A tutorial book of workflows for research using NASA EarthData in the Cloud created by the NASA-Openscapes team
https://nasa-openscapes.github.io/earthdata-cloud-cookbook

Add draft howto for S3 storage #325

Closed ateucher closed 2 months ago

ateucher commented 2 months ago

This adds a draft how-to for saving and using data in the S3 buckets $SCRATCH_BUCKET and $PERSISTENT_BUCKET, using Python. It uses a workflow from the clinic to get data, saves a subset to the scratch bucket, and moves it to the persistent bucket.
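
For reference, a minimal sketch of the scratch-to-persistent step, assuming the s3fs library and a hypothetical object name (subset.nc); the actual how-to may differ:

import os
import s3fs

# Uses the AWS credentials already available in the hub environment
fs = s3fs.S3FileSystem()

# Hypothetical object name; the how-to saves a data subset under its own key
scratch_path = f"{os.environ['SCRATCH_BUCKET']}/subset.nc"
persistent_path = f"{os.environ['PERSISTENT_BUCKET']}/subset.nc"

# Copy from the scratch bucket to the persistent bucket, then remove the original
fs.copy(scratch_path, persistent_path)
fs.rm(scratch_path)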

@betolink I would love to have your eyes on this. I am not great at Python so I'm sure my process could be refined, and at the very end I've got a section that doesn't work: Writing from an xarray object with no local source to an S3 bucket without first writing locally. I'm not sure it can be done but would love to be proven wrong.
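
One possible approach, if the Zarr format is acceptable: xarray can write to an fsspec/s3fs mapping without a local intermediate file. A minimal sketch, with a hypothetical store name and a toy dataset standing in for the real subset:

import os
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()

# Small in-memory dataset standing in for the subset with no local source
ds = xr.Dataset({"x": ("t", [1, 2, 3])})

# Write straight to a Zarr store in the scratch bucket; no temp file needed
store = s3fs.S3Map(root=f"{os.environ['SCRATCH_BUCKET']}/example.zarr", s3=fs)
ds.to_zarr(store=store, mode="w")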

Note: policies around using this storage haven't been written yet.

yuvipanda commented 2 months ago

/cc @jnywong and @jmunroe as well!

jules32 commented 2 months ago

This is great, @ateucher, thank you! I will wait for the technical comments from the others and then review from a flow/teaching perspective.

cboettig commented 2 months ago

Looks great. I think it would be good practice to note who can see what on the buckets, i.e. can users see the contents of each other's folders, or only those under their own username? Also, probably just a technical footnote, but somewhere it should be documented what the scope of the AWS credentials is (i.e. these AWS credentials work only from the Openscapes hub).

cboettig commented 2 months ago

Just dropping a rough R version of this demo, based on co-working:

library(stars)
earthdatalogin::edl_netrc()
url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"

# Cloud native read
r <- read_stars(paste0("/vsicurl/", url))
# Write to an S3 bucket
write_stars(r, "/vsis3/shared-biodiversity/test2.zarr")

yuvipanda commented 2 months ago

i.e. can users see the contents of each other's folders, or only those under their own username?

Yes, they can!

ateucher commented 2 months ago

Thanks so much @jnywong for running through it and catching that typo. Also thank you for that 2i2c link - that's super helpful!

ateucher commented 2 months ago

Thanks @cboettig and @yuvipanda for the point about accessing each other's directories - that is important to know.

ateucher commented 2 months ago

I've pushed a few changes based on feedback here and our coworking session this morning.

@mfisher87 suggested using pathlib instead of os.path.join to construct paths, but it converts s3://bucketname to s3:/bucketname, so it doesn't work for S3 URIs. I defaulted to just concatenating strings. Let me know if you think I should go back to using os.path.join.
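
For anyone following along, a quick illustration of the pathlib behaviour in question versus plain string formatting (bucket value in the comments is hypothetical):

import os
from pathlib import Path

bucket = os.environ["SCRATCH_BUCKET"]  # e.g. "s3://my-scratch-bucket/username"

# pathlib collapses the "//" in the scheme, which breaks the URI
print(Path(bucket) / "subset.nc")  # s3:/my-scratch-bucket/username/subset.nc

# Plain string concatenation or f-strings keep the URI intact
print(f"{bucket}/subset.nc")       # s3://my-scratch-bucket/username/subset.nc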

@betolink let me know if you are successful in writing without tempfiles.

We'll want to get this merged before the cohort call on May 1; can we aim to do that after our dry run on the 30th?

jnywong commented 2 months ago

Lovely to see the progress on this!

Another thing worth noting: if users choose to store objects in $PERSISTENT_BUCKET, it is the responsibility of the hub admin and/or hub users to delete objects once they are no longer needed, to minimize cloud billing costs.
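
A minimal sketch of what that cleanup could look like with s3fs (the object name is hypothetical):

import os
import s3fs

fs = s3fs.S3FileSystem()

# List what has accumulated in the persistent bucket
print(fs.ls(os.environ["PERSISTENT_BUCKET"]))

# Remove an object (or a whole prefix with recursive=True) when it is no longer needed
fs.rm(f"{os.environ['PERSISTENT_BUCKET']}/subset.nc")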

mfisher87 commented 2 months ago

@ateucher apologies, I must not have been paying close enough attention! I think string concat is fine. I would not go back to os, as on Windows systems \ will be used to join paths :) Not a problem on the hub, but it could be confusing if a user attempted to take that technique to their personal machine.

There's also urljoin, but I don't feel strongly about this vs string interpolation/concatenation.

jules32 commented 2 months ago

Great job @ateucher et al! Thanks for merging and excited for you to teach today :)