NASA-Openscapes / earthdata-cloud-cookbook

A tutorial book of workflows for research using NASA EarthData in the Cloud created by the NASA-Openscapes team
https://nasa-openscapes.github.io/earthdata-cloud-cookbook

Guiding the use of HTTPS vs. S3 when working in-region #163

Open asteiker opened 1 year ago

asteiker commented 1 year ago

@ashiklom brought up a great question during the ESIP Earthdata Cloud session. Paraphrasing: why not simply work with HTTPS links even when you’re in-region, instead of switching over to S3, which comes with so many additional challenges? Some feedback from @bilts:

I think it’s a both/and situation. HTTP could be improved. s3:// gives you some nice things (list operations, guaranteed partial file access, parallelism in access, some straightforward way to mount as a filesystem) and has a lot of traction with things like zarr/xarray

But I’m left wondering what our role is here as far as teaching or promoting this option. I’d love to know more about the pros/cons that Patrick listed off (I’m sure there are many other considerations) and, as Alexey pointed out, whether we can identify where the limitations are (i.e. in the tooling or in the system itself).

Some ideas:

From @briannapagan:

This is where you would need some benchmarking to explain the use cases for choosing between the two. For cloud-optimized formats in particular, performance could vary greatly and should improve using S3 in-region, especially for analysis-in-place (i.e. I am going to subset, average, or apply some other common operation to cloud files before downloading, if I download at all).

@ashiklom:

I think the most useful thing OpenScapes can do here in the short term is give a high-level survey of available options for s3 vs. http access as well as their pros/cons. Also, check out this Twitter thread: https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ

@charlesstern (Twitter, Sep 26th, 2022): This is the public http endpoint Pangeo Forge's S3 bucket on @OpenStorageNet. Generic HTTP(S) servers won't be able to handle the scaled parallel reads a Zarr store is designed for. But yes, if the HTTP(S) link points to Zarr on cloud storage this should "just work".

And this closely-related thread: https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ

@charlesstern (Twitter, Sep 26th, 2022): Good point!

See this comment on earthaccess issue #188, "Document why signed S3 URLs might be giving 400s when called from inside us-west-2": https://github.com/nsidc/earthaccess/issues/188#issuecomment-1403789306

@betolink :

Great points. HTTPS could also scale (with some latency, as Yuvi pointed out); in this case TEA (Thin Egress App) is only a proxy for S3. And I agree with @briannapagan that we need to do some benchmarking to verify how it impacts speed and parallelism.

This could be a topic for a future hackday. Further suggestions from @ScienceCat18 :

100% agree we should outline the different access options (https/s3) and their pros & cons (as has already been said in this thread), and I think the key is to do so from an end-user perspective (so high level and not too technical). Which use cases are best suited for https, and which for s3? Agreed with updating the cheatsheets with this info as well! +1 to having this as a topic for a cookbook hack-day.

yuvipanda commented 1 year ago

When accessing NASA S3-hosted data in-region via HTTPS links, you're still actually just using S3! This happens through presigned URLs (https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html), which are generated automatically for you. There are a couple of automatic steps in between that add latency, but it's not a generic HTTP server sending things to you. It's the exact same serving infrastructure that S3 uses, just with a different authentication mechanism.
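
To make the presigned-URL point concrete, here is a minimal sketch that inspects the query string of such a URL. The bucket, key, credential, and signature below are made up for illustration, and `describe_presigned_url` is a hypothetical helper rather than part of any library. The `X-Amz-*` parameter names are the ones AWS Signature Version 4 uses for query-string authentication.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical presigned URL: bucket, key, credential and signature are invented.
EXAMPLE_URL = (
    "https://example-bucket.s3.us-west-2.amazonaws.com/path/to/granule.nc"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Credential=EXAMPLEKEY%2F20220926%2Fus-west-2%2Fs3%2Faws4_request"
    "&X-Amz-Date=20220926T120000Z"
    "&X-Amz-Expires=3600"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=0123abcd"
)


def describe_presigned_url(url: str) -> dict:
    """Summarize the S3-related parts of a URL (illustrative helper)."""
    parts = urlsplit(url)
    # parse_qs returns lists of values; keep the first value for each key.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}

    # The credential scope looks like <key-id>/<date>/<region>/s3/aws4_request.
    scope = query.get("X-Amz-Credential", "").split("/")
    region = scope[2] if len(scope) == 5 else None

    expires = query.get("X-Amz-Expires")
    return {
        "host": parts.netloc,
        "key": parts.path.lstrip("/"),
        "is_presigned": "X-Amz-Signature" in query,
        "region": region,
        "expires_in_seconds": int(expires) if expires is not None else None,
    }


info = describe_presigned_url(EXAMPLE_URL)
print(info)
```

The host in a URL like this is an S3 endpoint, which is the sense in which the HTTPS path ends up on the same serving infrastructure; the signature in the query string stands in for the S3 credentials.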

So the question really is 'what performance benefit do we get from using direct S3 authentication vs the presigned URLs given to us by earthdata redirects?'. That could be settled by some benchmarking.
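
A benchmark along these lines could start from a small timing harness like the sketch below. It is only an assumption about how one might structure the comparison: `benchmark` and `fake_reader` are hypothetical names, and the reader here is a stand-in that sleeps and returns bytes. In a real run you would pass one callable that reads a granule over HTTPS and another that reads the same granule with direct S3 access, from the same in-region machine.

```python
import time
from statistics import median
from typing import Callable


def benchmark(read: Callable[[], bytes], repeats: int = 5) -> dict:
    """Time a zero-argument reader several times and summarize the runs."""
    durations = []
    nbytes = 0
    for _ in range(repeats):
        start = time.perf_counter()
        data = read()
        durations.append(time.perf_counter() - start)
        nbytes = len(data)
    best = min(durations)
    return {
        "bytes": nbytes,
        "median_s": median(durations),
        "best_s": best,
        # Throughput for the fastest run, in megabytes per second.
        "best_MB_per_s": nbytes / best / 1e6 if best > 0 else float("inf"),
    }


def fake_reader() -> bytes:
    """Stand-in for a real HTTPS or S3 read; simulates latency only."""
    time.sleep(0.01)
    return b"x" * 1_000_000


result = benchmark(fake_reader, repeats=3)
print(result)
```

Reporting both the median and the best run helps separate steady-state throughput from one-off latency such as the redirect steps mentioned above.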

My intuition is that we should tell users to use the HTTPS url by default, only switching to s3 in very specific (to be determined) cases. The advantages are:

  1. Users don't need to write different code based on where the code is executing!
  2. Users don't have to write fundamentally different code to access data based on where it lives (S3 vs on-prem)
  3. We stop giving AWS free publicity by spending our resources educating end users on AWS best practices they might not need :)
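
As a rough illustration of the first advantage, this is the kind of location-dependent branching that direct S3 access tends to push onto users. The `links` dictionary and the `pick_link` helper are hypothetical and only sketch the idea; the URLs are placeholders.

```python
def pick_link(links: dict, running_in_region: bool) -> str:
    """Choose an access link for one granule (illustrative helper).

    Direct S3 links are only usable in-region, so code that wants them
    needs a fallback to HTTPS everywhere else.
    """
    if running_in_region and "s3" in links:
        return links["s3"]
    return links["https"]


# Placeholder links for a single granule.
granule_links = {
    "https": "https://example.com/protected/granule.nc",
    "s3": "s3://example-bucket/protected/granule.nc",
}

in_region_choice = pick_link(granule_links, running_in_region=True)
elsewhere_choice = pick_link(granule_links, running_in_region=False)
print(in_region_choice, elsewhere_choice)
```

With HTTPS as the default, the branch disappears: the same link works whether or not the code runs in-region.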

yuvipanda commented 1 year ago

I do agree that hard data in terms of a quick benchmark would be necessary to move forward here.

yuvipanda commented 1 year ago

At least in pangeo / xarray land, one big problem was that xarray / fsspec did not work with .netrc files for accessing data in the cloud. I've been working deep in the bowels of aiohttp (with this PR: https://github.com/aio-libs/aiohttp/pull/7131) to fix that. Once that is done, the netrc-based solution will work for both on-prem and cloud access, regardless of where the calling code runs.
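
For readers unfamiliar with .netrc files, here is a minimal sketch of reading one with Python's standard-library `netrc` module. The host name `urs.earthdata.nasa.gov` (Earthdata Login) is an assumption not stated in this thread, `earthdata_credentials` is a hypothetical helper, and the user name and password are dummies written to a temporary file. It does not show the aiohttp change itself.

```python
import netrc
import os
import tempfile

# Assumed Earthdata Login host; not named in the thread above.
EDL_HOST = "urs.earthdata.nasa.gov"


def earthdata_credentials(netrc_path: str):
    """Return (login, password) for EDL_HOST, or None if there is no entry."""
    auth = netrc.netrc(netrc_path).authenticators(EDL_HOST)
    if auth is None:
        return None
    login, _account, password = auth
    return login, password


# Demo with a throwaway file instead of the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w") as fh:
        fh.write(f"machine {EDL_HOST} login demo_user password demo_pass\n")
    creds = earthdata_credentials(path)

print(creds)
```

The appeal of this approach is that the credentials live in one file keyed by host, so the calling code does not need to know whether the data sits on-prem or in the cloud.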