Pulls from S3 bucket rather than from AWS URL

ecmwf / ecmwf-opendata

A package to download ECMWF open data

Apache License 2.0


Pulls from S3 bucket rather than from AWS URL #35

Open Dadoof opened 9 months ago

Dadoof commented 9 months ago

Hello there folks,

I was making use of this opendata package to get the new 0.25 degree data, and I noticed something that I would like to investigate.

As it stands now, the tools I see, namely 'client.py', pull data from an HTTPS URL. For example, something like: wget https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index

I believe that, from an AWS EC2 instance, this would be a faster pull mechanism: aws s3 cp --no-sign-request s3://ecmwf-forecasts/20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index .

Those are command-line steps, of course; inside client.py and such, different tools would be needed. My description above was simply to show the difference between a pull via the HTTPS URL and a pull via the S3 URI.
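To make the relationship between the two addresses concrete, here is a minimal Python sketch that maps the HTTPS form to the s3:// form. The function name `https_to_s3_uri` is hypothetical and not part of ecmwf-opendata; it assumes the virtual-hosted URL layout `<bucket>.s3.<region>.amazonaws.com` seen in the example above, and it does not handle bucket names that themselves contain `.s3.`.

```python
from urllib.parse import urlparse


def https_to_s3_uri(url: str) -> str:
    """Map a virtual-hosted-style S3 HTTPS URL to the equivalent s3:// URI.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Virtual-hosted style: <bucket>.s3.<region>.amazonaws.com
    bucket, sep, rest = host.partition(".s3.")
    if not sep or not rest.endswith("amazonaws.com"):
        raise ValueError(f"not a virtual-hosted S3 URL: {url}")
    # The object key is simply the URL path.
    return f"s3://{bucket}{parsed.path}"


if __name__ == "__main__":
    print(https_to_s3_uri(
        "https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/"
        "20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index"
    ))
```

Both forms address the same object; only the access path (plain HTTPS GET versus the S3 API) differs.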

Any chance of adding the capability to pull from the S3 bucket (and thus over the AWS 'backbone') into an AWS EC2 instance, rather than over HTTPS?

Regards, Brian E.

floriankrb commented 9 months ago

I would assume that a nice pull request to improve this would be welcome.

Two things to check though:

1. "I believe ... would be faster": some benchmarks would be needed. Is it really faster? When? Where (AWS has different regions)?
2. "from an amazon EC2 instance": the check for whether the code is running on an EC2 instance should be robust. We do not want to break things elsewhere to support this use case.

Dadoof commented 9 months ago

Hi there,

As for the proper benchmarking, you are correct. For me, anecdotally, it is a good deal quicker. Did this today:

time aws s3 cp --no-sign-request s3://ecmwf-forecasts/20240202/12z/0p25/enfo/20240202120000-0h-enfo-ef.grib2 .
real    0m13.074s
user    0m9.010s
sys     0m10.453s

time wget https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/20240202/12z/0p25/enfo/20240202120000-0h-enfo-ef.grib2
real    1m2.003s
user    0m2.418s
sys     0m5.887s

This indicates that, for this one simple case, the pull from the S3 bucket is noticeably quicker in wall-clock time: about 13 seconds versus about 62 seconds, or roughly 4.7 times faster.
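A single run like this is only anecdotal, so a more repeatable comparison could be sketched as below. This is a minimal timing harness under stated assumptions: `time_downloads` is a hypothetical name, each download method is passed in as a zero-argument callable, and the median of several runs is reported to dampen one-off network noise.

```python
import statistics
import time
from typing import Callable, Dict


def time_downloads(
    methods: Dict[str, Callable[[], None]], repeats: int = 3
) -> Dict[str, float]:
    """Return the median wall-clock seconds per named download callable.

    Hypothetical helper; each callable is expected to perform one full download.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    results: Dict[str, float] = {}
    for name, download in methods.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            download()
            samples.append(time.perf_counter() - start)
        results[name] = statistics.median(samples)
    return results
```

Each callable could, for instance, wrap `subprocess.run([...], check=True)` around the `aws s3 cp` and `wget` commands shown above. Results will still depend on the region the instance runs in and on the time of day, so they are best reported together with that context.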

As for that EC2 instance: I was merely hoping that would be an option, not that it would replace any other capabilities. That is, if one wanted to use the S3 bucket rather than the AWS HTTPS endpoint as the location to get data from, that option would exist.
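If automatic detection were ever wanted alongside an explicit option, one possible check is the EC2 instance metadata service (IMDSv2), which answers a token request at a link-local address on EC2 instances. The sketch below is an assumption about how such a check might look, not code from ecmwf-opendata: `running_on_ec2` and its `opener` parameter are hypothetical, and the check returns False whenever the endpoint is unreachable, which also covers instances where the metadata service is disabled.

```python
import urllib.request

# IMDSv2 session-token endpoint, reachable only from inside an EC2 instance.
IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def running_on_ec2(timeout: float = 0.5, opener=urllib.request.urlopen) -> bool:
    """Best-effort check: True if the IMDSv2 token endpoint answers with 200.

    Hypothetical helper. `opener` is injectable so the check can be tested
    without network access.
    """
    request = urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and socket timeouts are OSError subclasses: treat any
        # failure to reach the endpoint as "not on EC2".
        return False
```

The short timeout keeps the cost low for users outside AWS. Being on EC2 does not by itself mean the instance is in the same region as the bucket, so an explicit opt-in remains the safer default.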

Regards, Brian E

jvahl commented 9 months ago

To pull files directly from the S3 URI (s3://...), the backend would need to utilize boto3 instead of requests. I think it would be best to start building this capability in the multiurl dependency which executes the downloads.
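As a rough illustration of what a boto3-based download could look like, here is a minimal sketch. It is not multiurl's actual API: `parse_s3_uri` and `download_from_s3` are hypothetical names, the default region is taken from the HTTPS URL earlier in this thread, and unsigned requests are used to mirror the `--no-sign-request` flag. boto3 is imported inside the function so that it stays an optional dependency.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 URI: {uri}")
    return parsed.netloc, key


def download_from_s3(uri: str, target: str, region: str = "eu-central-1") -> None:
    """Download a public S3 object anonymously to a local file path."""
    # Imported lazily: boto3 would only be needed when this path is used.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = parse_s3_uri(uri)
    client = boto3.client(
        "s3",
        region_name=region,
        # Equivalent in spirit to `aws s3 cp --no-sign-request`.
        config=Config(signature_version=UNSIGNED),
    )
    client.download_file(bucket, key, target)
```

A call might look like `download_from_s3("s3://ecmwf-forecasts/20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index", "ef.index")`. Whether this belongs in multiurl or in ecmwf-opendata itself is a design question for the maintainers.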