HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

bulk download suggestions #118

Closed gummadiharsha closed 1 year ago

gummadiharsha commented 2 years ago

I am interested in downloading only a few datasets rather than the entire 1.4TB dataset available at https://registry.opendata.aws/nrel-pds-wtk/

I want only the windspeed_10m, temperature_2m, temperature_10m, pressure_0m and relative humidity_2m datasets for the entire duration (2007-2014). Is there an easier way to do this?

Thanks in advance

jreadey commented 2 years ago

Are you talking about the files here: s3://nrel-pds-wtk/wtk-us/? Total collection looks to be around 25TB.

The easiest approach, I think, would be the following:

  1. Set up an EC2 instance in us-west-2 with a large enough disk drive
  2. Copy the first file to the drive
  3. Write a small h5py script that deletes the datasets you don't want
  4. Use the h5repack tool to reduce the file size (deleting HDF5 objects does not automatically reduce the file size)
  5. Download the repacked file
  6. Repeat steps 2-5 for each file
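
Steps 3 and 4 could be sketched roughly as below. This is only a minimal sketch, not a tested recipe for the WTK files: it assumes the variables are stored as top-level datasets, that the names in `KEEP` match the spelling used inside the files (`relativehumidity_2m` in particular is a guess), and that `h5repack` is on the PATH. The helper names `prune_datasets` and `repack` are made up for illustration. You would probably also want to add any coordinate or time-index datasets to `KEEP`.

```python
import subprocess

import h5py

# Names taken from the question above; the exact spelling inside the
# files is an assumption and should be checked against f.keys().
KEEP = {
    "windspeed_10m",
    "temperature_2m",
    "temperature_10m",
    "pressure_0m",
    "relativehumidity_2m",
}


def prune_datasets(path, keep):
    """Delete every top-level dataset whose name is not in `keep`.

    Groups are left untouched. Returns the names that were removed.
    """
    removed = []
    with h5py.File(path, "r+") as f:
        # Copy the key list first so we don't mutate while iterating.
        for name in list(f.keys()):
            if isinstance(f[name], h5py.Dataset) and name not in keep:
                del f[name]
                removed.append(name)
    return removed


def repack(src, dst):
    """Rewrite `src` into `dst` with h5repack to reclaim the freed space."""
    subprocess.run(["h5repack", src, dst], check=True)


# Example usage (file names are hypothetical):
# prune_datasets("local_copy.h5", KEEP)
# repack("local_copy.h5", "local_copy_small.h5")
```

Deleting with `del f[name]` only unlinks the dataset, so the file on disk stays the same size until `h5repack` writes a fresh copy.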
jreadey commented 1 year ago

Closing this issue - please reopen if the above approach isn't satisfactory.