NREL / rex

REsource eXtraction Tool (rex)
https://nrel.github.io/rex
BSD 3-Clause "New" or "Revised" License

download binary file #63

Closed ryancoe closed 3 years ago

ryancoe commented 3 years ago

Why this feature is necessary: Is it possible to download an entire dataset (e.g., /nrel/nsrdb/v3/nsrdb_2018.h5) as some sort of binary file?

I have considered the following alternatives: I tried using the CLI, but this appears to output csv files.

To give more insight, what I would eventually like to do is work with the data using xarray and dask. Downloading huge csv files is workable, but it seems like there is probably a better way that I'm unaware of.

Thanks very much ;)

MRossol commented 3 years ago

@ryancoe If you are interested in entire datasets I'd recommend that you download the entire source file from S3: s3://nrel-pds-nsrdb/v3/nsrdb_2018.h5

Our HSDS service doesn't have the bandwidth to extract an entire dataset, which is ~71 GB.

ryancoe commented 3 years ago

Thanks @MRossol! I was new to AWS, but found the following to work fine (posting here in case someone else finds it useful):

  1. Download and install the AWS CLI - no need to configure (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html)
  2. Look at the AWS Open Data registry to see what's available/directory structure (e.g., for the NSRDB: https://registry.opendata.aws/nrel-pds-nsrdb/)
  3. List directory contents, e.g., aws s3 ls s3://nrel-pds-nsrdb/v3/puerto_rico/ --no-sign-request --human-readable
  4. Download the files you desire, e.g., aws s3 cp s3://nrel-pds-nsrdb/v3/puerto_rico/nsrdb_puerto_rico_2017.h5 --no-sign-request ~/Downloads/

One further question if you don't mind: It appears the full h5 files (e.g., s3://nrel-pds-nsrdb/v3/nsrdb_1998.h) are actually ~1.5 TB. I guess what I was really hoping to do with my original question was download a subset of the data I'm interested in using rex (probably via the Python interface) and save that as an h5 file. I believe this might go something like:

from rex import NSRDBX

nsrdb_file = '/nrel/nsrdb/v3/nsrdb_2018.h5'
with NSRDBX(nsrdb_file, hsds=True) as f:
    ghi = f.get_region_df('ghi', 'California') # this puts the data in memory as a Pandas DataFrame
    ghi.to_netcdf('savepath.h5') # this is the part that I'm missing

    # alternatively, it would be nice to just directly save the data to disk without putting the whole thing in memory
    f.get_region_df('ghi', 'California').to_netcdf('savepath.h5') # something like this?
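As an aside on the snippet above: a pandas DataFrame has no `to_netcdf` method (that belongs to xarray), but the "save the region to disk without holding it all in memory" idea can be sketched with plain h5py by pre-allocating the on-disk dataset and filling it slice by slice. Everything below is hypothetical: the array sizes, the file name, and the random placeholder data standing in for chunks fetched from HSDS.

```python
import numpy as np
import h5py

# Hypothetical sizes: one year of hourly data for 100 sites in the region
n_time, n_sites, step = 8760, 100, 1000

with h5py.File('ghi_california.h5', 'w') as out:
    # pre-allocate the on-disk dataset, then fill it in time slices
    dset = out.create_dataset('ghi', shape=(n_time, n_sites), dtype='float32')
    for start in range(0, n_time, step):
        stop = min(start + step, n_time)
        # placeholder: in the real workflow this slice would be fetched
        # from the HSDS-backed file instead of generated randomly
        chunk = np.random.rand(stop - start, n_sites).astype('float32')
        dset[start:stop, :] = chunk
```

Only one time slice is ever resident in memory, so peak usage is bounded by `step * n_sites` values rather than the full dataset.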
MRossol commented 3 years ago

@ryancoe, do you need all of the variables for the given region or just 1? If it's just one, we normally just dump the table produced by get_region_df to a .csv file. If you would like all of the variables (or a subset of them) I can try to find time to add in a new method to dump a region to a new .h5 file.
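The single-variable csv route mentioned above might look like the following sketch; the DataFrame here is a made-up stand-in, since calling `get_region_df` for real requires a live HSDS connection.

```python
import pandas as pd

# stand-in for the table a call like f.get_region_df('ghi', 'California')
# would return: one column per site, one row per timestep
ghi = pd.DataFrame({'site_0': [400.0, 410.0], 'site_1': [395.0, 402.0]})
ghi.to_csv('ghi_california.csv')
```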

ryancoe commented 3 years ago

@MRossol - Thanks very much for your quick reply and help! I think that I only want a subset of the variables, so if it is easy to create a method for dumping that to an h5, that'd be really cool!

MRossol commented 3 years ago

@ryancoe see v0.2.39, specifically the new save_region method

Here is the test for an example of how to implement it: https://github.com/NREL/rex/blob/ecaca4cc311a1410fcece785c075a3362d64a6e8/tests/test_resource_extraction.py#L752-L753

ryancoe commented 3 years ago

@MRossol - First off, thanks so much for your help with this, I really appreciate it! I tried running the following based on your test. I hope that I'm not missing something silly, but that's definitely a possibility.

from rex import WindX
import os
import tempfile

region = 'Providence'
region_col = 'county'

datasets = ['windspeed_100m', None, ['windspeed_100m']]

wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
with WindX(wtk_file, hsds=True) as f:
    for dset in datasets:
        print(dset)
        with tempfile.TemporaryDirectory() as td:
            out_path = os.path.join(td, 'test.h5')
            try:
                f.save_region(out_path, region, datasets=dset,
                              region_col=region_col)
            except Exception as e:
                print(e)

As you can see, I was trying a couple of ways for setting the datasets argument, but all of these return similar errors

Traceback (most recent call last):
  File "test_save_region.py", line 19, in <module>
    f.save_region(out_path, region, datasets=None,
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/rex/resource_extraction/resource_extraction.py", line 935, in save_region
    data = ds[gids]
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/dataset.py", line 729, in __getitem__
    selection = sel.select(self, args)
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/selections.py", line 88, in select
    sel[arg]
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/selections.py", line 267, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
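For context, the error above comes from h5pyd rejecting fancy indexing with an integer gid list. A common workaround for that limitation (a sketch under assumed shapes, not necessarily rex's actual fix) is to read one contiguous slice spanning the gids and subset it locally with numpy:

```python
import numpy as np

# hypothetical site gids for the region and an in-memory stand-in for the
# remote dataset (rows: time, cols: sites)
gids = np.array([3, 7, 11])
ds = np.arange(200).reshape(10, 20)

# read a single contiguous slice (a selection HSDS handles well),
# then pick out the wanted columns in local memory
block = ds[:, gids.min():gids.max() + 1]
data = block[:, gids - gids.min()]

assert np.array_equal(data, ds[:, gids])
```

The trade-off is transferring the extra columns between `gids.min()` and `gids.max()`, which is usually acceptable when the gids are clustered.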
MRossol commented 3 years ago

@ryancoe, two things: 1) On your end you can greatly simplify your code to look like this:

from rex import WindX
import os

# region you want to extract datasets for
region = 'Providence'
region_col = 'county'

# Create list of the sub-set of datasets you want to extract
datasets = ['{}_100m'.format(v) for v in ['windspeed', 'winddirection', 'pressure', 'temperature']]

wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
with WindX(wtk_file, hsds=True) as f:   
    out_path = 'test.h5'  # You need to supply a valid local file path
    f.save_region(out_path, region, datasets=datasets,
                  region_col=region_col)

2) You managed to find a bug that my tests didn't catch for some strange reason. I'm fixing it now and will release v0.2.40 momentarily

ryancoe commented 3 years ago

@MRossol - Awesome, thanks! This works great for WindX, but I get the following error when trying to use WaveX. The h5 still saves, but has only the first dataset and does not include any meta data (which WindX does automatically).

My code

```python
from rex import WaveX
import os

# region you want to extract datasets for
region = 'California'
region_col = 'jurisdiction'

# Create list of the sub-set of datasets you want to extract
datasets = ['significant_wave_height', 'energy_period', 'time_index', 'meta', 'water_depth']
print(datasets)

# wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
wtk_file = '/nrel/US_wave/West_Coast/West_Coast_wave_2010.h5'
with WaveX(wtk_file, hsds=True) as f:
    out_path = 'testwave.h5'  # You need to supply a valid local file path
    f.save_region(out_path, region, datasets=datasets,
                  region_col=region_col)
```

Traceback (most recent call last):
  File "test_save_region.py", line 16, in <module>
    f.save_region(out_path, region, datasets=datasets,
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/rex/resource_extraction/resource_extraction.py", line 956, in save_region
    ds_out.attrs[k] = v
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/h5py/_hl/attrs.py", line 103, in __setitem__
    self.create(name, data=value)
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/h5py/_hl/attrs.py", line 203, in create
    attr.write(data, mtype=htype2)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5a.pyx", line 390, in h5py.h5a.AttrID.write
  File "h5py/_proxy.pyx", line 65, in h5py._proxy.attr_rw
  File "h5py/_conv.pyx", line 443, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 94, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 248, in h5py._conv.conv_str2vlen
TypeError: Can't implicitly convert non-string objects to strings
MRossol commented 3 years ago

@ryancoe, man you are on a roll! Turns out this is a strange bug in h5pyd: it isn't extracting the attributes properly. I'll add a quick fix to catch the error and skip that attr for now while we wait for a fix from h5pyd.
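The work-around described here (catch the failure and skip that attribute) can be sketched in plain h5py. The attribute names and values below are made up, and `object()` is just a stand-in for a value that h5py cannot serialize, like the malformed arrays h5pyd was returning:

```python
import h5py

# made-up attributes; object() stands in for a value h5py cannot write
attrs = {'units': 'm', 'src_name': 'Hsig', 'dimensions': object()}

with h5py.File('attrs_demo.h5', 'w') as f:
    ds = f.create_dataset('significant_wave_height', data=[0.0])
    for k, v in attrs.items():
        try:
            ds.attrs[k] = v
        except TypeError:
            # skip attributes that cannot be round-tripped; the dataset
            # itself is still written
            pass
```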

If you care, here is what the attributes look like from h5py:

significant_wave_height
IEC_name
H_s <class 'str'>
SWAN_name
HSIGN <class 'str'>
description
Calculated as the zeroth spectral moment (i.e., H_m0) <class 'str'>
dimensions
['time' 'position'] <class 'numpy.ndarray'>
src_name
Hsig <class 'str'>
units
m <class 'str'>
energy_period
IEC_name
T_e <class 'str'>
SWAN_name
TM02 <class 'str'>
description
Spectral width characterizes the relative spreading of energy in the wave spectrum. Large values indicate a wider spectral peak <class 'str'>
dimensions
['time' 'position'] <class 'numpy.ndarray'>
src_name
Tm_10 <class 'str'>
units
s <class 'str'>
time_index
dimensions
['time'] <class 'numpy.ndarray'>
timezone
UTC <class 'str'>
units
GMT <class 'str'>
meta
dimensions
['position'] <class 'numpy.ndarray'>
water_depth
IEC_name
h <class 'str'>
SWAN_name
DEPTH <class 'str'>
description
Grid node depth <class 'str'>
dimensions
['position'] <class 'numpy.ndarray'>
src_name
Depth <class 'str'>
units
m <class 'str'>

What they look like from h5pyd (HSDS)

significant_wave_height
IEC_name
H_s <class 'str'>
SWAN_name
HSIGN <class 'str'>
description
Calculated as the zeroth spectral moment (i.e., H_m0) <class 'str'>
dimensions
[array('time', dtype='<U4') array('position', dtype='<U8')] <class 'numpy.ndarray'>
src_name
Hsig <class 'str'>
units
m <class 'str'>
energy_period
IEC_name
T_e <class 'str'>
SWAN_name
TM02 <class 'str'>
description
Spectral width characterizes the relative spreading of energy in the wave spectrum. Large values indicate a wider spectral peak <class 'str'>
dimensions
[array('time', dtype='<U4') array('position', dtype='<U8')] <class 'numpy.ndarray'>
src_name
Tm_10 <class 'str'>
units
s <class 'str'>
time_index
dimensions
[array('time', dtype='<U4')] <class 'numpy.ndarray'>
timezone
UTC <class 'str'>
units
GMT <class 'str'>
meta
dimensions
[array('position', dtype='<U8')] <class 'numpy.ndarray'>
water_depth
IEC_name
h <class 'str'>
SWAN_name
DEPTH <class 'str'>
description
Grid node depth <class 'str'>
dimensions
[array('position', dtype='<U8')] <class 'numpy.ndarray'>
src_name
Depth <class 'str'>
units
m <class 'str'>
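The difference between the two dumps is that h5pyd returns `dimensions` as an object array of 0-d string arrays rather than a plain string array. A hypothetical normalization helper (not part of rex) could unwrap such values back into plain strings:

```python
import numpy as np

def clean_attr(value):
    """Unwrap h5pyd-style object arrays of 0-d string arrays into a plain
    string array; pass everything else through unchanged."""
    if isinstance(value, np.ndarray) and value.dtype == object:
        return np.array([np.asarray(v).item() for v in value])
    return value

# what h5pyd hands back for the 'dimensions' attribute
dims = np.array([np.array('time'), np.array('position')], dtype=object)
print(clean_attr(dims))  # -> ['time' 'position']
```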
MRossol commented 3 years ago

@ryancoe, just released v0.2.41 with that work-around. NOTE: you won't get the attributes that HSDS can't load; I think it's mainly dimensions.