@ryancoe If you are interested in entire datasets, I'd recommend that you download the entire source file from S3: s3://nrel-pds-nsrdb/v3/nsrdb_2018.h5
Our HSDS service doesn't have the bandwidth to extract an entire dataset, which is ~71 GB.
Thanks @MRossol! I was new to AWS, but found the following to work fine (posting here in case someone else finds it useful):
```
aws s3 ls s3://nrel-pds-nsrdb/v3/puerto_rico/ --no-sign-request --human-readable
aws s3 cp s3://nrel-pds-nsrdb/v3/puerto_rico/nsrdb_puerto_rico_2017.h5 --no-sign-request ~/Downloads/
```
One further question, if you don't mind: it appears the full `h5` files (e.g., s3://nrel-pds-nsrdb/v3/nsrdb_1998.h5) are actually ~1.5 TB. I guess what I was really hoping to do with my original question was download a subset of the data I'm interested in using `rex` (probably via the Python interface) and save that as an `h5` file. I believe this might go something like:
```python
from rex import NSRDBX

nsrdb_file = '/nrel/nsrdb/v3/nsrdb_2018.h5'
with NSRDBX(nsrdb_file, hsds=True) as f:
    ghi = f.get_region_df('ghi', 'California')  # this puts the data in memory as a Pandas DataFrame
    ghi.to_netcdf('savepath.h5')  # this is the part that I'm missing

    # alternatively, it would be nice to just directly save the data to
    # disk without putting the whole thing in memory
    f.get_region_df('ghi', 'California').to_netcdf('savepath.h5')  # something like this?
```
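(Aside on the save step above: a pandas DataFrame has no `to_netcdf` method, which is why that part fails. A minimal sketch of one way to write the frame to HDF5 with pandas itself, assuming PyTables is installed; the key name is arbitrary:)

```python
from rex import NSRDBX

nsrdb_file = '/nrel/nsrdb/v3/nsrdb_2018.h5'
with NSRDBX(nsrdb_file, hsds=True) as f:
    ghi = f.get_region_df('ghi', 'California')  # pandas DataFrame in memory

# DataFrame.to_hdf writes a single pandas table to an HDF5 file
# (requires the PyTables package); 'ghi' is an arbitrary storage key
ghi.to_hdf('savepath.h5', key='ghi', mode='w')
```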
@ryancoe, do you need all of the variables for the given region or just one? If it's just one, we normally just dump the table produced by `get_region_df` to a .csv file. If you would like all of the variables (or a subset of them), I can try to find time to add a new method to dump a region to a new .h5 file.
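For reference, that CSV route might look like this (a minimal sketch; the output filename is hypothetical):

```python
from rex import NSRDBX

nsrdb_file = '/nrel/nsrdb/v3/nsrdb_2018.h5'
with NSRDBX(nsrdb_file, hsds=True) as f:
    # one dataset for one region -> pandas DataFrame (time x site)
    ghi = f.get_region_df('ghi', 'California')

ghi.to_csv('ghi_california.csv')  # hypothetical output path
```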
@MRossol - Thanks very much for your quick reply and help! I think that I only want a subset of the variables, so if it is easy to create a method for dumping that to an `h5`, that'd be really cool!
@ryancoe see v0.2.39, specifically the new `save_region` method.
Here is the test for an example of how to implement it: https://github.com/NREL/rex/blob/ecaca4cc311a1410fcece785c075a3362d64a6e8/tests/test_resource_extraction.py#L752-L753
@MRossol - First off, thanks so much for your help with this, I really appreciate it! I tried running the following based on your test. I hope that I'm not missing something silly, but that's definitely a possibility.
```python
from rex import WindX
import os
import tempfile

region = 'Providence'
region_col = 'county'
datasets = ['windspeed_100m', None, ['windspeed_100m']]

wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
with WindX(wtk_file, hsds=True) as f:
    for dset in datasets:
        print(dset)
        with tempfile.TemporaryDirectory() as td:
            out_path = os.path.join(td, 'test.h5')
            try:
                f.save_region(out_path, region, datasets=dset,
                              region_col=region_col)
            except Exception as e:
                print(e)
```
As you can see, I was trying a couple of ways of setting the `datasets` argument, but all of them return similar errors:
```
Traceback (most recent call last):
  File "test_save_region.py", line 19, in <module>
    f.save_region(out_path, region, datasets=None,
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/rex/resource_extraction/resource_extraction.py", line 935, in save_region
    data = ds[gids]
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/dataset.py", line 729, in __getitem__
    selection = sel.select(self, args)
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/selections.py", line 88, in select
    sel[arg]
  File "/Users/rcoe/.local/lib/python3.8/site-packages/h5pyd/_hl/selections.py", line 267, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
```
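(The error says h5pyd's point selection only accepts boolean arrays, while `save_region` indexes the dataset with an integer gid list. As an illustrative sketch only, not the fix that was actually shipped, one way around that limitation is to convert the gids to a boolean mask before slicing:)

```python
import numpy as np

def read_gids(ds, gids):
    """Hypothetical workaround: h5pyd point selection rejects integer
    index lists, so build a boolean mask over the spatial axis instead."""
    mask = np.zeros(ds.shape[1], dtype=bool)
    mask[gids] = True
    return ds[:, mask]
```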
@ryancoe, two things: 1) On your end you can greatly simplify your code to look like this:
```python
from rex import WindX

# region you want to extract datasets for
region = 'Providence'
region_col = 'county'

# create a list of the sub-set of datasets you want to extract
datasets = ['{}_100m'.format(v) for v in ['windspeed', 'winddirection', 'pressure', 'temperature']]

wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
with WindX(wtk_file, hsds=True) as f:
    out_path = 'wtk_providence_2014.h5'  # you need to supply a valid local file path
    f.save_region(out_path, region, datasets=datasets,
                  region_col=region_col)
```
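As a hedged follow-up, one way to sanity-check the file this writes is to open it locally with rex's base `Resource` class (the file name here matches the placeholder above):

```python
from rex import Resource

# open the local file written by save_region and inspect its contents
with Resource('wtk_providence_2014.h5') as res:
    print(res.datasets)         # dataset names stored in the file
    print(res.meta.head())      # site metadata for the extracted region
    ws = res['windspeed_100m']  # array for one dataset
```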
2) You managed to find a bug that my tests didn't catch for some strange reason. I'm fixing it now and will release v0.2.40 momentarily.
@MRossol - Awesome, thanks! This works great for `WindX`, but I get the following error when trying to use `WaveX`. The `h5` still saves, but it has only the first dataset and does not include any `meta` data (which `WindX` does automatically).
```python
from rex import WaveX

# region you want to extract datasets for
region = 'California'
region_col = 'jurisdiction'

# create a list of the sub-set of datasets you want to extract
datasets = ['significant_wave_height', 'energy_period', 'time_index', 'meta', 'water_depth']
print(datasets)

# wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
wtk_file = '/nrel/US_wave/West_Coast/West_Coast_wave_2010.h5'
with WaveX(wtk_file, hsds=True) as f:
    out_path = 'testwave.h5'  # you need to supply a valid local file path
    f.save_region(out_path, region, datasets=datasets,
                  region_col=region_col)
```
```
Traceback (most recent call last):
  File "test_save_region.py", line 16, in <module>
    f.save_region(out_path, region, datasets=datasets,
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/rex/resource_extraction/resource_extraction.py", line 956, in save_region
    ds_out.attrs[k] = v
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/h5py/_hl/attrs.py", line 103, in __setitem__
    self.create(name, data=value)
  File "/Users/rcoe/anaconda3/envs/waveResource/lib/python3.8/site-packages/h5py/_hl/attrs.py", line 203, in create
    attr.write(data, mtype=htype2)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5a.pyx", line 390, in h5py.h5a.AttrID.write
  File "h5py/_proxy.pyx", line 65, in h5py._proxy.attr_rw
  File "h5py/_conv.pyx", line 443, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 94, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 248, in h5py._conv.conv_str2vlen
TypeError: Can't implicitly convert non-string objects to strings
```
@ryancoe, man you are on a roll! Turns out this is a strange bug in h5pyd: it isn't extracting the attributes properly. I'll add a quick fix to catch the error and skip that attr for now while we wait for a fix from h5pyd.
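A minimal sketch of that skip-on-error pattern, with illustrative names rather than the actual rex source:

```python
def copy_attrs(ds_in, ds_out):
    """Copy HDF5 dataset attributes, skipping any value that h5py
    refuses to serialize (e.g., the mangled 'dimensions' arrays below)."""
    for k, v in ds_in.attrs.items():
        try:
            ds_out.attrs[k] = v
        except (TypeError, OSError):
            pass  # attribute could not be written; skip it
```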
If you care, here is what the attributes look like from h5py:
```
significant_wave_height
    IEC_name: H_s <class 'str'>
    SWAN_name: HSIGN <class 'str'>
    description: Calculated as the zeroth spectral moment (i.e., H_m0) <class 'str'>
    dimensions: ['time' 'position'] <class 'numpy.ndarray'>
    src_name: Hsig <class 'str'>
    units: m <class 'str'>
energy_period
    IEC_name: T_e <class 'str'>
    SWAN_name: TM02 <class 'str'>
    description: Spectral width characterizes the relative spreading of energy in the wave spectrum. Large values indicate a wider spectral peak <class 'str'>
    dimensions: ['time' 'position'] <class 'numpy.ndarray'>
    src_name: Tm_10 <class 'str'>
    units: s <class 'str'>
time_index
    dimensions: ['time'] <class 'numpy.ndarray'>
    timezone: UTC <class 'str'>
    units: GMT <class 'str'>
meta
    dimensions: ['position'] <class 'numpy.ndarray'>
water_depth
    IEC_name: h <class 'str'>
    SWAN_name: DEPTH <class 'str'>
    description: Grid node depth <class 'str'>
    dimensions: ['position'] <class 'numpy.ndarray'>
    src_name: Depth <class 'str'>
    units: m <class 'str'>
```
And here is what they look like from h5pyd (HSDS):
```
significant_wave_height
    IEC_name: H_s <class 'str'>
    SWAN_name: HSIGN <class 'str'>
    description: Calculated as the zeroth spectral moment (i.e., H_m0) <class 'str'>
    dimensions: [array('time', dtype='<U4') array('position', dtype='<U8')] <class 'numpy.ndarray'>
    src_name: Hsig <class 'str'>
    units: m <class 'str'>
energy_period
    IEC_name: T_e <class 'str'>
    SWAN_name: TM02 <class 'str'>
    description: Spectral width characterizes the relative spreading of energy in the wave spectrum. Large values indicate a wider spectral peak <class 'str'>
    dimensions: [array('time', dtype='<U4') array('position', dtype='<U8')] <class 'numpy.ndarray'>
    src_name: Tm_10 <class 'str'>
    units: s <class 'str'>
time_index
    dimensions: [array('time', dtype='<U4')] <class 'numpy.ndarray'>
    timezone: UTC <class 'str'>
    units: GMT <class 'str'>
meta
    dimensions: [array('position', dtype='<U8')] <class 'numpy.ndarray'>
water_depth
    IEC_name: h <class 'str'>
    SWAN_name: DEPTH <class 'str'>
    description: Grid node depth <class 'str'>
    dimensions: [array('position', dtype='<U8')] <class 'numpy.ndarray'>
    src_name: Depth <class 'str'>
    units: m <class 'str'>
```
@ryancoe, just released v0.2.41 with that work-around. NOTE: you won't get the attributes that HSDS can't load; I think it's mainly `dimensions`.
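If a skipped attribute matters downstream, a hedged sketch of restoring it by hand on the local copy, using the values from the h5py listing above (file and dataset names assumed from the `WaveX` snippet):

```python
import h5py

# re-add the 'dimensions' attributes that the work-around skipped;
# the correct values are known from the h5py listing above
with h5py.File('testwave.h5', 'a') as f:
    f['significant_wave_height'].attrs['dimensions'] = ['time', 'position']
    f['water_depth'].attrs['dimensions'] = ['position']
```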
Why this feature is necessary: Is it possible to download an entire dataset (e.g., /nrel/nsrdb/v3/nsrdb_2018.h5) as some sort of binary file?

I have considered the following alternatives: I tried using the CLI, but this appears to output `csv` files.

To give more insight: what I would eventually like to do is work with the data using `xarray` and `dask`. Downloading huge `csv` files is workable, but it seems like there is probably a better way that I'm unaware of. Thanks very much ;)
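On the `xarray`/`dask` goal: once a region file exists locally (e.g., one written by `save_region`), it can be loaded into xarray by hand. A hedged sketch; the dataset names and the absence of a scale factor are assumptions:

```python
import h5py
import pandas as pd
import xarray as xr

# hypothetical: build an xarray.DataArray from a local region file;
# assumes 'ghi' and 'time_index' datasets exist and that 'ghi' needs
# no unit scaling (rex files often store scaled integers)
with h5py.File('savepath.h5', 'r') as f:
    time_index = pd.to_datetime(f['time_index'][...].astype(str))
    ghi = xr.DataArray(f['ghi'][...],
                       dims=('time', 'gid'),
                       coords={'time': time_index})
```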