Closed RemingtonRohel closed 1 year ago
In the course of our regular data flow, the antennas_iq file above was restructured to array format using BorealisRestructure. I was unable to restructure it back to site format using the same method, with the error code and traceback shown above.
Looking at the internals of the file:
>>> infile = '20221108.1630.26.sas.0.antennas_iq.hdf5'
>>> import h5py
>>> with h5py.File(infile, 'r') as f:
...
File "<stdin>", line 2
^
IndentationError: expected an indented block
>>> f = h5py.File(infile, 'r')
>>> keys = sorted(list(f.keys()))
>>> print(keys)
['agc_status_word', 'antenna_arrays_order', 'beam_azms', 'beam_nums', 'blanked_samples', 'data', 'data_descriptors', 'gps_locked', 'gps_to_system_time_diff', 'int_time', 'lp_status_word', 'noise_at_freq', 'num_beams', 'num_blanked_samples', 'num_sequences', 'num_slices', 'pulse_phase_offset', 'pulses', 'scan_start_marker', 'slice_interfacing', 'sqn_timestamps']
>>> f['pulse_phase_offset'].shape
(2,)
>>> f['pulse_phase_offset'].size
2
>>> f['pulse_phase_offset']
<HDF5 dataset "pulse_phase_offset": shape (2,), type "<i8">
>>> f['pulse_phase_offset'][0]
1420
>>> f['pulse_phase_offset'][1]
0
What I would expect to see for this file (since no pulse_phase_offsets were used in the experiment) is an empty numpy array. If pulse_phase_offset was actually used for the experiment, I would expect to see an array of shape [num_records, max_num_sequences, num_pulses].
For the corresponding site file, I would expect to see an empty array for the case where no pulse_phase_offset is specified by the experiment, and an array of shape [num_sequences, num_pulses] for the case where it is specified.
What I see instead, from looking at another file where the site format was generated and left untouched, is:
>>> import h5py
>>> f = h5py.File('20221109.1739.42.sas.0.antennas_iq.hdf5.site', 'r')
>>> keys = sorted(list(f.keys()))
>>> rec = f[keys[0]]
>>> rec['pulse_phase_offset']
<HDF5 dataset "pulse_phase_offset": shape (1,), type "<i8">
>>> rec['pulse_phase_offset'][:]
array([0])
>>> rec['pulses'][:]
array([ 0, 9, 12, 20, 22, 26, 27], dtype=uint32)
From my digging into the Borealis software, I am not quite sure why this is the case. I see nothing that indicates that a 0 should be written to file if the experiment does not specify a pulse_phase_offset field.
Testing has revealed a problem with h5py/deepdish compatibility. It appears that writing an empty array with deepdish will yield undesired behaviour when reading the associated HDF5 file with h5py. See testing below for a simple example. Borealis currently writes data to file with deepdish, so either Borealis needs to change, or pyDARNio needs to change how it reads in the maximum site dimensions.
>>> import numpy as np
>>> import h5py
>>> import deepdish as dd
# Create HDF5 file with h5py
>>> f = h5py.File('tmp.h5', 'w')
>>> a = np.array([])
>>> print(a.shape)
(0,)
>>> f.create_dataset('a', data=a)
<HDF5 dataset "a": shape (0,), type "<f8">
>>> f.close()
# Try to read in the file with h5py
>>> g = h5py.File('tmp.h5', 'r')
>>> print(g['a'].shape)
(0,)
>>> g.close()
# Read in the file with deepdish
>>> h = dd.io.load('tmp.h5')
>>> h['a'].shape
(0,)
>>> h['a']
array([], dtype=float32)
# Create HDF5 file with deepdish
>>> dd.io.save('tmp2', h)
# Try to read the file with h5py
>>> f2 = h5py.File('tmp2', 'r')
>>> print(f2['a'].shape) ### THIS IS DIFFERENT THAN EXPECTED!
(1,)
>>> print(f2['a'])
<HDF5 dataset "a": shape (1,), type "<i8">
>>> f2.close()
# Lastly, confirm that the second file can be opened with deepdish
>>> h2 = dd.io.load('tmp2')
>>> h2
{'a': array([], dtype=float32)}
>>> h2['a'].shape
(0,)
For discussion:
Since there are already many files saved by Borealis with deepdish, any solution needs to handle these files elegantly. One possible solution within pyDARNio could be to check, just before this line in site_get_max_dims() in base_format.py:
https://github.com/SuperDARN/pyDARNio/blob/0ea8dff44b3739ac10c09cb595471892f51cc183/pydarnio/borealis/base_format.py#L606
that the dimensions of pulse_phase_offset make sense. We could cross-reference with num_pulses and num_sequences, or just assume that a size-1 array was meant to be a size-0 array.
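The cross-referencing option could look roughly like the sketch below. It is only an illustration of the idea, not code from pyDARNio: the function name resolve_pulse_phase_offset_shape and its arguments are hypothetical, and it works on plain shape tuples so that it does not depend on how the record is read.

```python
from math import prod


def resolve_pulse_phase_offset_shape(stored_shape, num_sequences, num_pulses):
    """Hypothetical helper: pick the pulse_phase_offset shape to trust for one site record.

    stored_shape  -- shape reported for the dataset when the file is read
    num_sequences -- number of sequences in the record
    num_pulses    -- number of pulses in the sequence (length of the pulses field)
    """
    stored_shape = tuple(int(d) for d in stored_shape)
    expected = (num_sequences, num_pulses)

    # Offsets were specified and the stored shape matches the record.
    if stored_shape == expected:
        return expected

    # A genuinely empty array, or the size-1 array seen in the deepdish-written
    # files, is treated as "no offsets specified".
    if prod(stored_shape) == 0 or stored_shape == (1,):
        return (0,)

    # Anything else is unexpected; fail loudly instead of guessing.
    raise ValueError(
        f"pulse_phase_offset has shape {stored_shape}, expected {expected} or empty"
    )


print(resolve_pulse_phase_offset_shape((1,), 30, 7))
print(resolve_pulse_phase_offset_shape((30, 7), 30, 7))
```

With this approach a legitimate record with one sequence and one pulse would have the expected shape (1, 1), which is distinguishable from the stray shape (1,), so it would not be mistaken for an empty array.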
I'm open to ideas/suggestions on this, and once we decide on a way to handle this I can go forward with implementing it.
Your use for Borealis is the main (possibly only?) use case for this part of pyDARNio, so if you think it's the best option for you then I say go for it.
I think cross-referencing with num_pulses and num_sequences might be more 'proper' than assuming 1=0 for the shape; you never know, there could be a legitimate size-1 array one day.
Sounds great. I will come up with something that works then and open a PR for it, and we can test and review it there.
BUG
Restructuring Borealis v0.6+ antennas_iq files from array to site format
Priority
Example of the bug
Attempts
Print statements in the code led me to find that pulse_phase_offset is the culprit, and for the file in question this field was filled with empty arrays. I was able to correctly deal with the file by changing Line 2314 of borealis_formats.py to 'pulse_phase_offset': [],
This may need to be done for bfiq files as well, on Line 2200 of the same file.
Data Location
It's a big file (324 MB), reply back if you need it and I can get it to you.
Potential Bug Location
Line 2314 of borealis_formats.py
Potential Solution(s)
Change line 2314 of borealis_formats.py to 'pulse_phase_offset': [],