SuperDARN / pyDARNio

Python library for reading in SuperDARN data
GNU Lesser General Public License v3.0

Restructuring Borealis Array to Site #51

Closed RemingtonRohel closed 1 year ago

RemingtonRohel commented 2 years ago

BUG

Restructuring Borealis v0.6+ antennas_iq files from array to site format

Priority

Example of the bug

>>> import pydarnio
>>> infile = '20221102.2146.29.sas.0.antennas_iq.hdf5'
>>> reader = pydarnio.BorealisRead(infile, 'antennas_iq', 'array')
>>> records = reader.records
The file cannot be restructured due to the  following error: Arrays from 20221102.2146.29.sas.0.antennas_iq.hdf5: Error restructuring BorealisAntennasIq from array to site style: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):
  File "/home/remington/pyDARNio/pydarnio/borealis/borealis_array.py", line 178, in records
    records = self.format._array_to_site(self.arrays)
  File "/home/remington/pyDARNio/pydarnio/borealis/base_format.py", line 1154, in _array_to_site
    timestamp_dict[key][field] = data_dict[field][index_slice]
TypeError: only integer scalar arrays can be converted to a scalar index

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/remington/pyDARNio/pydarnio/borealis/borealis.py", line 186, in records
    return self._reader.records
  File "/home/remington/pyDARNio/pydarnio/borealis/borealis_array.py", line 184, in records
    raise borealis_exceptions.BorealisRestructureError(
pydarnio.exceptions.borealis_exceptions.BorealisRestructureError: The file cannot be restructured due to the  following error: Arrays from 20221102.2146.29.sas.0.antennas_iq.hdf5: Error restructuring BorealisAntennasIq from array to site style: only integer scalar arrays can be converted to a scalar index

Attempts

Print statements in the code led me to find that pulse_phase_offset is the culprit, and for the file in question this field was filled with empty arrays.

I was able to read the file correctly by changing line 2314 of borealis_formats.py to 'pulse_phase_offset': [], (including the trailing comma). The same change may be needed for bfiq files, on line 2200 of the same file.

Data Location

It's a big file (324 MB), reply back if you need it and I can get it to you.

Potential Bug Location

Line 2314 of borealis_formats.py

Potential Solution(s)

Change line 2314 of borealis_formats.py to 'pulse_phase_offset': [],

RemingtonRohel commented 1 year ago

In the course of our regular data flow, the antennas_iq file above was restructured to array format using BorealisRestructure. I was unable to restructure it back to site format using the same method, with the error code and traceback shown above.

Looking at the internals of the file:

>>> infile = '20221108.1630.26.sas.0.antennas_iq.hdf5'
>>> import h5py
>>> f = h5py.File(infile, 'r')
>>> keys = sorted(list(f.keys()))
>>> print(keys)
['agc_status_word', 'antenna_arrays_order', 'beam_azms', 'beam_nums', 'blanked_samples', 'data', 'data_descriptors', 'gps_locked', 'gps_to_system_time_diff', 'int_time', 'lp_status_word', 'noise_at_freq', 'num_beams', 'num_blanked_samples', 'num_sequences', 'num_slices', 'pulse_phase_offset', 'pulses', 'scan_start_marker', 'slice_interfacing', 'sqn_timestamps']
>>> f['pulse_phase_offset'].shape
(2,)
>>> f['pulse_phase_offset'].size
2
>>> f['pulse_phase_offset']
<HDF5 dataset "pulse_phase_offset": shape (2,), type "<i8">
>>> f['pulse_phase_offset'][0]
1420
>>> f['pulse_phase_offset'][1]
0

What I would expect to see for this file (since there were no pulse_phase_offsets used in the experiment) is an empty numpy array. If pulse_phase_offset was actually used for the experiment, I would expect to see an array of shape [num_records, max_num_sequences, num_pulses].

For the corresponding site file, I would expect to see an empty array for the case where no pulse_phase_offset is specified by the experiment, and an array of shape [num_sequences, num_pulses] for the case where it is specified.
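The expectations in the two paragraphs above can be written down as a small lookup. This is only an illustrative sketch: expected_offset_shape and its parameter names are hypothetical and are not part of pyDARNio or Borealis, and (0,) stands in for "an empty numpy array".

```python
def expected_offset_shape(style, offsets_used, num_pulses,
                          num_sequences=0, num_records=0,
                          max_num_sequences=0):
    """Return the shape pulse_phase_offset is expected to have.

    Hypothetical helper, for illustration only.
    style is 'array' or 'site'; offsets_used says whether the
    experiment specified pulse phase offsets at all.
    """
    if not offsets_used:
        # No offsets in the experiment: an empty array in either style.
        return (0,)
    if style == "array":
        return (num_records, max_num_sequences, num_pulses)
    if style == "site":
        return (num_sequences, num_pulses)
    raise ValueError(f"unknown style: {style!r}")
```

Neither the (2,) shape seen in the array file above nor a (1,) shape in a site record matches any of these cases when offsets were not used.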

What I see instead, from looking at another file where the site format was generated and left untouched, is:

>>> import h5py
>>> f = h5py.File('20221109.1739.42.sas.0.antennas_iq.hdf5.site', 'r')
>>> keys = sorted(list(f.keys()))
>>> rec = f[keys[0]]
>>> rec['pulse_phase_offset']
<HDF5 dataset "pulse_phase_offset": shape (1,), type "<i8">
>>> rec['pulse_phase_offset'][:]
array([0])
>>> rec['pulses'][:]
array([ 0,  9, 12, 20, 22, 26, 27], dtype=uint32)

I am not sure why this is the case. From my digging into the Borealis software, I see nothing that indicates that a 0 should be written to file if the experiment does not specify a pulse_phase_offset field.

RemingtonRohel commented 1 year ago

Testing has revealed a problem with h5py/deepdish compatibility. It appears that writing an empty array with deepdish will yield undesired behaviour when reading the associated HDF5 file with h5py. See testing below for a simple example. Borealis currently writes data to file with deepdish, so either Borealis needs to change, or pyDARNio needs to change how it reads in the maximum site dimensions.

>>> import numpy as np
>>> import h5py
>>> import deepdish as dd

# Create HDF5 file with h5py
>>> f = h5py.File('tmp.h5', 'w')
>>> a = np.array([])
>>> print(a.shape)
(0,)
>>> f.create_dataset('a', data=a)
<HDF5 dataset "a": shape (0,), type "<f8">
>>> f.close()

# Try to read in the file with h5py
>>> g = h5py.File('tmp.h5', 'r')
>>> print(g['a'].shape)
(0,)
>>> g.close()

# Read in the file with deepdish
>>> h = dd.io.load('tmp.h5')
>>> h['a'].shape
(0,)
>>> h['a']
array([], dtype=float32)

# Create HDF5 file with deepdish
>>> dd.io.save('tmp2', h)

# Try to read the file with h5py
>>> f2 = h5py.File('tmp2', 'r')
>>> print(f2['a'].shape)                      ### THIS IS DIFFERENT THAN EXPECTED!
(1,)
>>> print(f2['a'])
<HDF5 dataset "a": shape (1,), type "<i8">
>>> f2.close()

# Lastly, confirm that the second file can be opened with deepdish
>>> h2 = dd.io.load('tmp2')
>>> h2
{'a': array([], dtype=float32)}
>>> h2['a'].shape
(0,)

RemingtonRohel commented 1 year ago

For discussion:

Since there are already many files saved by Borealis with deepdish, any solution needs to handle these files gracefully. One possible solution within pyDARNio would be to check that the dimensions of pulse_phase_offset make sense, just before this line in site_get_max_dims() in base_format.py: https://github.com/SuperDARN/pyDARNio/blob/0ea8dff44b3739ac10c09cb595471892f51cc183/pydarnio/borealis/base_format.py#L606. We could cross-reference with num_pulses and num_sequences, or just assume that a size-1 array was meant to be a size-0 array.
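A minimal sketch of the cross-referencing option, assuming the check works on the shape tuple alone (for example the .shape of an h5py dataset or numpy array). The helper name reconcile_offset_shape is hypothetical; it is not the code in site_get_max_dims() and is not an existing pyDARNio function.

```python
from math import prod


def reconcile_offset_shape(shape, num_sequences, num_pulses):
    """Return the shape to use for a site record's pulse_phase_offset.

    Hypothetical helper, for illustration only.
    """
    shape = tuple(shape)
    expected = (num_sequences, num_pulses)
    if shape == expected:
        # Offsets were specified and the dimensions agree with the record.
        return shape
    if prod(shape) <= 1:
        # Size 0, or a size-1 placeholder that does not match the
        # record's own dimensions: treat it as an empty array.
        return (0,)
    raise ValueError(
        f"pulse_phase_offset has shape {shape}, expected {expected} or empty"
    )
```

Because the expected shape is compared first, a genuine single-sequence, single-pulse record with shape (1, 1) is kept as-is, while a (1,) placeholder is mapped to empty. Anything else is reported instead of being silently reshaped.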

I'm open to ideas/suggestions on this, and once we decide on a way to handle this I can go forward with implementing it.

carleyjmartin commented 1 year ago

Your use for Borealis is the main (possibly only?) use case for this part of pyDARNio, so if you think it's the best option for you then I say go for it. I think cross-referencing with num_pulses and num_sequences might be more 'proper' than assuming 1=0 for the shape; you never know, there could be a legitimate size-1 array one day.

RemingtonRohel commented 1 year ago

Sounds great. I will come up with something that works then and open a PR for it, and we can test and review it there.