OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
96 stars 73 forks source link

Fix Xarray Duck Array Ops `invalid value encountered in cast` Runtime Warning in `ep.open_raw` #1335

Closed ctuguinay closed 3 months ago

ctuguinay commented 3 months ago

When opening and saving .RAW files from OOI, I encountered some duck_array_ops warnings:

# Save Echodata objects locally
def open_and_save(raw_file, sonar_model, use_swap, save_path, open_kwargs={}):
    try:
        ed = ep.open_raw(raw_file=raw_file, sonar_model=sonar_model, use_swap=use_swap, **open_kwargs)
        ed.to_zarr(save_path, overwrite=True, compute=True)
    except Exception as e:
        print("Error with Exception: ", e)

# Parse EK60 `.RAW` file and save to Zarr Store
open_and_save_futures = []
for raw_file_url in desired_raw_file_urls:
    open_and_save_future = client.submit(
        open_and_save,
        raw_file=raw_file_url,
        sonar_model='ek60',
        use_swap=True,
        save_path=single_echodata_zarr_dpath
    )
    open_and_save_futures.append(open_and_save_future)
open_and_save_futures = client.gather(open_and_save_futures)
/home/exouser/miniforge3/envs/echopype/lib/python3.9/site-packages/xarray/core/duck_array_ops.py:216: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)
/home/exouser/miniforge3/envs/echopype/lib/python3.9/site-packages/xarray/core/duck_array_ops.py:216: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)
/home/exouser/miniforge3/envs/echopype/lib/python3.9/site-packages/xarray/core/duck_array_ops.py:216: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)

using xarray version '2024.5.0'.

I'm not quite sure what this is about, but this is probably something small and it should be fixed before any new echopype-examples notebook PRs get merged. I think it can be very disheartening seeing all these warnings pop up when using our package for the first time.

ctuguinay commented 3 months ago

So the problem here is a bit convoluted. I'll detail it step-by-step here:

Values in the Parser's ping_data_dict can be of different lengths depending on the channel. Let's take transmit mode for example:

image

These are then turned into DataArrays in the setting of groups and concatenated together by channel. Any 'missing' values will be turned into NaNs in this process.

Some of these values are then stored as np.byte or np.int in utils.coding.py:

image

This is a problem since setting the type of these arrays with np.byte and np.int will spit out a warning if it encounters something that can't be encoded to either of these, and more specifically, it will spit out a warning when encountering NaN values and turn the NaNs into 0s:

image

In a separate PR I will address this by not changing the type and instead changing these NaN values to the appropriate integer value, and I will focus just on transmit_mode/channel_mode since these values were the only ones that I noticed causing any problems. I will change any NaN values to -1 since that is what is shown in the attributes:

                "channel_mode": (
                    ["ping_time"],
                    np.array(self.parser_obj.ping_data_dict["transmit_mode"][ch], dtype=np.byte),
                    {
                        "long_name": "Transceiver mode",
                        "flag_values": [-1, 0, 1, 2],
                        "flag_meanings": ["Unknown", "Active", "Passive", "Test"],
                        "comment": "From transmit_mode in the EK60 datagram",
                    },
                ),
leewujung commented 3 months ago

@ctuguinay : Thanks for the investigation. Do you know why the 2 channels are shorter (305 and 304 entries for transmit_mode)? Are those 2 channels actually have smaller number of pings? (i.e. are there 609-305 and 609-304 all-NaN pings across the ping_time dimension? If so, it means that these channels were not pinging during the "missing" times. In this case, encoding these as -1 seems misleading and NaN is actually more accurate. But yeah, then we'll have to change the type to be np.float to allow NaN.

ctuguinay commented 3 months ago

Ah I see I see. I was under the impression that not pinging would be considered "Unknown" but I think I agree with you now that it should just be NaN. I'll make the type change to np.float.