OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
98 stars 72 forks source link

Time encoding wrt convention and file writing/reading behavior #329

Closed leewujung closed 3 years ago

leewujung commented 3 years ago

Currently when using open_raw (to parse data from raw instrument files), the time coordinates (e.g., ping_time) in the returned EchoData object is encoded in float64, whereas when using open_converted (to read from netcdf or zarr files) the time coordinate in the returned EchoData object is encoded in datetime64.

This inconsistency seems can be handled by casting the type to datetime64 when setting the time coordinates in the set group methods, but we need to experiment with both the writing and reading end of this to make sure things are done correctly. We would need a few tests for the various time coordinates also!

lsetiawan commented 3 years ago

For writing to file I think it would be really good if we are explicit about it and set an encoding, that way we always have consistent data types for the converted files: http://xarray.pydata.org/en/v0.17.0/io.html#writing-encoded-data

leewujung commented 3 years ago

Right, but I think the question is how we encode that when the variables are still in memory. The operation described here is for when the dataset is serialized to disk. Perhaps we can do a cast, and make sure if the cast results are serialized correctly (with explicit encoding) to disk.

lsetiawan commented 3 years ago

Perhaps we can do a cast, and make sure if the cast results are serialized correctly (with explicit encoding) to disk.

Yea if you do a cast then it will show up as whatever you cast it to in memory, then if we have explicit encoding, it doesn't matter what the data type is in memory, xarray will encode the serialized data based on what you pass in.

leewujung commented 3 years ago

ok, the path is set then! Let's do this in v0.5.1 ;)

lsetiawan commented 3 years ago

Let's do this in v0.5.1 ;)

For now should there be a warning for user to "serialize first then process"?

leewujung commented 3 years ago

ugh, haha, perhaps it could just be in v0.5.0, otherwise we are doing things twice.

lsetiawan commented 3 years ago

Okay. That's fine with me, if Kavin take a crack at it, I can review and help out. Otherwise, I'll have to wait till Friday to work on this.

leewujung commented 3 years ago

@ngkavin : could you read through the above ☝️ and submit a PR on this for @lsetiawan 's review, once you have submitted the PR for docs code block updates? Thanks!

ngkavin commented 3 years ago

It seems like datetime64 objects are automatically encoded with 'units' and 'calendar' values which can't be overwritten by the user, and will throw an error if we try to specify the 'units' and 'calendar' as attributes of ping_time. This issue is described in https://github.com/pydata/xarray/issues/3739

Error Message ` ValueError failed to prevent overwriting existing key units in attrs on variable 'ping_time'. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually. `

The ping_time is provided to SetGroups as a datetime64 array, but we do something like (self.parser_obj.ping_time - np.datetime64('1900-01-01T00:00:00')) / np.timedelta64(1, 's') in order to convert that to a float. We can remove this conversion so that EchoData objects will always have ping_time as a datetime64, but I think we would have to convert to an int or float right before we save to disk. Is this something that I should make a PR for?

lsetiawan commented 3 years ago

I think we would have to convert to an int or float right before we save to disk

You wouldn't need to reconvert as long as you specify encoding to your to_netcdf or to_zarr function.

to_netcdf: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html?highlight=to_netcdf#xarray.Dataset.to_netcdf to_zarr: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html#xarray-dataset-to-zarr

Docs on xarray data encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data

ngkavin commented 3 years ago

I'm able to specify the encoding

ds.ping_time.encoding['units'] = 'seconds since 1234-02-03'   # random value
ds.ping_time.encoding['calendar'] = 'gregorian'

and can confirm that the ping_time is saved with that encoding, but this is only after I remove 'units' and 'calendar' from ping_time.attrs since I still error without doing so.

lsetiawan commented 3 years ago

What is the encoding on self.parser_obj.ping_time?

lsetiawan commented 3 years ago

I was thinking instead of overwriting encoding like you are doing on https://github.com/OSOceanAcoustics/echopype/issues/329#issuecomment-832364410

Also that seems more of attribute values rather than encoding!

This encoding happens during export e.g. to_netcdf(file.nc, encoding={'ping_time': {'dtype':'float64'}})

ngkavin commented 3 years ago

The ping_time is a numpy array so it doesn't have an encoding attribute, unless you mean the dtype which is datetime64.

I tried your suggestion and can't get it to work with a ping_time with dtype datetime64, but it does work if the ping_time is float64 and I want to save as int64.

Also that seems more of attribute values rather than encoding!

I guess that's just what these values in the encoding looks like since even if I remove the hardcoded units and calendar fields I set above for ds.ping_time.encoding as well as from the attributes, the saved ping_time looks like this:

image

Specifying in the attributes seconds since 1900-01-01 will change the 'units' field in the encoding.

It seems like xarray checks for the attributes and errors first before casting to the specified encoding.

lsetiawan commented 3 years ago

Okay so here's the result of my investigation and my suggested plan of action

Investigation

You can see that the attributes and the dtype of the two ping times are indeed different.

open_raw

xarray

open_converted

xarray

For open_raw, echopype manually converts this to the integers and sets the attributes with the appropriate units and calendar. When written to netcdf this is actually then written into the encoding of that variable by xarray. You can see the result with ncdump.

ncdump -h temp_echopype_output/17031001.nc

ncdump

This shows that the group was written correctly with the proper metadata about ping_time. When actually opening with open_converted, since by default xarray.open_dataset has decode_cf=True. Xarray will actually put those units and calendar into encoding of the xarray variable. This is the reason that we don't see them on the attribute anymore, if opening with decode_cf=False, you can actually see the attributes again.

xr.open_dataset

xarray

Solution

After the investigation above and seeing how things work, I think we should avoid converting the dtype ourselves like

ping_time = (self.parser_obj.ping_time - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')

but instead letting xarray do it's job during serialization, by specifying the encoding.

ping_time = self.parser_obj.ping_time
ds = xr.Dataset({'temperature': (['ping_time'], self.parser_obj.unpacked_data['temperature'])},
                coords={'ping_time': (['ping_time'], ping_time,
                        {'axis': 'T',
                         'long_name': 'Timestamp of each ping',
                         'standard_name': 'time'})},
                attrs={'long_name': "Water temperature",
                       'units': "C"})
ds.ping_time.encoding = {'units': 'seconds since 1970-01-01', 
                         'calendar': 'gregorian', 
                         '_FillValue': np.nan, 
                         'dtype': np.float64}

If you do it this way, both open_raw and open_converted will have the same view and object within echodata.