Closed leewujung closed 3 years ago
For writing to file I think it would be really good if we are explicit about it and set an encoding, that way we always have consistent data types for the converted files: http://xarray.pydata.org/en/v0.17.0/io.html#writing-encoded-data
Right, but I think the question is how we encode that when the variables are still in memory. The operation described here is for when the dataset is serialized to disk. Perhaps we can do a cast, and make sure if the cast results are serialized correctly (with explicit encoding) to disk.
Perhaps we can do a cast, and make sure if the cast results are serialized correctly (with explicit encoding) to disk.
Yea if you do a cast then it will show up as whatever you cast it to in memory, then if we have explicit encoding, it doesn't matter what the data type is in memory, xarray will encode the serialized data based on what you pass in.
ok, the path is set then! Let's do this in v0.5.1 ;)
Let's do this in v0.5.1 ;)
For now should there be a warning for user to "serialize first then process"?
ugh, haha, perhaps it could just be in v0.5.0, otherwise we are doing things twice.
Okay. That's fine with me, if Kavin take a crack at it, I can review and help out. Otherwise, I'll have to wait till Friday to work on this.
@ngkavin : could you read through the above ☝️ and submit a PR on this for @lsetiawan 's review, once you have submitted the PR for docs code block updates? Thanks!
It seems like datetime64 objects are automatically encoded with 'units' and 'calendar' values which can't be overwritten by the user, and will throw an error if we try to specify the 'units' and 'calendar' as attributes of ping_time
. This issue is described in https://github.com/pydata/xarray/issues/3739
The ping_time is provided to SetGroups
as a datetime64
array, but we do something like (self.parser_obj.ping_time - np.datetime64('1900-01-01T00:00:00')) / np.timedelta64(1, 's')
in order to convert that to a float
. We can remove this conversion so that EchoData objects will always have ping_time
as a datetime64
, but I think we would have to convert to an int
or float
right before we save to disk. Is this something that I should make a PR for?
I think we would have to convert to an int or float right before we save to disk
You wouldn't need to reconvert as long as you specify encoding
to your to_netcdf
or to_zarr
function.
to_netcdf
: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html?highlight=to_netcdf#xarray.Dataset.to_netcdf
to_zarr
: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html#xarray-dataset-to-zarr
Docs on xarray data encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data
I'm able to specify the encoding
ds.ping_time.encoding['units'] = 'seconds since 1234-02-03' # random value
ds.ping_time.encoding['calendar'] = 'gregorian'
and can confirm that the ping_time is saved with that encoding, but this is only after I remove 'units' and 'calendar' from ping_time.attrs since I still error without doing so.
What is the encoding on self.parser_obj.ping_time
?
I was thinking instead of overwriting encoding like you are doing on https://github.com/OSOceanAcoustics/echopype/issues/329#issuecomment-832364410
Also that seems more of attribute values rather than encoding!
This encoding happens during export e.g. to_netcdf(file.nc, encoding={'ping_time': {'dtype':'float64'}})
The ping_time is a numpy array so it doesn't have an encoding attribute, unless you mean the dtype which is datetime64
.
I tried your suggestion and can't get it to work with a ping_time with dtype datetime64
,
but it does work if the ping_time is float64
and I want to save as int64
.
Also that seems more of attribute values rather than encoding!
I guess that's just what these values in the encoding looks like since even if I remove the hardcoded units and calendar fields I set above for ds.ping_time.encoding
as well as from the attributes, the saved ping_time
looks like this:
Specifying in the attributes seconds since 1900-01-01
will change the 'units' field in the encoding.
It seems like xarray checks for the attributes and errors first before casting to the specified encoding.
Okay so here's the result of my investigation and my suggested plan of action
You can see that the attributes and the dtype of the two ping times are indeed different.
For open_raw
, echopype manually converts this to the integers and sets the attributes with the appropriate units
and calendar
. When written to netcdf this is actually then written into the encoding of that variable by xarray. You can see the result with ncdump
.
This shows that the group was written correctly with the proper metadata about ping_time
. When actually opening with open_converted
, since by default xarray.open_dataset
has decode_cf=True
. Xarray will actually put those units
and calendar
into encoding of the xarray variable. This is the reason that we don't see them on the attribute anymore, if opening with decode_cf=False
, you can actually see the attributes again.
After the investigation above and seeing how things work, I think we should avoid converting the dtype ourselves like
ping_time = (self.parser_obj.ping_time - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
but instead letting xarray do it's job during serialization, by specifying the encoding.
ping_time = self.parser_obj.ping_time
ds = xr.Dataset({'temperature': (['ping_time'], self.parser_obj.unpacked_data['temperature'])},
coords={'ping_time': (['ping_time'], ping_time,
{'axis': 'T',
'long_name': 'Timestamp of each ping',
'standard_name': 'time'})},
attrs={'long_name': "Water temperature",
'units': "C"})
ds.ping_time.encoding = {'units': 'seconds since 1970-01-01',
'calendar': 'gregorian',
'_FillValue': np.nan,
'dtype': np.float64}
If you do it this way, both open_raw
and open_converted
will have the same view and object within echodata.
Currently when using
open_raw
(to parse data from raw instrument files), the time coordinates (e.g.,ping_time
) in the returnedEchoData
object is encoded infloat64
, whereas when usingopen_converted
(to read from netcdf or zarr files) the time coordinate in the returnedEchoData
object is encoded indatetime64
.This inconsistency seems can be handled by casting the type to
datetime64
when setting the time coordinates in the set group methods, but we need to experiment with both the writing and reading end of this to make sure things are done correctly. We would need a few tests for the various time coordinates also!