c-proof / pyglider

glider software
https://pyglider.readthedocs.io/
Apache License 2.0
17 stars 24 forks source link

time_base='union' method #171

Open smwoodman opened 2 months ago

smwoodman commented 2 months ago

first pass at a time_base='union' method for binary_to_timeseries also changed several np.NaN to np.nan, per error from numpy 2.0

smwoodman commented 2 months ago

Thank you for the comments!

I added a commit reverting to NaN (although see comment above), putting the default method first, and starting doc updates. Other comments/questions follow

smwoodman commented 2 months ago

This isn't super clear, and not what I understood was going to happen here. If there are two science sensors on different time bases, I thought both times were going to be logged and NaN inserted for the sensors on a different time base.

I expected that too. In testing though, the dbdreader MultiDBD.get() behavior seemed to be to put all sci variables on sci_m_present_time. I.e., all science variables had the same number of values and returned the same timestamps. I think this is because of here?

Not clear on the "rounding to nearest second part". Surely there are plenty of sensors sampled faster than 1 Hz that this will be a bad assumption for. I'm not clear why you are doing this.

This I think is the key question for this method being general enough. Yep, there are plenty of sensors that sampled faster than 1Hz. Because index of the binary_to_timeseries() output is rounded to/returned as seconds) rather than eg ns, I felt that the merging should happen using the same time units. Otherwise, the merging would happen at the ns scale, but then the final index would be at the second scale and so there would be lots of timestamps with the same value.

Why would the engineering sensors be treated differently than a mismatched science sensor? This seems a mishmash of the two approaches, and you could imagine someone wanting the raw engineering data.

Very fair point, I'll update it. I did this based on our needs/interests, and agree it's inconsistent. Although, should the latitude and longitude values still be interpolated in your opinion?

smwoodman commented 1 month ago

Hi @jklymak, I added a commit such that no engineering values (including lat/lon) are interpolated.

I also wanted to check in about your thoughts on the above?

jklymak commented 1 month ago

OK< let's fix this step by step - I think times should definitely not be rounded, so that is a bug. I'll fix now.

jklymak commented 1 month ago

See #177 for better handling of the time.

smwoodman commented 1 month ago

Changes from #177 (and #173, #175, and #176) merged into this pr, and union of sci/eng times now happens at the int level

smwoodman commented 3 weeks ago

Hey @jklymak Sorry for the extra ping, but when you have a chance are there any other adjustments that I can make to this pr at the moment? Or is it close to something you're comfortable with?

jklymak commented 3 weeks ago

OK, my apologies - I guess I still think that this is too orthogonal to the main timeseries method. If you really want to go this way, can we just make a new method binary_to_raw_timeseries, name up to workshopping?

I understand the potential appeal of such a data set, but actually strongly feel you will end up in the end with something like what binary_to_timeseries gives you in the end. If I was really concerned about lining everything up with the CTD, I'd make separate netcdf files for each sensor rather than a single file full of NaNs, and binary_to_timeseries can already do this.

smwoodman commented 3 weeks ago

Thanks for your explanation and time with this! Totally respected wrt your feelings on this being too specific of a want for pyglider.

If you really want to go this way, can we just make a new method binary_to_raw_timeseries, name up to workshopping?

I don't follow how binary_to_raw_timeseries would be structured - it would be a new method with basically the same code as binary_to_timeseries, except for using dbd.get(*dbd.parameterNames, return_nans=True).?

Also, a question regarding a wrapper that eg makes a few binary_to_timeseries calls and merges relevant sensor values and metadata. Since utils.get_profile_new requires 'pressure' and can't identify profiles using 'm_depth', how do you recommend getting profile info for eg an engineering-only netcdf file, or a science netcdf file where science is only sampled on dives? Obvs let me know if this should be a separate issue/discussion.

jklymak commented 3 weeks ago

I don't follow how binary_to_raw_timeseries would be structured - it would be a new method with basically the same code as binary_to_timeseries, except for using dbd.get(*dbd.parameterNames, return_nans=True).?

Basically - if there is a lot of commonality, then we could think about refactoring. You have looked at it more carefully than I have though - if you really think it would be shorter to have the if/else loops as you've done here we should continue the discussion. I'm just looking at it from an end-user point of view that these are some different methods.

Since utils.get_profile_new requires 'pressure' and can't identify profiles using 'm_depth', how do you recommend getting profile info for eg an engineering-only netcdf file,

Yeah, we are discussing factoring out the profile creator. If needed we can quickly have a "skip profiles" flag that would work as well.

smwoodman commented 3 weeks ago

I'm just looking at it from an end-user point of view that these are some different methods.

Ahh thank you - it just finally clicked for me what 'raw' and 'timeseries' mean in pyglider world, so I now appreciate the importance of specifying 'raw' for such a method.

Basically - if there is a lot of commonality, then we could think about refactoring. You have looked at it more carefully than I have though - if you really think it would be shorter to have the if/else loops as you've done here we should continue the discussion.

There would be some duplication . However, after really looking at it I don't think it would be unreasonable as long as this chunk was factored out, ie was its own function that could be called by both binary_to_timeseries and binary_to_raw_timeseries (and maybe raw_to_timeseries as well). Name proposal timeseries_meta.?

If this all feels better to you, I'd be happy to take a stab at it? And if so, would you prefer that I update this pr or make a new one?

smwoodman commented 3 weeks ago

Yeah, we are discussing factoring out the profile creator. If needed we can quickly have a "skip profiles" flag that would work as well.

No need to change this quickly imo. To clarify, is part of the refactoring considering adding an argument to get_profile_new that would let the user specify which data variable to use to identify profiles, eg 'pressure' or 'm_depth'? And if not would you be open to me opening a new issue for this?