SECOORA / GUTILS

🌊 🚤 Python utilities for reading, merging, and post processing Teledyne Webb Slocum Glider data
MIT License

Echometrics improvements early 2023 #22

Closed jr3cermak closed 6 months ago

jr3cermak commented 1 year ago

Initiate PR with additional updates forthcoming. This is primarily to check results of upstream CI tests before proceeding with additional work.

jr3cermak commented 1 year ago

CI updates in progress via #20.

Tests within Docker pass and match independent testing.

==== 27 passed, 8 deselected, 1 xpassed, 3034 warnings in 267.18s (0:04:27) ====

kwilcox commented 1 year ago

@jr3cermak Can you rebase? CI changes are merged

kwilcox commented 1 year ago

Can we completely replace convertDBD.sh (and the slocum binaries) with functionality from dbdreader at this point?

jr3cermak commented 1 year ago

It would be great to skip the ascii conversion stage and promote the important bits into the second-stage (netcdf) code. We would then have direct access to data frames instead of having to read them from intermediate files. I am not sure dbdreader can completely replace everything the slocum binaries can do. For straight ascii conversion, yes. For our data processing pipeline, we have moved completely away from the slocum binaries. The ascii conversion done by the slocum binaries does not provide enough floating point precision to decode the embedded echograms, which is why dbdreader is necessary. After these updates (other PRs will follow), we can take a deeper dive into the ascii and convertDBD.sh code and see what can be done.
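To make the precision point concrete, here is a minimal sketch of why a rounded text representation can destroy data carried in the low-order bits of a float. The thread does not describe how the echogram is actually embedded, so the bit-packing scheme, the helper names `hide_payload` and `read_payload`, and the six-significant-digit text format are all assumptions made purely for illustration.

```python
import struct


def hide_payload(base: float, payload: int, nbits: int = 16) -> float:
    """Overwrite the lowest `nbits` mantissa bits of a float64 with `payload`.

    Hypothetical encoding, used only to illustrate precision loss.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", base))
    mask = (1 << nbits) - 1
    raw = (raw & ~mask) | (payload & mask)
    (value,) = struct.unpack("<d", struct.pack("<Q", raw))
    return value


def read_payload(value: float, nbits: int = 16) -> int:
    """Recover the lowest `nbits` mantissa bits of a float64."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return raw & ((1 << nbits) - 1)


packed = hide_payload(1234.5, 0xBEEF)

# A binary reader hands back the float64 unchanged, so the payload survives.
via_binary = packed

# A text stage that keeps only six significant digits (an assumed format)
# rounds the low-order bits away.
via_ascii = float(f"{packed:.6g}")

print(hex(read_payload(via_binary)))
print(hex(read_payload(via_ascii)))
```

Reading the binary values directly, as dbdreader does, avoids the lossy text stage altogether.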

kwilcox commented 1 year ago

How about an intermediary format other than the ascii that currently exists (like 1->N parquet files)? I would like to keep a table-like serialization for other types of processing, distribution, analysis, etc. The netCDF format was really meant as the format to submit to the IOOS Glider DAC and is always going to be lossy.

jr3cermak commented 1 year ago

Switching to parquet should be fine. Any shift will require an initial lift to get started. I think I get it now. You are also not enthralled with netcdf as the unifying backend. I had hopes after seeing activity at xpublish. So, the goal is to do away with convertDBD.sh and the ascii part, but to make the converter serve a better purpose rather than just throwing the ascii part away. Right now the general process is DBD->convertDBD.sh/dbdreader->ascii->netcdf->ERDDAP->data portal. The goal is DBD->parquet/dbdreader->parquet storage->netcdf. The data portal would at some point begin to pull from parquet storage. Is there an xpublish-type framework that would sit on top of parquet, as xpublish is envisioned for xarray/zarr?

kwilcox commented 1 year ago

Sounds like we are on the same page!

*.*bd files 
    -> dbdreader
        -> parquet files 
            -> profile netCDF files (via pocean) for glider dac
            -> backend analysis/viz (static plots, etc.)
            -> xpublish
                -> frontend analysis/viz (dynamic plots, etc.)

xpublish can sit on top of parquet files the same way it can sit on top of xarray datasets... with a little plugin magic. I have a proof of concept that uses duckdb on top of parquet files served through xpublish, and it is pretty nice for a quick API. That is the direction I'll be heading.

jr3cermak commented 1 year ago

Splendid. I will let you know when this PR is finished and ready to go. Then we can look into pyarrow a bit more.

jr3cermak commented 1 year ago

This PR is good to go anytime. gutils/tests/test_slocum.py::TestEchoMetricsTwo::test_echogram has code that will read the echogram and put the profile into a dataframe: numpy, pandas and xarray. This is an initial starting point for migrating away from convertDBD.sh. The teledyne.py module has morphed a few times as it has encountered various sources of code. It can probably use an overhaul when we get closer to tackling the convertDBD.sh script and friends later. In an earlier life, it also moved away from the slocum binaries. Class functions still lurk in there even after converting to the dbdreader module. Progress.

jr3cermak commented 1 year ago

Finally ready to move forward with updates to echometrics processing and the addition of parquet as intermediate storage. The pytests all pass when run manually.

2023-08-28 14:06:24,221 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-4.sbd,usf-bass-2016-253-0-4.tbd to usf_bass_2016_253_0_4_sbd.dat
2023-08-28 14:06:24,273 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-5.sbd,usf-bass-2016-253-0-5.tbd to usf_bass_2016_253_0_5_sbd.dat
2023-08-28 14:06:24,425 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-6.sbd,usf-bass-2016-253-0-6.tbd to usf_bass_2016_253_0_6_sbd.dat
PASSED
gutils/tests/test_watch.py::TestWatchClasses::test_gutils_netcdf_to_erddap_watch PASSED

================================================================================== warnings summary ==================================================================================
gutils/tests/test_nc.py: 3030 warnings
gutils/tests/test_slocum.py: 870 warnings
  /home/cermak/miniconda3/envs/gutils_py3_9/lib/python3.9/site-packages/compliance_checker/suite.py:185: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    latest_version = str(max(StrictVersion(v) for v in version_nums))

gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
  /home/cermak/miniconda3/envs/gutils_py3_9/lib/python3.9/site-packages/pyarrow/pandas_compat.py:354: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if _pandas_api.is_sparse(col):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================== 38 passed, 1 xpassed, 3905 warnings in 440.84s (0:07:20) ==============================================================

Workflow tests are not working.

Looks like the build process needs to know about dbdreader. Utilization of dbdreader will completely replace reliance on the x86 slocum binaries for decoding.

There is an odd dependency failure via conda/Docker:

E   ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/conda/lib/python3.9/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-39-x86_64-linux-gnu.so)
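One way to investigate an error like this is to list the `GLIBCXX_*` version tags that a given `libstdc++.so.6` actually contains, similar to running `strings` on the library and filtering for `GLIBCXX`. The sketch below is a diagnostic aid only, not a fix; the helper names `glibcxx_versions` and `provides` are hypothetical, and the sample bytes stand in for a real library file.

```python
import re

# Version tags appear as plain strings inside the shared library.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(blob: bytes) -> list:
    """Return the sorted, de-duplicated GLIBCXX versions found in `blob`."""
    found = {
        tuple(int(part) for part in match.group(1).decode().split("."))
        for match in _GLIBCXX.finditer(blob)
    }
    return sorted(found)


def provides(blob: bytes, wanted: str) -> bool:
    """True if `blob` advertises the requested version, e.g. "3.4.30"."""
    return tuple(int(part) for part in wanted.split(".")) in glibcxx_versions(blob)


# Stand-in for Path("/lib/x86_64-linux-gnu/libstdc++.so.6").read_bytes()
sample = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.29\x00GLIBCXX_3.4.9\x00"

print(glibcxx_versions(sample))
print(provides(sample, "3.4.30"))
```

If the system library lacks the required tag while a conda-provided copy has it, the loader is likely resolving the system copy first. That is a common cause of this message, though the thread does not confirm it is the cause here.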

This update has a hook to process glider data into parquet intermediate files. The parquet-enabled processing is a shade faster than the ascii method.

This update references issues #12, #24 and #26.

jr3cermak commented 1 year ago

This update to the PR includes a small fix for a problem discovered when trying to run the unit tests on other platforms with different versions of the OS and modules. Because of small differences in magic or in the way file operates, the shell must test for either data or ASCII to allow the unit tests to pass.

jr3cermak commented 1 year ago

From our standpoint, this is ready for implementation. I am not sure what needs to be done to fix the workflow unit tests. The unit tests pass when run manually on all the platforms we have been testing.