c-proof / pyglider

glider software
https://pyglider.readthedocs.io/
Apache License 2.0

use polars for seaexplorer data file load #120

Closed callumrollo closed 1 year ago

callumrollo commented 1 year ago

This is a big PR. More work is needed to check that it isn't changing the data, as the polars methods are not the same as the pandas ones.

Testing on our datasets has shown a ~10x speedup when processing large delayed-mode datasets, and a ~5x speedup with nrt data. I'll write some example code comparing the two methods, along with more tests. This is almost certainly a sub-optimal implementation of polars, as I'm directly aping the existing pandas flow.

Using polars also decreases the disk usage of intermediate products by using parquet rather than .nc. It also has substantially lower memory usage than pandas, so it should decrease overheads when processing large datasets.
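
For illustration, here's a minimal sketch (not the PR code) of the two intermediate round trips, with made-up column names and sizes:

import numpy as np
import polars as pl
import xarray as xr

# stand-in for one loaded raw SeaExplorer table
df = pl.DataFrame({
    "time": np.arange(100_000, dtype=np.int64),
    "Heading": np.random.rand(100_000),
    "Voltage": np.random.rand(100_000),
})

# polars path: the intermediate product is a parquet file
df.write_parquet("intermediate.parquet")
df_back = pl.read_parquet("intermediate.parquet")

# pandas/xarray path: the same table written as a netCDF intermediate
ds = xr.Dataset.from_dataframe(df.to_pandas().set_index("time"))
ds.to_netcdf("intermediate.nc")
ds_back = xr.open_dataset("intermediate.nc")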

This is designed to resolve #36

callumrollo commented 1 year ago

This is failing some tests at the moment. Some of those failures are expected (they use intermediate ncs, which are now parquet files); others look like errors with timestamp parsing, plus some other issues. I'll work on these.

callumrollo commented 1 year ago

Just got to resolve the optional sensor-specific coarsening now.

jklymak commented 1 year ago

I wonder if this needs to be its own method? I'd need to be convinced that parquet files are a better intermediate format than netcdf. We need to be able to look at the intermediate files to see what is in the raw data - what is the workflow for parquet files? Just polars? I don't think xarray handles them. Does the efficiency go away if you use netcdf and polars?

We recently had this issue with the raw Alseamar files and solved it by subsampling the redundant information, which reduced the raw files from 10 MB each to 40 kB.

callumrollo commented 1 year ago

These are good points, thanks Jody. We've tried subsampling the raw Alseamar files, but with the 16 Hz legato running for 3-week missions we still end up with huge datasets, and the load/merge step takes several hours per dataset.

I'm starting to profile the code now. polars saves time over pandas when loading and merging the data. Parquet files are quicker and more storage-efficient to write and read than ncs, but they do need to be readable as intermediate products. Reading them is very similar to pandas:

>>> df = pl.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
>>> df
shape: (107, 23)
┌─────────────────────┬──────────┬───────────────┬─────────┬─────┬────────┬─────────┬──────────┬──────┐
│ time                ┆ NavState ┆ SecurityLevel ┆ Heading ┆ ... ┆ AngPos ┆ Voltage ┆ Altitude ┆ fnum │
│ ---                 ┆ ---      ┆ ---           ┆ ---     ┆     ┆ ---    ┆ ---     ┆ ---      ┆ ---  │
│ datetime[μs]        ┆ i64      ┆ i64           ┆ f64     ┆     ┆ f64    ┆ f64     ┆ f64      ┆ i64  │
╞═════════════════════╪══════════╪═══════════════╪═════════╪═════╪════════╪═════════╪══════════╪══════╡
│ 2022-03-01 23:50:06 ┆ 117      ┆ 0             ┆ 152.82  ┆ ... ┆ -5.0   ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:11 ┆ 110      ┆ 0             ┆ 154.58  ┆ ... ┆ 0.5    ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:21 ┆ 110      ┆ 0             ┆ 151.92  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:31 ┆ 110      ┆ 0             ┆ 156.13  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...      ┆ ...           ┆ ...     ┆ ... ┆ ...    ┆ ...     ┆ ...      ┆ ...  │

To me this makes more sense: at the load/merge stage there is no metadata, so the bulkier netcdfs aren't necessary for these intermediate table-like files.

I'm working on a full demo at the moment. I'd like to make use of this speedup, but if it's incompatible with existing workflows I'll make it a separate function or just keep it for internal use at VOTO.

callumrollo commented 1 year ago

I've put together a rough notebook profiling the polars implementation against the current main branch. It shows the performance differences using sub and raw data, as well as what the intermediate parquet products look like when loaded in polars.

https://github.com/callumrollo/pyglider_profile/blob/main/profile_pyglider.ipynb

It's pretty ugly, but it shows the main points. I can work to make it portable/repeatable if desired.
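
(If you don't want to open the notebook: the timing comparison is roughly of this form, assuming the same intermediate table exists as both a parquet and a netCDF file; the paths are placeholders.)

import time
import polars as pl
import xarray as xr

t0 = time.perf_counter()
df = pl.read_parquet("intermediate.parquet")
t1 = time.perf_counter()
ds = xr.open_dataset("intermediate.nc").load()
t2 = time.perf_counter()

print(f"polars + parquet read: {t1 - t0:.3f} s")
print(f"xarray + netCDF read:  {t2 - t1:.3f} s")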

jklymak commented 1 year ago

OK, a 5-10x speedup is a big deal if you have 24 Hz data. If you do this, can you update the docs to explain what a parquet file is and give a quick how-to on opening them? I think we can assume pandas and xarray knowledge (we are writing to netcdf), but parquet probably needs a paragraph of explanation.

Final question: do parquet and pandas have any packaging issues? Does parquet work with a conda-installed pandas on the big three OSes? (Can't be worse than netcdf ;-)

truedichotomy commented 1 year ago

Would Zarr be a possible format option for pyglider?

https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314

https://www.azavea.com/blog/2022/09/22/benchmarking-zarr-and-parquet-data-retrieval-using-the-national-water-model-nwm-in-a-cloud-native-environment/

callumrollo commented 1 year ago

I've had a look and it turns out pandas is able to read and write parquet files, so users shouldn't ever need to interact with polars. I'll add a paragraph to the docs explaining it.
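
Roughly, the how-to in the docs would be along these lines, assuming pyarrow (or fastparquet) is available as the pandas parquet engine (the file name is the example from above):

import pandas as pd

df = pd.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
print(df.head())

# and an xarray view, if preferred
ds = df.set_index("time").to_xarray()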

I've not encountered problems adding polars to the build so far. I'll test it out on a Windows machine today to be sure, though. Polars has mature builds on PyPI and conda-forge.

callumrollo commented 1 year ago

No issues making conda environments with polars. I'm testing it on a range of our SeaExplorer datasets now. I'll mark this ready for review once I'm satisfied it's performing well in production.

callumrollo commented 1 year ago

I've tested this pretty extensively on our datasets now. I think it's good to go. @jklymak do you have any other requested changes?

jklymak commented 1 year ago

Give me a day or two to look it over. It's probably all good.

callumrollo commented 1 year ago

I've added some more comments, and a file is now added to badfiles if it fails to load. I have no idea how these changes have caused a couple of the slocum tests to start failing.
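
For context, the badfiles handling follows roughly this pattern (a sketch only, not the exact PR code; rawfiles is a placeholder list of paths):

import polars as pl

rawfiles = ["raw_file_0001.csv", "raw_file_0002.csv"]  # placeholder paths
frames, badfiles = [], []
for fname in rawfiles:
    try:
        frames.append(pl.read_csv(fname))
    except Exception:
        # skip files that fail to parse and report them at the end
        badfiles.append(fname)

if badfiles:
    print("Could not load:", badfiles)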

callumrollo commented 1 year ago

OK, I think we're good to go now. @jklymak has this met your requested changes? In future I'll make these kinds of changes as part of a deprecation cycle, as you suggest.

jklymak commented 1 year ago

@callumrollo feel free to self-merge; perhaps squash the commits you don't need, or just squash on merge. Thanks! We are about to try this out on some datasets that have been giving us problems. Ping @hvdosser.

richardsc commented 1 year ago

We have at least one dataset that has 16 Hz legato data (plus 1 Hz GPCTD), which would probably benefit from this change. Pinging @clayton to try this out on that mission.

jklymak commented 1 year ago

@richardsc @clayton This is now merged and on master. It's not yet released, so you will need to install from the development branch. Let us know if that is problematic. If it gets through a few more glider setups we should make a release, along with the changes to the slocum processing.