Closed callumrollo closed 1 year ago
This is failing some tests at the moment. Some of those are expected (they use intermediate ncs which are now parquet files) others look like some errors with timestamp parsing and some other issues. I'll work on these
Just got to resolve the optional sensor-sepcific coarsening now
I wonder if this needs to be its own method? I'd need to be convinced that parquet files are a better intermediate format than netcdf. We need to be able to look at the intermediate files to see what is in the raw data - what is the workflow for parquet files? Just polars? I don't think xarray handles them. Does the efficiency go away if you use netcdf and polars?
We recently had this issue with the raw Alseamar files and solved it by subsampling the redundant information and reduced the raw files down from 10Mb each to 40kb
These are good points, thanks Jody. We've tried subsampling the raw alseamar files, but with the 16 Hz legato running for 3 week missions, we still end up with huge datasets and the load/merge step is taking several hours per dataset.
I'm starting to profile the code now. polars makes time savings over pandas when loading and merging the data. parquet files are quicker and more storage efficient than ncs to write and read, but they do need to be readable as intermediate products. This is achieved in a very similar way to pandas:
>>> df = pl.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
>>> df
shape: (107, 23)
┌─────────────────────┬──────────┬───────────────┬─────────┬─────┬────────┬─────────┬──────────┬──────┐
│ time ┆ NavState ┆ SecurityLevel ┆ Heading ┆ ... ┆ AngPos ┆ Voltage ┆ Altitude ┆ fnum │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════════════════════╪══════════╪═══════════════╪═════════╪═════╪════════╪═════════╪══════════╪══════╡
│ 2022-03-01 23:50:06 ┆ 117 ┆ 0 ┆ 152.82 ┆ ... ┆ -5.0 ┆ 28.5 ┆ -1.0 ┆ 142 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:11 ┆ 110 ┆ 0 ┆ 154.58 ┆ ... ┆ 0.5 ┆ 28.5 ┆ -1.0 ┆ 142 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:21 ┆ 110 ┆ 0 ┆ 151.92 ┆ ... ┆ 0.5 ┆ 28.4 ┆ -1.0 ┆ 142 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:31 ┆ 110 ┆ 0 ┆ 156.13 ┆ ... ┆ 0.5 ┆ 28.4 ┆ -1.0 ┆ 142 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
to me this makes more sense, as at the load_merge stage there is no metadata to make the bulkier netcdfs necessary for these intermediate table-like files.
I'm working on a full demo at the moment. I'd like to make use of this speedup, but if it's incompatible with existing workflows I'll make it a separate function or just keep it for internal use at VOTO
I've put together a rough notebook profiling the polars implementation against the current main branch. It shows the performance differences using sub and raw data as well as showing what the intermediate parquet products look like when loaded in polars
https://github.com/callumrollo/pyglider_profile/blob/main/profile_pyglider.ipynb
It's pretty ugly, but shows the main points. I can work to make it portable//repeatable if desired
OK 5-10x is a big deal if you have 24 Hz data. If you do this, can you update the docs to explain what a parquet file is and a quick how-to on opening them? I think we can assume pandas and xarray knowledge (we are writing to netcdf), but parquet probably needs a paragraph of explanation.
Final question would be if parquet and pandas have any packaging issues? Does parquet work with a conda install pandas
on the big three OS's? (Can't be worse than netcdf ;-)
Would Zarr be a possible format option for pyglider?
I've had a look and it turns out pandas is able to read and write parquet files, so user shouldn't ever need to interact with polars. I'll add a paragraph to the docs explaining it.
I've not encountered problems adding polars to the build so far. I'll test it out on a windows machine today to be sure though. Polars has mature builds on PyPI and conda-forge
No issues making conda environments in pandas. I'm testing it on a range of our SeaExplorer datasets now. I'll mark ready for review once I'm satisfied it's performing well in production
I've tested this pretty extensively on our datasets now. I think it's good to go. @jklymak do you have any other requested changes?
Give me a day or two to look it over. Probably is all good.
I've added some more comments and used badfiles if a file fails the load. I have no idea how these changes have caused a couple of the slocum tests to start failing
OK I think we're good to go now. @jklymak has this met your requested changes? In future I'll make these kind of changes as part of a deprecation cycle as you suggest
@callumrollo feel free to self merge, perhaps squash commits that you don't need, or just squash on merge. Thanks! we are about to try this out on some data sets that have been giving us problems. ping @hvdosser
We have at least one dataset that has 16Hz legato data (plus 1Hz GPCTD), which would probably benefit from this change. Pinging @clayton to try this out on that mission.
@richardsc @clayton This is now merged, and on master. It's not yet released, so you will need to install from the development branch. Let us know if that is problematic. If it manages to get through a few more glider setups we should release as well as the changes to the slocum processing.
This is a big PR. More work needed to check that it's not changing data as the methods are not the same as pandas.
Testing on our datasets has shown a ~ 10 X speedup of processing large delayed mode datasets, and a ~5 X speedup with nrt data. I'll write some example code comparing the two methods along with more tests. This is almost certainly a sub-optimal implementation of polars, as I'm directly aping the existing flow for pandas.
Using polars also decreases disk usage of intermediate products by using parquet rather than .nc. It also has substantially lower memory usage than pandas, so should decrease overheads when processing large datasets.
This is designed to resolve #36