SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Refactor pandas save load and convert dtypes #3412

Closed alejoe91 closed 2 months ago

alejoe91 commented 2 months ago

We found out that zarr consolidation doesn't seem to play well with our way of saving/loading dataframes to zarr using xarray.

xarray was only used to save/load pandas dataframes to zarr. This PR replaces that by saving each column and the index directly to zarr. This is similar to how xarray saves to zarr, so it should be backward compatible (testing now).

To make sure we don't run into such problems in the future, I added a roundtrip test to the common extension tests that asserts that the reloaded data is the same as the original.
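For illustration, the column-wise save/load and the roundtrip check could look roughly like the sketch below. This is not the code from the PR: `save_dataframe`, `load_dataframe`, and the `__columns__` key are hypothetical names, and a plain dict stands in for a zarr group so the example runs without zarr installed.

```python
import pandas as pd


def save_dataframe(df, store):
    """Write the index and each column of `df` as separate arrays.

    `store` is a plain dict standing in for a zarr group (an assumption
    made to keep this sketch free of a zarr dependency).
    """
    store["index"] = df.index.to_numpy()
    for name in df.columns:
        store[name] = df[name].to_numpy()
    # Remember the column order so the frame can be rebuilt identically.
    store["__columns__"] = list(df.columns)


def load_dataframe(store):
    """Rebuild a DataFrame from the arrays written by `save_dataframe`."""
    columns = store["__columns__"]
    data = {name: store[name] for name in columns}
    return pd.DataFrame(data, index=store["index"])


# Made-up metric values, only used to exercise the roundtrip.
original = pd.DataFrame(
    {"snr": [4.2, 7.9, 1.3], "num_spikes": [120, 340, 15]},
    index=[0, 1, 5],
)
store = {}
save_dataframe(original, store)
reloaded = load_dataframe(store)

# Roundtrip check: raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(original, reloaded)
```

A check of this kind fails loudly as soon as a column, a dtype, or the index changes between saving and reloading.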

As suggested by @h-mayorquin in #3365, the generated and reloaded dataframes are also converted to numpy dtypes with the convert_dtypes function. We just have to make sure to call Series.to_numpy to cast pandas dtypes to numpy ones.
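A minimal sketch of that dtype handling, assuming convert_dtypes here refers to pandas' `DataFrame.convert_dtypes`: that method returns pandas extension dtypes (for example `Int64`), so each column needs an explicit `Series.to_numpy` call to get a plain numpy array back. The column names and values are made up, and the target dtypes are passed explicitly because the default result dtype of `to_numpy` on extension arrays differs between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"num_spikes": [120, 340, 15], "firing_rate": [1.5, 3.0, 0.2]}
)

# convert_dtypes() picks pandas extension dtypes that support pd.NA.
converted = df.convert_dtypes()
extension_dtype = str(converted["num_spikes"].dtype)

# to_numpy() with an explicit dtype casts back to plain numpy arrays.
arrays = {
    "num_spikes": converted["num_spikes"].to_numpy(dtype="int64"),
    "firing_rate": converted["firing_rate"].to_numpy(dtype="float64"),
}

for name, values in arrays.items():
    print(name, type(values).__name__, values.dtype)
```

Note that the explicit cast only works as written when a column has no missing values; columns containing `pd.NA` would need an `na_value` argument or a float target dtype.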

alejoe91 commented 2 months ago

> Since I haven't fully tested zarr yet, I want to make sure: do we have an appropriate pandas warning somewhere for users so they know they need pandas for these features? I know we have a warning for qualitymetrics; do we have one for templatemetrics?

We don't have warnings anywhere. If a user tries to compute template or quality metrics without pandas, it will throw an interpretable ModuleNotFoundError :)
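As a rough illustration of why no extra warning is needed: when an import is done lazily and the module is missing, Python raises a ModuleNotFoundError that names the missing module. The `lazy_import` helper and the module name below are hypothetical and only serve the example.

```python
import importlib


def lazy_import(name):
    """Import a module only when it is actually needed."""
    return importlib.import_module(name)


try:
    # Hypothetical module name, chosen so that the import fails.
    lazy_import("a_module_that_is_not_installed")
except ModuleNotFoundError as err:
    missing = err.name   # name of the module that could not be found
    message = str(err)   # human-readable error text
    print(message)
```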

alejoe91 commented 2 months ago

One last comment: for analyzers saved to zarr in version 0.101.0, the consolidation step was missing after the computation of each extension. I added a check and a consolidation step that raises a warning if it fails.

zm711 commented 2 months ago

Thanks Alessio!