calebweinreb / SNUB

MIT License
24 stars 4 forks source link

[Minor] data copy logic #12

Closed vigji closed 8 months ago

vigji commented 10 months ago

Since there is such a minimal data manipulation (mostly, just rasterisation of the spike times and traces sorting, if I am correct), I do not understand the need to copy all the files in a project data folder. I can see that some data types offer the option of not been copied, just linked, but not all of them. But this is not only an issue for space, also for processing reasons. For example, it would force the user into having to re-run project generation if they change anything in the data preprocessing (eg, spike-detection or calcium signal extraction) which can be annoying and a source of mistakes.

calebweinreb commented 9 months ago

This is a good point and closely related to another issue you raised (https://github.com/calebweinreb/SNUB/issues/11) about conflation of data and GUI parameters. The rationale for both of these design choices is that SNUB is almost entirely focused on visualization (as opposed to data processing/transformation) and designed to be as fast/responsive as possible. It is meant to be somewhat analogous to a plotting library, e.g. with plt.scatter replaced by snub.io.add_scatter (though a few data processing utils such as wrappers for UMAP or rastermap are included just for convenience. The rationale for re-saving data in a fixed, deterministic format is that it allows SNUB to be agnostic about the original format that the data comes in, and to avoid performing costly conversion and/or preprocessing-steps every time a dataset is loaded.

The primary use case we have in mind is that a person has collected data and begun to analyze it in jupyter. They are totally comfortable loading the data and manipulating it with python. They likely are already generating static versions of SNUB-like data views, such as aligned heatmaps and trace plots, and now would like a frictionless way to pan/zoom around these plots and keep all the data views linked together. So instead of a big matplotlib figure, they create a SNUB project. Just as the matplotlib figure is agnostic to any loading/preprocessing that has already happened, so are the SNUB i/o routines.

Let me know if this clarifies both issues, and if you think it could be explained better in the paper. And thank you for the careful review!

niksirbi commented 9 months ago

The primary use case we have in mind is that a person has collected data and begun to analyze it in jupyter. They are totally comfortable loading the data and manipulating it with python. They likely are already generating static versions of SNUB-like data views, such as aligned heatmaps and trace plots, and now would like a frictionless way to pan/zoom around these plots and keep all the data views linked together. So instead of a big matplotlib figure, they create a SNUB project. Just as the matplotlib figure is agnostic to any loading/preprocessing that has already happened, so are the SNUB i/o routines.

This clarifies it, at least for me. Perhaps you could put more emphasis on this point in the paper and in the GitHub README/docs.

Regarding data duplication during project creation, at least you provide the option to avoid that for videos (via the copy parameter). But I wonder whether it's worth having that set to False by default, since videos may be quite large. I would also consider providing the copy=False option for more data types, where possible, especially those that are likely to be large files.

vigji commented 8 months ago

I understand the point of avoiding repeating every time pre-processing steps; this cost would remain in the no-copy configuration. Maybe in practical scenarios most users would just duplicate. Still, I agree on giving the option for not copying the data; again, not just for the sake of space but also to avoid potentially pernicious duplications that hide changes in the source data (although maybe for most uses source data would not change enough to justify this concern).

calebweinreb commented 8 months ago

These are good points! To avoid unnecessary data duplication, I've set copy=False as the default for videos, and now give users the option to provide heatmap or spiking data as a path (which is not copied) rather than an array (see https://github.com/calebweinreb/SNUB/commit/bf7127aeed030c0fffbdccabf54ca41955625df2). This option is still not available for the other GUI elements (scatter plots, trace plots, etc.) because that would have required more substantive reorganization of how the data for those elements is stored and read by the GUI. Still, I think this addresses most of the space issue since heatmaps and spiking data are usually the heftiest files by an order of magnitude.

I also updated the JOSS manuscript by adding the text below to clarify that SNUB is akin to a plotting library, as noted in the discussion above.

The main intended user is a researcher who has collected data and started to analyze it (e.g. in python). They may already be generating static versions of the visualizations afforded by SNUB, such as aligned heatmaps and trace plots, and now would like a frictionless way to pan/zoom around these plots and keep all the data views linked together. Importantly, SNUB should be thought of as akin to a plotting library, rather than a data analysis tool.

niksirbi commented 8 months ago

That's great @calebweinreb! This issue is resolved as far as I'm concerned.

vigji commented 8 months ago

For me as well!