Saving FOOOF results - Githubissues

TomDonoghue commented 6 years ago

There could probably be some user friendly saving options. Building this into FOOOF itself would also give us a change to have some level of 'FOOOF data' standardization, making it easier to potentially share fooof results.

Options to add to FOOOF:

Add method to save FOOOF object to pickle.
Add method to save out to standard format (JSON or csv).

However, these, given the current organization of FOOOF would be implemented at a PSD by PSD level.

Options for dealing with multiple PSDs (probably the most common use case):

Add separate functions (without a new object) that allow for / support running FOOOF across PSDs (basically, that implement the loop we currently typically write out).
Change the current FOOOF object to support multiple PSDs.
Add a 'MultiplePSDFOOOF' object. This would be initialized with the same settings as FOOOF, but take in a list of PSDs that (with the same frequency vector, and range), and internally run FOOOF across the dataset. It would include methods to save out group results. (I imagine this at a subject level. There could also even be an even higher level 'GroupOfSubjects' object).

Some of the above ideas are not mutually exclusive.

Perhaps the first line of decision points are:

How much do we add / support on our end in terms of support for multiple PSDs (across locations and also potentially across subjects), as opposed to letting users loop it as they need.
How much do add / support on our end in terms of IO operations, as opposed to letting users return parameters and save/load as they need.
Assuming we add at least some saving: Which save format to use, and what exactly should be saved out.

Another note: The options above basically presume the current organization (a base object basically designed to run on a single PSD), is reasonable, and we should perhaps keep in mind that a larger refactor is perhaps more sensible.

Extension:

Note that something like the GroupObject, or similar solutions, could also support extra useful functionality, for example optional support for running in parallel, and options to, given standardized location files, plot FOOOF results across channels (presumably adding new viz dependencies)
Some of this would start collapsing into, and could pull from code already used / implemented in the MEG project, with minor tweaks for more generalizability for others applying this stuff to their own datasets.

Also: these points are not particularly linked to v0.1, but are rather more general, and can mostly be addressed much later.

TomDonoghue commented 6 years ago

Marshmallow: A potential option, instead of pickle, to save out objects, but into standard filetypes (json), as opposed to python byte code.

http://marshmallow.readthedocs.io/en/latest/why.html#why

rdgao commented 6 years ago

what about hdf5? It's directory-like structure fits naturally with embedded dicts for keeping track of fitting and output parameters, as well as the data itself. Works in Python and Matlab.

TomDonoghue commented 6 years ago

@rdgao To my knowledge, HDF5 is designed and optimized for large data, which is not our case here. I'm thinking JSON would be simpler, and is the more common option, and is natively supported in Python. Unless I'm missing something that HDF5 does better than JSON for this kind of data?

parenthetical-e commented 6 years ago

JSON and n-dim arrays don’t generally play well.

(Also OMG folks it’s single purpose tool that at its heart returns two tuples and a scalar, a data format may be umm a bit overkill).

(e)

On Oct 23, 2017, at 12:32 PM, Tom notifications@github.com wrote:

@rdgao To my knowledge, HDF5 is designed and optimized for large data, which is not our case here. I'm thinking JSON would be simpler, and is the more common option, and is natively supported in Python. Unless I'm missing something that HDF5 does better than JSON for this kind of data?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

rdgao commented 6 years ago

@parenthetical-e get outta here and go roll around in your pool of cash. (but yes, that's one of my complaints about JSON. sucks for ndarray)

additionally, I think it's more about the landscape of neuro data in general, than our particular tool: 1) JSON nested fields can be arbitrarily named, which is flexible but adds to the confusion (imo). hdf5 nested data is always referred to as a "group", while parameters at the current level are attributes that can be named flexibly.

2) Neurodata without border (the big open data initiative) uses hdf, might as well be consistent with ongoing efforts seeing how that's what the R24 proposes.

3) hdf5 scales better, esp if we want to include the option of saving the PSD along with their fitted parameters, since, eventually, the workflow may start from time series. Starting from hdf5 might be expensive/an overkill for the use case right now, but it will be harder to scale later on if we do decide to use JSON and need to save a nd array, or even just a ton of parameters.

TomDonoghue commented 6 years ago

I'm with Erik, in that we don't need a whole data format for this - all we want to do is save out a couple numbers.

HDF5 seems like way overkill. It's for big, homogenous data. We have small, inhomogeneous results files. It's an extra dependency, with extra complexity.

Points above: 1) I'm not sure what you mean here. 2) NWB is a data format, for big data. We shouldn't be proposing a whole data format, for small, variable results files. NWB is not necessarily the most common format for input data, so I don't think we should over-tune to it. 3) If/when the project scales we can revisit then. We don't know yet know what that will look like.

For right now, a JSON dump is a trivial save-out format, and doesn't preclude something more added later. It has added benefits of being 'native' in python, human readable, more portable, and web-ready. For the FOOOF results, we don't have (homogenous) ndarrays anyway, so I'm still unclear on what the benefit of HDF5 would be.

Although belied by the title - my main Q here was not about format, or rather that as a secondary point, to choosing an approach to running FOOOF across multiple PSDs, which is more of a current issue, in my eyes. Although somewhat tangential to 'just get FOOOF going', this feels like a fairly prominent piece of the API I'd rather get sorted well from the get-go, rather than retro-hack, or leave it to open for people to use in unhelpful ways.

EDIT: what I'm suggesting is:

We have a clean way that is offered to run FOOOF across a group of PSDs.
If we say 'send me the FOOOF results', there should be an as simple as possible, standardized thing that looks like.

rdgao commented 6 years ago

all I'm saying is, people use hdf5, and it's becoming the standard for sharing data. The single unit people also use hdf5 for their cluster stuff, several of the datasets on CRCNS uses this (not necessarily the NWB format). But this is not a huge deal if included with fooof are import/export functions that handle whatever data format.

If we're just saving out the parameters, then yes, you could even just use a csv. But if there was any situation where you want to save the data (PSD) along with the input/output parameters, like how fieldtrip handles data and parameters, then JSON is obviously not appropriate.

So really, depends on your intent i.e the main question. If someone wants to submit a 3D PSD (trial x channel x frequency), and if it gets handled inside fooof, then it needs to deal with multi-dimensional homogeneous array for the save out, even if it's just the parameters. Or save a list of list of params in json, that works too.

TomDonoghue commented 6 years ago

I guess I just don't think FOOOF (currently) is (or should be, in current format) in the business of saving out PSDs. In the name of being lean, and I/O agnostic creating, saving and loading PSDs are things outside of the scope of FOOOF - done by the user however they like.

So while I totally agree HDF5 works if & when we have homogenous nd-arrays, we don't currently have that. If/when that changes, we revisit.

People use a lot of stuff - picking HDF5 for that reason is over-tuning to a particular use case.

parenthetical-e commented 6 years ago

(e)

On Oct 23, 2017, at 4:39 PM, Tom notifications@github.com wrote:

I guess I just don't think FOOOF (currently) is (or should be, in current format) in the business of saving out PSDs. In the name of being lean, and I/O agnostic creating, saving and loading PSDs are things outside of the scope of FOOOF - done by the user however they like.

Yes.

So while I totally agree HDF5 works if & when we have homogenous nd-arrays, we don't currently have that. If/when that changes, we revisit.

People use a lot of stuff - picking HDF5 for that reason is over-tuning to a particular use case.

Yes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

rdgao commented 6 years ago

woops closed by accident. anyway, got it. I misunderstood then

TomDonoghue commented 6 years ago

Saving & group object stuff added in #42

TomDonoghue commented 6 years ago

FOOOFGroup and JSON saving added in #42

fooof-tools / fooof

Saving FOOOF results #40