RNO-G / mattak

RNO-G dataformats
1 stars 1 forks source link

Add dataset opts #25

Closed fschlueter closed 5 months ago

fschlueter commented 7 months ago

@cozzyd I created a PR for this branch. That makes it easier to review and discuss it. Can you provide some description for it?

fschlueter commented 7 months ago

@cozzyd I actually do not understand why it is necessary to implement something like DatasetOpt. So far the number of arguments passed to Dataset are manageable imo. Do you foresee much more configuration options which would make such a struct/class necessary? Also, does that maintain backwards compatibility with older files? (Is that something we are interested in at that point?)

cozzyd commented 7 months ago

This branch attempts to improve the interface for creating datasets, supporting additional use cases and future-proofing the C++ interface.

While we keep the old prototypes (and hopefully preserve API, and perhaps even ABI, compatibility for all relevant cases), we now prefer to create a mattak::Dataset with mattak::DatasetOptions, which will allow for additional settings in the future (for example for specifying how to load VoltageCalibration data) without become too cumbersome to use. Since C++ lacks named parameters for overriding defaults, it becomes difficult to manage past a few parameters. With a Options object, it's now possible to do things like this in C++ which would be much more cumbersome before:

   // If using an older compiler that doesn't support C++ designated initializers
   mattak::DatasetOptions opt;
   opt.verbose= true; 
   mattak::Dataset d(opt); 
   // If using a newer compiler that does support C++ designated initializers
   mattak::Dataset d ( { .verbose = true } ); 

This feature is not added to the Python interface since Python supports named parameters and duck typing, which simplifies evolution of the interface while maintaining API compatibility

We now support loading a named file (that is assumed to be in the combined.root) format by passing a file instead of a directory. We also support the concept of "preferred" files, which will override the default behavior (attempt to load full dataset, then attempt to load combined.root if that fails) by trying to load preferred.root if it exists (before reverting to default behavior). This allows preferring the combined.root file over the full waveform file if both exist, as well as allowing the option to create other types of subsamples in the combined.root format (for example, we can create a forced.root or calpulser.root, which will allow running over such datasets more efficiently, at the cost of extra disk space).

Since in python, the parameter name is part of the API (due to the ability to have named parameters), we keep data_dir called data_dir, even though it can now be a file. Perhaps we can also make the run/station number optional parameters in the Python interface, though that may look a bit clunky. Alternatively, we could create a new factory method DatasetFromFile( ) which might be a bit clearer to use?

Also, various other small fixes / improvements have been applied here opportunisticially, hopefully not breaking anything. mypy only complains about some small things now, which I'm not sure the best way to fix.