fjebaker / SpectralFitting.jl

✨🛰 Fast and flexible spectral fitting in Julia.
https://fjebaker.github.io/SpectralFitting.jl
GNU General Public License v3.0

Nomenclature and dataset API #53

Closed fjebaker closed 1 year ago

fjebaker commented 1 year ago

The difference between SpectralDataset and SimpleDataset is entirely ambiguous given only their names. I think it would be better if the nomenclature reflected what these structures actually are, and propose to rename:

- SpectralDataset → BinnedDataset
- SimpleDataset → Dataset

The relation to spectra is then also lifted, and we can instead use the Spectrum type to indicate what these containers are holding on to (cf. timeseries). Dataset and BinnedDataset could similarly be expanded to have many of the same fields, such that there would be a non-invertible relation from BinnedDataset to Dataset, which involves taking the midpoint of each bin.

Since the Spectrum is only storing channels and values, there is already a 1-to-1 correspondence, which is augmented by its container (i.e. BinnedDataset). The masking API would then similarly be defined for all AbstractDatasets.

Summary

AbstractData should be the struct which stores data only, whereas the AbstractDataset containers interpret the data in different ways. A container enriches the data with e.g. responses, ARFs, backgrounds, and provides the API the user interacts with. AbstractData is instead used only internally to help rationalize implementing new AbstractDatasets, and abstracts how the data itself is stored.
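The split might look something like the following sketch. All names here (Spectrum, Dataset, BinnedDataset, to_dataset, and their fields) are assumptions drawn from this thread, not the current SpectralFitting.jl API:

```julia
# Storage layer: AbstractData subtypes only hold raw numbers.
abstract type AbstractData end

struct Spectrum <: AbstractData
    channels::Vector{Int}
    values::Vector{Float64}
end

# Interpretation layer: AbstractDataset containers enrich the data
# (responses, ARFs, backgrounds, ...) and carry the user-facing API.
abstract type AbstractDataset end

struct Dataset{D<:AbstractData} <: AbstractDataset
    data::D
    x::Vector{Float64}            # one x value per data point
end

struct BinnedDataset{D<:AbstractData} <: AbstractDataset
    data::D
    bins_low::Vector{Float64}     # left bin edges
    bins_high::Vector{Float64}    # right bin edges
end

# The non-invertible BinnedDataset -> Dataset relation: collapse each
# bin to its midpoint, discarding the bin widths.
to_dataset(b::BinnedDataset) =
    Dataset(b.data, (b.bins_low .+ b.bins_high) ./ 2)
```

With bins [1, 2] and [2, 4], `to_dataset` would produce the midpoints `[1.5, 3.0]`.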

I think the dataset API should be planned here before these changes are made. Currently, the API is along the lines of

```julia
mask_domain!(dataset, f)
regroup!(dataset, grouping)

target_vector(dataset)   # target for the fitting
domain_vector(dataset)   # domain for the fitting
target_variance(dataset) # variance for the fitting
# + background versions
```

I propose to also add

```julia
get_mask(dataset)
quality_vector(dataset)
```
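For concreteness, implementing this API for a new dataset type might look like the following sketch; the MockDataset type and its fields are invented here for illustration, not part of the package:

```julia
# Hypothetical dataset type implementing the proposed accessor API.
struct MockDataset
    domain::Vector{Float64}
    target::Vector{Float64}
    variance::Vector{Float64}
    quality::Vector{Int}
    mask::BitVector
end

# target / domain / variance accessors respect the mask
domain_vector(d::MockDataset) = d.domain[d.mask]
target_vector(d::MockDataset) = d.target[d.mask]
target_variance(d::MockDataset) = d.variance[d.mask]

# the two proposed additions
get_mask(d::MockDataset) = d.mask
quality_vector(d::MockDataset) = d.quality

# mask_domain! excludes points failing a predicate over the domain
function mask_domain!(d::MockDataset, f)
    d.mask .&= f.(d.domain)
    d
end

ds = MockDataset([1.0, 2.0, 3.0], [5.0, 6.0, 7.0],
                 [0.1, 0.2, 0.3], [0, 0, 5], trues(3))
mask_domain!(ds, x -> x < 2.5)
# target_vector(ds) == [5.0, 6.0]
```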
fjebaker commented 1 year ago

@phajy what do you think?

phajy commented 1 year ago

This sounds like a very sensible approach. The different datasets that I can think of immediately include multi-wavelength spectra (e.g., radio, optical flux densities), binned X-ray spectra, time series, power spectra, time lags versus frequency, and time lags versus energy. I believe these all fit naturally into this framework with the flexibility for other datasets we haven't thought of yet.

Oh, another area we might want to consider for the (more distant) future is fitting datasets with spatial information, e.g., images, or spectral cubes (a spectrum in each pixel).

phajy commented 1 year ago

I've been thinking a bit more about this from the practical standpoint of trying to fit a model to multiple datasets. If this is an XSPEC model it is computed as integrated counts in discrete energy bins. We might want to fit this to multiple spectral datasets, some of which will be X-ray datasets already in the expected format, but others could be, e.g., radio or optical flux densities that are not naturally in these units, or integrated over bin widths. Options might be to 1) import these as a BinnedDataset, or 2) import these as a Dataset and evaluate the model as a Dataset with the appropriate unit conversions. Perhaps we could discuss this. But overall I think the changes originally proposed make sense.

P.S. I also think the distinction between AbstractData and AbstractDataset makes sense. P.P.S. This might also help with fitting when simultaneously applying the same model to a Dataset and a BinnedDataset.

fjebaker commented 1 year ago

These are good points, but I think they still fit in the proposed changes. The Dataset (maybe better to call it something that reinforces the fact that it is essentially just a bijective mapping between two arrays -- given some x, what is the y, etc.) and BinnedDataset are used and interacted with by the user, so that irrespective of what the underlying data is, the API is homogeneous.

The X-ray / optical / radio data is instead read in as a Spectrum or a BinnedSpectrum or some other structure, which provides various translations so that it can fit inside a BinnedDataset or Dataset, and models can always interact with data in the format they need.

Essentially spectra store the raw data as it is with some minimal accessor methods, and datasets give them "meaning" through their richer API, and optional combination with e.g. responses, and "know" if the data needs to be integrated or binned or whatever to work with a given model.
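A sketch of what that translation layer could look like, assuming illustrative type names (PointSpectrum, BinnedSpectrum, integrated_values) and a trapezoid rule for the point-to-bin conversion; none of this is the actual package API:

```julia
struct PointSpectrum          # e.g. radio/optical flux densities
    x::Vector{Float64}        # frequencies / energies
    flux::Vector{Float64}     # flux density at each x
end

struct BinnedSpectrum         # e.g. X-ray counts per channel
    bins_low::Vector{Float64}
    bins_high::Vector{Float64}
    counts::Vector{Float64}
end

# Translation layer: both spectra expose per-bin integrated values, so
# a model evaluated over discrete energy bins can treat them uniformly.
integrated_values(s::BinnedSpectrum) = s.counts

function integrated_values(s::PointSpectrum)
    # trapezoid rule between adjacent points (an assumption for this sketch)
    [(s.flux[i] + s.flux[i + 1]) / 2 * (s.x[i + 1] - s.x[i])
     for i in 1:length(s.x)-1]
end

ps = PointSpectrum([1.0, 2.0, 4.0], [3.0, 5.0, 1.0])
# integrated_values(ps) == [4.0, 6.0]
```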

fjebaker commented 1 year ago

You can then fit multiple AbstractDatasets, each with completely different underlying data, but the models receive exactly what they expect thanks to the translation that the dataset API provides.

phajy commented 1 year ago

Just a note about an unusual use case. How flexible do we want to be about the bins in the datasets? E.g., let's say we have a dataset that has two (different) radio flux densities at the same frequency. Do we want to force the user to create two separate datasets, or could SpectralFitting handle this seeming inconsistency without any problems? The data points might also not be sorted or contiguous. It is computationally equivalent to two separate datasets, but might be easier for the user to have as one dataset. Not an important issue, but something we can discuss / think about.
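For what it's worth, a point-wise fit statistic places no requirement on the domain being sorted, contiguous, or unique, so this could in principle just work. A toy sketch (all values and the model are illustrative):

```julia
domain   = [5.0, 5.0, 8.4]   # two measurements at the same frequency
target   = [1.2, 1.5, 0.9]
variance = [0.1, 0.2, 0.1]

model(nu) = 1.0 + 0.1 * nu   # toy model for illustration

# chi-squared is a sum over points, so ordering and duplicates are fine
chi2 = sum((target .- model.(domain)).^2 ./ variance)
```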