OpenReflectometry / scikit-reflectometry

SciPy toolkit for fusion plasma reflectometry diagnostics

DISCUSSION: Structure of the dataset files #2

Open daguiam opened 6 years ago

daguiam commented 6 years ago

We need to have a structure for the different raw datasets and where they are stored in the repository.

In this project, there are several kinds of datasets, such as raw measurements, example density profiles, real density profiles, turbulence measurements, and other reflectometry measurements. What do you think should be the correct structures for each of these?

I suggest separating the raw datasets between reflectometry techniques first. For now, we have: swept reflectometry measurements and fixed frequency measurements.

Directories

All datasets are stored in scikit-reflectometry/data/. The subdirectories may be:

data/raw/swept-frequency/
data/raw/fixed-frequency/
data/density-profiles/
data/turbulence/

Data structure

What is the size of each dataset? Is it a single sweep? How many points?

Should we store data in binary format, .csv, .mat files, or .json? JSON is a nice format: it maps naturally onto Python dicts, and we can add meta information such as the frequency range, sweep times, etc.
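For illustration, a JSON-backed dataset with a meta block could be handled roughly like this. This is only a sketch: the function names (`save_dataset_json`, `load_dataset_json`) and the meta fields are assumptions, not an agreed format.

```python
import json
import numpy as np

def save_dataset_json(path, data):
    """Serialize a dataset dict to JSON, converting numpy arrays to lists."""
    serializable = {
        key: value.tolist() if isinstance(value, np.ndarray) else value
        for key, value in data.items()
    }
    with open(path, "w") as f:
        json.dump(serializable, f)

def load_dataset_json(path):
    """Load a dataset dict from JSON, restoring list fields as numpy arrays."""
    with open(path) as f:
        raw = json.load(f)
    return {
        key: np.asarray(value) if isinstance(value, list) else value
        for key, value in raw.items()
    }

# Hypothetical example: signal arrays plus a "meta" block with sweep settings.
dataset = {
    "meta": {"frequency_range_GHz": [18.0, 26.5], "sweep_time_us": 25.0},
    "time": np.linspace(0.0, 25e-6, 5),
    "signal": np.zeros(5),
}
save_dataset_json("example_sweep.json", dataset)
restored = load_dataset_json("example_sweep.json")
```

The nice property is that the meta block travels with the arrays in one human-readable file, at the cost of larger files than a binary format.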

Functions

We should have the typical load_data and save_data functions. The loading functions understand the underlying dataset structure and should return Python dicts with the loaded data and meta information!

The data or example module should handle loading the raw datasets, such as:

data = skreflectometry.examples.raw_swept_reflectometry()
data['signal_I']   # in-phase signal
data['signal_Q']   # quadrature signal
data['signal']     # raw swept reflectometry interference signal;
                   # if there is an IQ signal, this should be complex
data['time']       # sweep time vector
data['frequency']  # sweep frequency vector

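As a rough illustration of that dict structure, such a loader could look like the sketch below. The synthetic signal and parameter values are placeholders (a real loader would read from data/raw/), and the keys follow the example above.

```python
import numpy as np

def raw_swept_reflectometry(n_points=1024):
    """Illustrative loader: returns a synthetic swept-frequency dataset
    in the proposed dict structure (a real loader would read data/raw/)."""
    time = np.linspace(0.0, 25e-6, n_points)         # sweep time vector [s]
    frequency = np.linspace(18e9, 26.5e9, n_points)  # sweep frequency [Hz]
    signal_i = np.cos(2 * np.pi * 1e5 * time)        # in-phase signal
    signal_q = np.sin(2 * np.pi * 1e5 * time)        # quadrature signal
    return {
        "signal_I": signal_i,
        "signal_Q": signal_q,
        "signal": signal_i + 1j * signal_q,  # complex IQ signal
        "time": time,
        "frequency": frequency,
        "meta": {"technique": "swept-frequency"},  # assumed meta field
    }

data = raw_swept_reflectometry()
```

Keeping the complex IQ signal alongside the separate I and Q arrays is redundant, but convenient for users who only want one or the other.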
guimarais commented 6 years ago

Having a

/data/turbulence

directory only makes sense if we are going to store and discuss turbulent spectra and post-processing tools for it, like peak finding, spectral widths, Doppler peak analysis and so forth.

I personally hate dictionaries for entry-level users; it is easier for them to think in numpy arrays. However, dictionaries are easier to maintain. Not only should we have load and save functions, but we also have to provide functions that convert numpy arrays from CSVs into our dictionary structure. Maybe it's a stretch, but to simplify I would make it:

data['sI']   # In-phase  signal
data['sQ']  # Quadrature signal
data['s']  # raw swept reflectometry interference signal, 
                # if there is an IQ signal, this should be complex
data['t'] # sweep time vector
data['f'] # sweep frequency vector
data['r'] # radial positions in machine coordinates
data['z'] # vertical positions in machine coordinates
data['rp'] # normalized flux coordinates rho_pol
data['n'] # density in 10^19m^-3

Note that everything is SI except for density.
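The CSV-to-array-to-dict conversion mentioned above might be sketched as follows. The column order and the helper name `arrays_to_dict` are assumptions for illustration only; a real converter would need to agree on the CSV layout first.

```python
import numpy as np

# Assumed column order: time, frequency, in-phase, quadrature.
CSV_COLUMNS = ("t", "f", "sI", "sQ")

def arrays_to_dict(array, columns=CSV_COLUMNS):
    """Convert a 2-D numpy array (e.g. from np.loadtxt on a CSV)
    into the proposed short-key dict structure."""
    data = {name: array[:, i] for i, name in enumerate(columns)}
    data["s"] = data["sI"] + 1j * data["sQ"]  # complex IQ signal
    return data

# Synthetic stand-in for np.loadtxt("sweep.csv", delimiter=",")
raw = np.column_stack([
    np.linspace(0.0, 1e-5, 4),     # t: sweep time [s]
    np.linspace(18e9, 20e9, 4),    # f: frequency [Hz]
    np.ones(4),                    # sI: in-phase signal
    np.zeros(4),                   # sQ: quadrature signal
])
data = arrays_to_dict(raw)
```

This keeps the entry path numpy-friendly (users load plain arrays) while the library internally standardizes on the dict structure.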

daguiam commented 6 years ago

The turbulence data might be included in the library. I think it should, eventually, since it is a reflectometry measurement. For the beginning of this project, I will personally focus on density profiles, since I am more comfortable with them.

Dictionaries are a part of Python and, even though they require more typing, I think the benefits are worth it. Also, I prefer to keep the variable naming as literal as possible, meaning signal_I or signal_quadrature instead of sI, sQ, etc. It makes reading the code easier imho. We may convert the dictionaries into classes as well, which could have the units embedded into the variables, for example. But that is secondary.
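That class idea could be sketched roughly like this; the class name, field set, and unit comments are all illustrative assumptions, not an agreed design.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SweptReflectometryData:
    """Illustrative container: each field documents its unit explicitly,
    so the dict keys become attributes with embedded unit information."""
    time: np.ndarray       # sweep time vector [s]
    frequency: np.ndarray  # sweep frequency vector [Hz]
    signal_I: np.ndarray   # in-phase signal [a.u.]
    signal_Q: np.ndarray   # quadrature signal [a.u.]

    @property
    def signal(self):
        """Complex IQ interference signal, built from I and Q."""
        return self.signal_I + 1j * self.signal_Q

    @classmethod
    def from_dict(cls, data):
        """Build the class from the dict structure discussed above."""
        return cls(time=data["time"], frequency=data["frequency"],
                   signal_I=data["signal_I"], signal_Q=data["signal_Q"])

d = {"time": np.zeros(3), "frequency": np.zeros(3),
     "signal_I": np.ones(3), "signal_Q": np.zeros(3)}
obj = SweptReflectometryData.from_dict(d)
```

A from_dict constructor would let the library keep dicts as the interchange format while offering the class as an optional, more self-documenting view.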