Gautzilla opened 1 month ago
I like the proposed new structure; I think it solves most of our issues with the folder names. I just have a few questions:

For me, all adjustment spectrograms can be put into the same folder, and they can also be deleted once the whole spectrogram generation is launched.
@MaelleTtrt
Easier question first:
- You replaced `time.csv` and `file_metadata.csv` with `audio.json`, and `metadata.csv` with `spectrum.json`?
The names are placeholders atm, but the idea is to make deserializable files: the `analysis.json` is a file that contains the (humanly readable and editable) information on the analysis, and which is understandable by OSEkit for creating objects (such as a `DatasetAnalysis` which you could recover later on).
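For instance, a minimal sketch of what this round trip could look like (the `DatasetAnalysis` fields and methods here are placeholders, not the actual OSEkit API):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetAnalysis:
    # Placeholder fields: whatever parameters describe an analysis.
    audio_duration: int  # in seconds
    sample_rate: int     # in Hz

    def to_json(self, path: str) -> None:
        # Dump a humanly readable and editable description of the analysis.
        with open(path, "w") as file:
            json.dump(asdict(self), file, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "DatasetAnalysis":
        # Recover the analysis object later on from its json file.
        with open(path) as file:
            return cls(**json.load(file))
```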
It may not be a good idea for the `file_metadata.csv`, because we want to keep a simple dictionary for tracking audio timestamps; or maybe we will want to create something like an `AudioSet` class that contains the audio metadata, plus methods for filtering it by timestamp or whatever. I guess I'll clarify all that as I progress in reformatting the package!
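Something like this rough sketch (every name here is hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AudioFile:
    path: str
    begin: datetime
    end: datetime


@dataclass
class AudioSet:
    files: list[AudioFile]

    def between(self, begin: datetime, end: datetime) -> "AudioSet":
        # Keep only the audio files that overlap the requested period.
        return AudioSet(
            [f for f in self.files if f.begin < end and f.end > begin]
        )
```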
- I don't understand where the LTAS data are in this structure?
This simple, quick question led to a complicated, long discussion, which in turn led to another draft structure, involving even more drastic changes to OSEkit 👺
Basically, here are the changes we discussed:

Moving the time period to the top level of the structure might help keep track of which time regions of the dataset have already been analyzed:
```
dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   └── 3600_128000_original
│       └── ...
├── 2023-01-02_01-00-00__2023-01-02_12-00-00
│   ├── 1800_128000
│   │   └── ...  # A1 and A2
│   └── 900_64000
│       └── ...  # C
└── 2023-01-03_01-00-00__2023-01-03_12-00-00
    └── 1800_128000
        └── ...  # B
```
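One nice side effect (a sketch, assuming the `t1__t2` folder naming shown above): the analyzed time regions can be recovered directly from the folder names.

```python
from datetime import datetime


def parse_time_region(folder_name: str) -> tuple[datetime, datetime]:
    # Folder names follow the t1__t2 pattern shown above, e.g.
    # "2023-01-01_00-00-00__2023-01-03_12-00-00".
    start, end = folder_name.split("__")
    fmt = "%Y-%m-%d_%H-%M-%S"
    return datetime.strptime(start, fmt), datetime.strptime(end, fmt)
```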
We would store LTAS as an analysis, but specifying `LTAS` instead of the audio duration (which would implicitly be something like `(t2 - t1) / Timedelta(seconds=1)`):

For example, I want to generate a LTAS over the whole example dataset time period, with a sr of 128 Hz, a time resolution of 30 minutes (that is, 128 * 1800 = 230400-sample-wide temporal windows), and `nfft=256`. Moreover, I want to generate a LTAS with the same parameters, only on the period covered by Analyses A1 & A2. This would lead to the following structure:
```
dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   ├── 3600_128000_original
│   │   ├── analysis.json
│   │   ├── data
│   │   ├── log
│   │   └── output
│   └── LTAS_128
│       ├── analysis.json
│       ├── log
│       └── output
│           └── 1800_256
│               ├── spectrogram
│               ├── matrix
│               └── spectrum.json
└── 2023-01-02_01-00-00__2023-01-02_12-00-00
    └── LTAS_128
        ├── analysis.json
        ├── log
        └── output
            └── 1800_256
                ├── spectrogram
                ├── matrix
                └── spectrum.json
```
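As a sanity check on the numbers above (a sketch; pandas is assumed only because of the `Timedelta` mention):

```python
from pandas import Timedelta, Timestamp

t1 = Timestamp("2023-01-01 00:00:00")
t2 = Timestamp("2023-01-03 12:00:00")

# Implicit audio duration of the whole-period LTAS, in seconds.
duration = (t2 - t1) / Timedelta(seconds=1)  # 216000.0 (2.5 days)

# A 30-minute time resolution at a 128 Hz sampling rate:
window_size = 128 * 1800  # 230400 samples per temporal window
```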
Replace `nfft` and `window_size` with `frequency_resolution` and `temporal_resolution`: as a time resolution of 20 ms is more obvious than a window size of 3840 samples at a sampling rate of 192 kHz, we might use these metrics primarily for creating the analyses?
This would imply some backstage checks: e.g., we should snap to `nfft`s that are powers of 2 whatever the given frequency resolution:

```python
nfft = int(sample_rate // frequency_resolution)
# Largest power of two not exceeding nfft; warn the user if we change
# the frequency_resolution so that nfft matches optimal_nfft.
optimal_nfft = 1 << (nfft.bit_length() - 1)
```
or consider overlap when computing the time window sizes if a given temporal resolution leads to very small window sizes, or whatever.
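For the temporal side, a minimal sketch of what such a check could look like (names are hypothetical, not OSEkit API):

```python
def window_size_from_temporal_resolution(
    sample_rate: float,
    temporal_resolution: float,
) -> int:
    """Convert a temporal resolution (in seconds) to a window size in samples.

    E.g. 20 ms at 192 kHz gives 3840 samples.
    """
    window_size = int(sample_rate * temporal_resolution)
    if window_size < 2:
        # Too fine a resolution cannot be honored by plain windowing;
        # overlap between windows would have to compensate instead.
        raise ValueError(
            f"Temporal resolution {temporal_resolution} s is too fine "
            f"for a sample rate of {sample_rate} Hz."
        )
    return window_size
```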
If these metrics appear to make more sense than the previous ones, we should still discuss how to include them in the directory structure, as the resolutions might be floating points: a LTAS could work with ~hours-long temporal resolutions, while an analysis looking for dolphin clicks would use temporal resolutions in the order of a millisecond. Would we risk adding dots to the folder names (🤢)? Should we note the resolution in milliseconds (and in millihertz for the frequency resolutions of campaigns that study whales??)
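One hypothetical way out: suffix the unit in the folder name and pick the unit so the value stays an integer, keeping dots out of folder names.

```python
def resolution_label(resolution_s: float) -> str:
    # Hypothetical naming scheme: use the first unit that yields an
    # integer value, so that folder names never contain dots.
    for factor, unit in ((1, "s"), (1_000, "ms"), (1_000_000, "us")):
        value = resolution_s * factor
        if value >= 1 and abs(value - round(value)) < 1e-9:
            return f"{round(value)}{unit}"
    return f"{round(resolution_s * 1_000_000)}us"


# resolution_label(1800)  -> "1800s"  (LTAS-like resolution)
# resolution_label(0.02)  -> "20ms"   (click-detection-like resolution)
```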
As discussed with @mathieudpnt and @PaulCarvaillo, we might keep features that risk breaking retrocompatibility for later, in a brighter future when OSEkit is reformatted and easier to maintain! ☀️
The current dataset directory structure suffers from some flaws. For example, running an analysis that differs from a previous one only in its time period requires overwriting the previous analysis.

In this issue, I try to expose these flaws by using an example dataset on which I run 4 analyses that differ by the audio parameters (time duration, sample rate) and/or by the fft parameters (in which case no reshaping of the audio files is needed). I'll first describe the analyses and the original dataset, then show the code snippets matching each analysis, and then the directory structure that results from these analyses.
Finally, I've added 2 draft directory structures: a draft modification of the existing structure, and a draft new structure.
What do you, as OSEkit users, think of these draft structures?
### Example

An original dataset, from which 4 analyses are run (with different start/end times than the original):

Original Dataset:

Analyses:

- Analysis A1
- Analysis A2
- Analysis B
- Analysis C
### Current directory structure

Problems:

- `t_start` and `t_stop`
- `processed`: could be replaced by `output`.
- `spectrogram` and `matrix` could fall into a `spectrum` upper level.

### Draft modifications of existing structure
Remarks

There are still some flaws in this structure:

- The `metadata.csv` name is used several times for different uses.
- `file_metadata.csv` and `timestamp.csv` contain redundant information; keep only `file_metadata.csv`?
- Replace `xxx_metadata.csv` files with `xxx.json` files that could be used for serializing python classes? (e.g., an `analysis_dataset.json` file in each analysis folder that can be parsed to a `Dataset` object in OSEkit).

### Draft new structure
- `dataset\audiolength_samplerate\tstart_tend\`: corresponds to one call to the reshaper module.
- `data` and `output` folders.