Project-OSmOSE / OSEkit

OSEkit is an open-source suite of tools written in Python and dedicated to the management and analysis of data in underwater passive acoustics.
https://osmose.ifremer.fr

Dataset directory structure #215

Open Gautzilla opened 4 days ago

Gautzilla commented 4 days ago

The current dataset directory structure suffers from some flaws. For example, running an analysis that differs from a previous one only in its time period requires overwriting the previous analysis.

In this issue, I try to expose these flaws by using an example dataset in which I run 4 analyses that differ by the audio parameters (time duration, sample rate) and/or by the fft parameters (in that case no reshaping of the audio files is needed). I'll first describe the analyses and the original dataset, then show the code snippets matching each analysis, and then the directory structure that results from these analyses.

Finally, I've added 2 draft directory structures:

What do you, as OSEkit users, think of these draft structures?

Example

An original dataset, from which 4 analyses are run:

| Analysis | Description |
|---|---|
| A1 | Different audio length than original; different start/end times than original |
| A2 | Same audio parameters as A1: no reshaping needed. Only fft parameters change |
| B | Different start/end times than A1 and A2: reshaping needed |
| C | Different audio parameters than A1, A2 and B: reshaping needed |

Original Dataset :

audio_file_length = 3_600
sampling_frequency = 128_000
t_start = Timestamp("2023-01-01 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

Analyses :

Analysis A1

# Different time period than original
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters than original
audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis A2

# Same time period and audio parameters as A1: audio files don't need to be reshaped.
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

# Only fft parameters differ from analysis A1

nfft = 1_024
window_size = 2_048
overlap = 50
zoom_level = 5
scale = 'log'

Analysis B

# Different time period: reshape needed.

t_start = Timestamp("2023-01-03 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis C

t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters: reshape needed.

audio_length = 900
sampling_frequency = 64_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Current directory structure

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ audio_1a.wav
    │   │   │   ├╴ audio_2a.wav
    │   │   │   ├╴ ...
    │   │   │   ├╴ metadata.csv
    │   │   │   └╴ timestamp.csv
    │   │   ├╴ 900_64000
    │   │   │   └╴ ...
    │   │   └╴ 3600_128000
    │   │       ├╴ audio_1.wav
    │   │       ├╴ audio_2.wav
    │   │       ├╴ ...
    │   │       ├╴ file_metadata.csv
    │   │       ├╴ metadata.csv
    │   │       └╴ timestamp.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ log
    └╴ processed
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrogram
            ├╴ 1800_128000
            │   ├╴ 1024_4096_20_linear
            │   │   ├╴ image
            │   │   │   ├╴ spectro_A1_1.png
            │   │   │   ├╴ spectro_A1_2.png
            │   │   │   └╴ ...
            │   │   ├╴ matrix
            │   │   └╴ metadata.csv
            │   └╴ 1024_2048_50_log
            │       └╴ ...
            └╴ 900_64000
                └╴ 1024_4096_20_linear
                    └╴ ...
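
A minimal sketch of why this layout leads to overwriting, assuming (as the folder names above suggest) that the reshaped-audio folder is named from the audio length and sampling frequency only. The `audio_folder` helper is hypothetical and is not part of OSEkit:

```python
from pathlib import PurePosixPath


def audio_folder(audio_length: int, sampling_frequency: int) -> PurePosixPath:
    """Hypothetical helper mirroring the current naming: <audio_length>_<sampling_frequency>."""
    return PurePosixPath("dataset/data/audio") / f"{audio_length}_{sampling_frequency}"


# A1 and B only differ by their time period, which is absent from the folder name.
analysis_a1 = audio_folder(audio_length=1_800, sampling_frequency=128_000)
analysis_b = audio_folder(audio_length=1_800, sampling_frequency=128_000)
analysis_c = audio_folder(audio_length=900, sampling_frequency=64_000)

print(analysis_a1 == analysis_b)  # True: B lands in the folder already used by A1
print(analysis_a1 == analysis_c)  # False: C gets its own folder
```

Since the time period is not encoded anywhere in the path, two analyses with identical audio parameters cannot coexist.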

Problems:

Draft modifications of existing structure

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │   │   ├╴ audio_1a.wav
    │   │   │   │   ├╴ audio_2a.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   ├╴ analysis_metadata.csv
    │   │   │   │   └╴ file_metadata.csv
    │   │   │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │   │   │       └╴ ...
    │   │   ├╴ 900_64000
    │   │   │   └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │       └╴ ...
    │   │   └╴ 3600_128000_original
    │   │       └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │   │           ├╴ audio_1.wav
    │   │           ├╴ audio_2.wav
    │   │           ├╴ ...
    │   │           ├╴ analysis_metadata.csv
    │   │           └╴ file_metadata.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ logs
    └╴ output
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrum
            ├╴ 1800_128000
            │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            │   │   ├╴ 1024_4096_20_0_linear
            │   │   │   ├╴ spectrogram
            │   │   │   │   ├╴ spectro_A1_1.png
            │   │   │   │   ├╴ spectro_A1_2.png
            │   │   │   │   └╴ ...
            │   │   │   ├╴ matrix
            │   │   │   └╴ spectrum_metadata.csv
            │   │   └╴ 1024_2048_50_5_log
            │   │       └╴ ...
            │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
            │       └╴ 1024_4096_20_0_linear
            │           └╴ ...
            └╴ 900_64000
                └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
                    └╴ 1024_4096_20_0_linear
                        └╴ ...
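
As an illustration, the folder names of this draft could be derived from the analysis parameters roughly as follows. This is only a sketch: `period_folder` and `spectrum_folder` are hypothetical helpers, and the naming patterns are read off the tree above rather than taken from OSEkit.

```python
from datetime import datetime
from pathlib import PurePosixPath

TIME_FORMAT = "%Y-%m-%d_%H-%M-%S"  # assumed from the folder names in the draft


def period_folder(t_start: datetime, t_stop: datetime) -> str:
    """Name of the time period folder: <start>__<stop>."""
    return f"{t_start.strftime(TIME_FORMAT)}__{t_stop.strftime(TIME_FORMAT)}"


def spectrum_folder(audio_length, sampling_frequency, t_start, t_stop,
                    nfft, window_size, overlap, zoom_level, scale) -> PurePosixPath:
    """Hypothetical path of the spectrum outputs of one analysis."""
    return (
        PurePosixPath("dataset/output/spectrum")
        / f"{audio_length}_{sampling_frequency}"
        / period_folder(t_start, t_stop)
        / f"{nfft}_{window_size}_{overlap}_{zoom_level}_{scale}"
    )


a1 = spectrum_folder(1_800, 128_000, datetime(2023, 1, 2), datetime(2023, 1, 2, 12),
                     1_024, 4_096, 20, 0, "linear")
b = spectrum_folder(1_800, 128_000, datetime(2023, 1, 3), datetime(2023, 1, 3, 12),
                    1_024, 4_096, 20, 0, "linear")

print(a1)
print(a1 != b)  # True: A1 and B no longer share a folder
```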

Remarks

There still are some flaws in this structure:

Draft new structure

dataset
    ├╴ 3600_128000_original
    │   └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ log
    │       └╴ output
    ├╴ 1800_128000
    │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   ├╴ analysis.json
    │   │   ├╴ data
    │   │   │   ├╴ audio
    │   │   │   │   ├╴ audio_1.wav
    │   │   │   │   ├╴ audio_2.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   └╴ audio.json
    │   │   │   └╴ auxiliary
    │   │   ├╴ output
    │   │   │   ├╴ 1024_4096_20_0_linear
    │   │   │   │   ├╴ spectrogram
    │   │   │   │   │   ├╴ spectrogram_1.png
    │   │   │   │   │   ├╴ spectrogram_2.png
    │   │   │   │   │   └╴ ...
    │   │   │   │   ├╴ matrix
    │   │   │   │   └╴ spectrum.json
    │   │   │   └╴ 1024_2048_50_5_log
    │   │   │       ├╴ spectrogram
    │   │   │       │   ├╴ spectrogram_1.png
    │   │   │       │   └╴ ...
    │   │   │       ├╴ matrix
    │   │   │       └╴ spectrum.json
    │   │   └╴ log
    │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ output
    │       │   └╴ 1024_4096_20_0_linear
    │       │       ├╴ spectrogram
    │       │       │   ├╴ spectrogram_1.png
    │       │       │   ├╴ spectrogram_2.png
    │       │       │   └╴ ...
    │       │       ├╴ matrix
    │       │       └╴ spectrum.json
    │       └╴ log
    └╴ 900_64000
        └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            ├╴ analysis.json
            ├╴ data
            │   ├╴ audio
            │   │   ├╴ audio_1.wav
            │   │   ├╴ audio_2.wav
            │   │   ├╴ ...
            │   │   └╴ audio.json
            │   └╴ auxiliary
            ├╴ output
            │   └╴ 1024_4096_20_0_linear
            │       ├╴ spectrogram
            │       │   ├╴ spectrogram_1.png
            │       │   ├╴ spectrogram_2.png
            │       │   └╴ ...
            │       ├╴ matrix
            │       └╴ spectrum.json
            └╴ log

MaelleTtrt commented 2 days ago

I like the proposed new structure, I think it solves most of our issues with the folder names. I just have a few questions:

For me, all adjustment spectrograms can be put into the same folder, and they can also be deleted once the whole spectrogram generation is launched.

Gautzilla commented 2 days ago

@MaelleTtrt

Easier question first:

  • You replaced time.csv and file_metadata.csv with audio.json and metadata.csv with spectrum.json?

The names are placeholders for now, but the idea is to make the files deserializable: analysis.json contains the (human-readable and editable) information on the analysis, in a form that OSEkit can read back to create objects (such as a DatasetAnalysis which you could recover later on).

It may not be a good idea for the file_metadata.csv because we want to keep a simple dictionary for tracking audio timestamps, or maybe we will want to create something like an AudioSet class that contains the audio metadata, plus methods for filtering them by timestamp or whatever. I guess I'll clarify all that as I progress in reformatting the package!
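
To make the "deserializable file" idea more concrete, here is a rough sketch of a round trip between an object and the content of an analysis.json. The `AnalysisParameters` class and its field names are placeholders invented for this example, not an existing OSEkit API:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnalysisParameters:
    """Hypothetical container for what analysis.json could hold."""

    t_start: str
    t_stop: str
    audio_length: int
    sampling_frequency: int

    def to_json(self) -> str:
        # Indented so that the file stays readable and editable by hand.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "AnalysisParameters":
        return cls(**json.loads(text))


analysis_a1 = AnalysisParameters(
    t_start="2023-01-02 00:00:00",
    t_stop="2023-01-02 12:00:00",
    audio_length=1_800,
    sampling_frequency=128_000,
)

serialized = analysis_a1.to_json()  # what analysis.json could contain
restored = AnalysisParameters.from_json(serialized)
print(restored == analysis_a1)  # True: the analysis is recovered from the file content
```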

  • I don't understand where are the LTAS data in this structure ?

This simple, quick question led to a complicated, long discussion, which in turn led to another draft structure, involving even more drastic changes to OSEkit 👺

Basically, here are the changes we discussed:

Time period is moved above audio parameters in the structure:

This might help keep track of which time regions of the dataset have already been analyzed.

dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   └── 3600_128000_original
│       └── ...
├── 2023-01-02_00-00-00__2023-01-02_12-00-00
│   ├── 1800_128000
│   │   └── ... # A1 and A2
│   └── 900_64000
│       └── ... # C
└── 2023-01-03_00-00-00__2023-01-03_12-00-00
    └── 1800_128000
        └── ... # B
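
With the time period at the top level, checking whether an instant has already been analyzed could boil down to parsing the first-level folder names. A sketch under that assumption, with hypothetical helpers and the folder name format taken from the drafts in this issue:

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d_%H-%M-%S"  # assumed from the folder names in the drafts


def parse_period(folder_name: str) -> tuple[datetime, datetime]:
    """Recover (t_start, t_stop) from a '<start>__<stop>' folder name."""
    start, stop = folder_name.split("__")
    return datetime.strptime(start, TIME_FORMAT), datetime.strptime(stop, TIME_FORMAT)


def is_analyzed(instant: datetime, period_folders: list[str]) -> bool:
    """True if the instant falls within at least one analyzed time period."""
    return any(start <= instant <= stop for start, stop in map(parse_period, period_folders))


# Time period folders of the reshaped analyses (the original dataset folder is left out).
period_folders = [
    "2023-01-02_00-00-00__2023-01-02_12-00-00",  # A1, A2 and C
    "2023-01-03_00-00-00__2023-01-03_12-00-00",  # B
]

print(is_analyzed(datetime(2023, 1, 2, 6), period_folders))   # True
print(is_analyzed(datetime(2023, 1, 2, 18), period_folders))  # False
```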

Store LTAS in the time period root

LTAS would be stored as an analysis, but specifying LTAS instead of the audio duration (which would implicitly be something like (t2-t1)/Timedelta(seconds = 1)):

For example, I want to generate a LTAS with a sr of 128 Hz over the whole example dataset time period, a time resolution of 30 minutes (that is, 128 * 1800 = 230400-sample-wide temporal windows), and nfft=256. Moreover, I want to generate a LTAS with the same parameters, only on the period covered by Analyses A1 & A2. This would lead to the following structure:

dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   ├── 3600_128000_original
│   │   ├── analysis.json
│   │   ├── data
│   │   ├── log
│   │   └── output      
│   └── LTAS_128
│       ├── analysis.json
│       ├── log
│       └── output
│           └── 1800_256
│               ├── spectrogram
│               ├── matrix
│               └── spectrum.json
└── 2023-01-02_00-00-00__2023-01-02_12-00-00
    └── LTAS_128
        ├── analysis.json
        ├── log
        └── output
            └── 1800_256
                ├── spectrogram
                ├── matrix
                └── spectrum.json
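
The numbers behind the LTAS example can be checked with a few lines. This sketch assumes a 128 Hz sampling rate (as in the LTAS_128 folder name), a 1800 s temporal resolution, and the whole dataset period running from 2023-01-01 00:00:00 to 2023-01-03 12:00:00 as in the folder names; it uses the standard library `timedelta` in place of the pandas `Timedelta` mentioned above:

```python
from datetime import datetime, timedelta

t_start = datetime(2023, 1, 1, 0, 0, 0)
t_stop = datetime(2023, 1, 3, 12, 0, 0)

# Implicit "audio duration" of an LTAS analysis, in seconds.
duration_s = (t_stop - t_start) / timedelta(seconds=1)

sample_rate = 128              # Hz
temporal_resolution_s = 1_800  # 30 minutes

window_samples = sample_rate * temporal_resolution_s  # samples per temporal window
n_windows = int(duration_s // temporal_resolution_s)  # number of full windows in the LTAS

print(duration_s)      # 216000.0
print(window_samples)  # 230400
print(n_windows)       # 120
```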

Replace nfft and window_size with frequency_resolution and temporal_resolution

As a time resolution of 20 ms is more obvious than a window size of 3840 samples at a sampling rate of 192 kHz, we might use these metrics primarily for creating the analyses?

This would imply some behind-the-scenes checks: we should, e.g., snap nfft to a power of 2 whatever the requested frequency resolution:

    nfft = int(sample_rate // frequency_resolution)
    optimal_nfft = 1 << (int(nfft).bit_length() - 1) # warn the user if we change the frequency_resolution so that nfft matches optimal_nfft

or consider overlap in the computing of the time window sizes if a given temporal resolution leads to very small window sizes or whatever.
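
A slightly more complete sketch of these checks, wrapping the two lines above in a function that warns when the frequency resolution has to change. The function names are hypothetical, and the choice of rounding down to the previous power of 2 simply follows the snippet above:

```python
import warnings


def nfft_from_frequency_resolution(sample_rate: int, frequency_resolution: float) -> int:
    """Return the largest power of 2 not exceeding sample_rate // frequency_resolution."""
    nfft = int(sample_rate // frequency_resolution)
    if nfft < 1:
        raise ValueError("frequency_resolution must not exceed sample_rate")
    optimal_nfft = 1 << (nfft.bit_length() - 1)
    if optimal_nfft != nfft:
        # Warn the user that the frequency resolution was changed to get a power-of-2 nfft.
        warnings.warn(
            f"nfft set to {optimal_nfft}: frequency resolution is "
            f"{sample_rate / optimal_nfft} Hz instead of {frequency_resolution} Hz"
        )
    return optimal_nfft


def window_size_from_temporal_resolution(sample_rate: int, temporal_resolution: float) -> int:
    """Number of samples covered by one temporal window (overlap is ignored here)."""
    return round(sample_rate * temporal_resolution)


print(window_size_from_temporal_resolution(192_000, 0.02))  # 3840
print(nfft_from_frequency_resolution(128_000, 125))         # 1024, no warning
print(nfft_from_frequency_resolution(192_000, 100))         # 1024, warns: 187.5 Hz instead of 100 Hz
```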

If these metrics appear to make more sense than the previous ones, we should still discuss how to include them in the directory structure, as the resolutions might be floating-point values: a LTAS could work with ~hours-long temporal resolutions, and an analysis looking for dolphin clicks with temporal resolutions on the order of a millisecond. Would we risk adding dots in the folder names (🤢)? Should we note the resolution in milliseconds (and in millihertz for the frequency resolutions of campaigns that study whales??)
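
One possible answer, sketched here only to feed the discussion: express both resolutions as integers in milli-units, so that no dot ends up in the folder names. The `to_milli` and `resolution_folder` helpers and the `<ms>ms_<mHz>mHz` pattern are made up for this example:

```python
def to_milli(value: float) -> int:
    """Express a resolution (in s or Hz) as an integer number of ms or mHz."""
    scaled = value * 1_000
    rounded = round(scaled)
    if abs(scaled - rounded) > 1e-6:
        raise ValueError(f"{value} cannot be expressed as a whole number of milli-units")
    return rounded


def resolution_folder(temporal_resolution_s: float, frequency_resolution_hz: float) -> str:
    """Hypothetical dot-free folder name: <temporal resolution>ms_<frequency resolution>mHz."""
    return f"{to_milli(temporal_resolution_s)}ms_{to_milli(frequency_resolution_hz)}mHz"


print(resolution_folder(1_800, 0.5))  # 1800000ms_500mHz (LTAS-like, 30 min windows)
print(resolution_folder(0.001, 125))  # 1ms_125000mHz (click-like, 1 ms windows)
```

The obvious drawback is that the names get long and harder to read for the common cases, so this would need to be weighed against simply allowing a restricted set of units.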