Project-OSmOSE / datarmor-toolkit

This repo gathers all our analytical code for processing big (mostly audio) data (e.g. related to machine learning, ambient noise analysis…)

bug with list_audio_to_process when computing spectros on an existing analysis dataset #58

Open · cazaudo opened this issue 1 month ago

cazaudo commented 1 month ago

When the reshaping has already been done (which prints the expected message: "It seems these spectrogram parameters are already initialized. If it is an error or you want to rerun the initialization, add the force_init argument."), the dataset instance attribute list_audio_to_process takes its list from the original dataset and NOT from the analysis dataset as desired.

My very dirty fix for now: before calling generate_spectro, I define this variable as follows:

import glob, os
dataset.list_audio_to_process = [os.path.basename(x) for x in glob.glob('/home/datawork-osmose/dataset/boussole_MERMAID_v2/data/audio/60_32000/*wav')]

Gautzilla commented 1 month ago

I think this issue has been resolved with PR #57.

When you call generate_spectro, the spectrogram metadata csv file is now created before the job files are built:

    dataset.prepare_paths()
    spectrogram_metadata_path = dataset.save_spectro_metadata(False)

    for batch in range(dataset.batch_number):
        ...

The path to this metadata csv is passed as an argument to the qsub_spectrogram_generator_pkg.py script:

script_args=f"--dataset-path {dataset.path} "
            f"--dataset-sr {dataset.dataset_sr} "
            f"--batch-ind-min {i_min} "
            f"--batch-ind-max {i_max} "
            f"--spectrogram-metadata-path {spectrogram_metadata_path} " # HERE
            f"{'--overwrite ' if overwrite else ''}"
            f"{'--save-for-LTAS ' if save_welch else ''}"
            f"{'--save-matrix ' if save_matrix else ''}",

And the Spectrogram object in this script is instantiated from the updated metadata csv (line 58):

dataset = Spectrogram.from_csv(dataset_path = args.dataset_path, metadata_csv_path = args.spectrogram_metadata_path)

I'll let you confirm that these changes solve your issue before closing it!

cazaudo commented 2 weeks ago

I have just tested it again and I don't think the problem has been solved: the batch_size is derived from nber_files_to_process (which is itself derived from dataset.list_audio_to_process, l. 215), independently of spectrogram_metadata_path, which is used too late.

So even if the dataset path is modified, the number of files to be processed will be equal to the number of original files and not to the number of reshaped ones.
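
Schematically, the ordering looks like this to me (not the actual toolkit code, just an illustration using the names above):

nber_files_to_process = len(dataset.list_audio_to_process)  # still the ORIGINAL audio list at this point
batch_size = nber_files_to_process // dataset.batch_number  # so the batches only cover the original files
# the spectrogram metadata csv (carrying the reshaped duration / sample rate) is only read later,
# inside qsub_spectrogram_generator_pkg.py, once the batches have already been built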

Quick fix: dataset.list_audio_to_process should be redefined based on dataset.spectro_duration and dataset.dataset_sr somewhere before it is used in generate_spectro, as sketched below.
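
i.e. something along these lines (untested sketch; the analysis folder layout is assumed from the {duration}_{samplerate} naming used elsewhere in the toolkit):

from pathlib import Path

# point at the reshaped analysis audio rather than the original audio
analysis_audio = Path(dataset.path) / "data" / "audio" / f"{dataset.spectro_duration}_{dataset.dataset_sr}"
dataset.list_audio_to_process = sorted(p.name for p in analysis_audio.glob("*.wav"))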

Gautzilla commented 2 weeks ago

I'm not sure I understand the problem correctly, because I can't reproduce it.

Here's a short example where I use a small dataset, which I initialize and then run three spectrogram generations on (detailed below).

Regarding "so even if the dataset path is modified": it seems that our workflows differ here. I think you don't have to change the dataset path.

With the following workflow, everything seems to work as intended:

Dataset initialization

Original files: 3600 s long, sampled at 128 kHz (the 3600_128000 folder below)

Analysis:

dataset.spectro_duration = 600  # seconds
dataset.dataset_sr = 22_000  # Hz
datetime_begin = "2023-04-05T14:49:06+0000"
datetime_end = "2023-04-05T15:29:06+0000"

When dataset.initialize() is called, the analysis folder is created as intended:

small_dataset
└── data
    └── audio
        ├── 600_22000
        │   ├── 2023_04_05_14_49_06.wav
        │   ├── 2023_04_05_14_59_06.wav
        │   ├── 2023_04_05_15_09_06.wav
        │   ├── 2023_04_05_15_19_06.wav
        │   ├── metadata.csv
        │   └── timestamp.csv
        └── 3600_128000

Spectrogram generation:

First generation: covers the whole duration of the analysis:

dataset.nfft = 1_024
dataset.window_size = 4_096
dataset.overlap = 10
dataset.concat = True

generate_spectro(
    dataset=dataset,
    path_osmose_dataset=path_osmose_dataset,
    write_datasets_csv_for_aplose=write_datasets_csv_for_aplose,
    overwrite=True,
    save_matrix=False,
    save_welch=False,
    datetime_begin=datetime_begin,
    datetime_end=datetime_end,
)

Second generation: different parameters, covering a shorter time period:

datetime_begin = "2023-04-05T14:49:00+0000"
datetime_end = "2023-04-05T15:00:00+0000" # changed end datetime
dataset.nfft = 512 # changed nfft
dataset.window_size = 4_096
dataset.overlap = 10
dataset.concat = True

generate_spectro(...)

Third generation: goes back to the original dataset audio shape:

datetime_begin = "2023-04-05T14:49:00+0000"
datetime_end = "2023-04-05T15:49:00+0000"
dataset.spectro_duration = 3600  # seconds
dataset.dataset_sr = 128_000  # Hz
dataset.nfft = 512

generate_spectro(...)

Output

The spectrograms are created as intended for the requested generations (the weird names of the spectrograms in the 3600_128000 folder just come from the fact that the original audio files were not renamed):

processed
└── spectrogram
    ├── 600_22000
    │   ├── 512_4096_10_linear
    │   │   ├── image
    │   │   │   ├── 2023_04_05_14_49_06_1_0.png
    │   │   │   └── 2023_04_05_14_59_06_1_0.png
    │   │   ├── matrix
    │   │   └── metadata.csv
    │   └── 1024_4096_10_linear
    │       ├── image
    │       │   ├── 2023_04_05_14_49_06_1_0.png
    │       │   ├── 2023_04_05_14_59_06_1_0.png
    │       │   ├── 2023_04_05_15_09_06_1_0.png
    │       │   └── 2023_04_05_15_19_06_1_0.png
    │       ├── matrix
    │       └── metadata.csv
    └── 3600_128000
        └── 512_4096_10_linear
            ├── image
            │   ├── 7189.230405144906_1_0.png
            │   └── 7189.230405154906_1_0.png
            ├── matrix
            └── metadata.csv

EDIT

I just noticed the 7189.230405154906_1_0.png appears to start later than the specified datetime_end = "2023-04-05T15:49:00+0000" timestamp.
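
For reference, this is how I decode that raw name (assuming the digits after the dot encode the recording start time as yymmddHHMMSS, which matches the renamed files in the 600_22000 folder):

from datetime import datetime

raw_name = "7189.230405154906_1_0.png"
start = datetime.strptime(raw_name.split(".")[1].split("_")[0], "%y%m%d%H%M%S")
print(start)  # 2023-04-05 15:49:06, i.e. 6 s after the requested datetime_end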

I thought this might have been a lead to your bug (maybe it created 2 spectrograms because the previous batch had 2 spectrograms too?), but further tests failed to confirm that.

I've reversed the order of the previous analyses (first a short time period, then the full time period): the batch number increases correctly, and both spectrogram generations cover the requested time periods.

cazaudo commented 1 week ago

I have created a test dataset on my side too so you can reproduce the problem: you can use the notebook /home/datawork-osmose/spectrogram_generator_issue58.ipynb on the dataset CINMS_C_test, using the osmose kernel (btw, is it up to date?).

Actually, even when starting the analysis from scratch, i.e. with only the folder of original audio files (which is outside the scope of this issue, but never mind), the workflow does not behave as expected, as you can see below: 18341 audio files reshaped, but only 855 spectrograms generated (i.e. the number of original audio files).

ls /home/datawork-osmose/dataset/DCLDE2015_HF/CINMS_C_test/data/audio/60_32000/*wav | wc -l
18341

ls /home/datawork-osmose/dataset/DCLDE2015_HF/CINMS_C_test/processed/spectrogram/60_32000/1024_1024_0_linear/image/*png | wc -l
855

cazaudo commented 1 week ago

Note also the bug in the monitor_job(dataset) cell of this notebook.

Also, the file count in the progress bar is wrong, e.g. Audio file preparation: ONGOING (20331 / 536)

Gautzilla commented 1 week ago

We're having trouble accessing Datarmor from ENSTA's network. I'll take a look as soon as the connection is back.

cazaudo commented 6 days ago

Same bug observed on another dataset; reproduce it with /home/datawork-osmose/bug_repro_spectro_generator.ipynb

datarmor0 dcazau/osmose-toolkit% ls /home/datawork-osmose/dataset/glider_WHOI_2014_we04/data/audio/53_31000/*wav | wc -l
13379

datarmor0 dcazau/osmose-toolkit% ls /home/datawork-osmose/dataset/glider_WHOI_2014_we04/processed/spectrogram/53_31000/1024_1024_0_linear/image/*png | wc -l
6081

Gautzilla commented 3 days ago

@cazaudo The problem should be fixed in the version that you can find in #64.

However, there seems to be a problem with the reshaping done in the glider_WHOI_2014_we04 dataset:

With the updated version of the toolkit, the 53_31000/1024_1024_0_linear spectrogram generation leads to 12111 spectrograms out of 13379 audio files.

The problem appears to be that there are only 12111 files listed in the timestamp.csv file. Did you notice any error or warning during the dataset.initialize() step?

There are 1268 missing files, which seem to belong to chunks scattered all along the dataset time period:

from pathlib import Path

import pandas as pd

audio_folder = Path(r"/home/datawork-osmose/dataset/glider_WHOI_2014_we04/data/audio/53_31000")

timestamps = pd.read_csv(audio_folder / "timestamp.csv")
files = [p.name for p in audio_folder.glob("*.wav")]

t_files = timestamps["filename"].to_list()

missing_files = sorted(f for f in files if f not in t_files)

print("\n".join(missing_files[::50]))  # show every 50th missing file name
"""
2014_12_03_20_24_57.wav
2014_12_10_05_01_51.wav
2014_12_10_06_41_50.wav
2014_12_10_08_21_50.wav
2014_12_10_10_01_50.wav
2014_12_10_11_41_50.wav
2014_12_10_13_21_50.wav
2014_12_10_15_01_50.wav
2014_12_10_16_41_50.wav
2014_12_10_18_21_50.wav
2014_12_10_20_01_50.wav
2014_12_10_21_41_50.wav
2014_12_10_23_21_50.wav
2014_12_11_01_01_50.wav
2014_12_11_05_37_50.wav
2014_12_11_07_17_50.wav
2014_12_11_08_57_50.wav
2014_12_11_10_37_50.wav
2014_12_11_12_17_50.wav
2014_12_11_13_57_50.wav
2014_12_11_15_37_50.wav
2014_12_19_15_21_50.wav
2014_12_19_17_01_50.wav
2014_12_19_18_41_50.wav
2014_12_19_20_21_50.wav
2014_12_19_22_01_50.wav
"""

EDIT: further analysis of your dataset shows something weird in your original data: most files are sampled at 60 kHz, apart from these 4 files:

filename                                          sample rate
we04_ext_015_003363_20141211_020457_0994.wav      19 kHz
we04_ext_027_006551_20141219_223657_0993.wav      23.6 kHz
we04_ext_016_003584_20141211_164857_0984.wav      17.5 kHz
we04_ext_004_000764_20141203_204857_0994.wav      23.5 kHz

I don't know if that's related, but it might have led to weird things during the reshaping?
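
For reference, this is roughly how I checked the sample rates (sketch; it assumes the soundfile package is available, and original_folder has to be adapted to point to your raw audio folder):

import soundfile as sf
from pathlib import Path

original_folder = Path("/path/to/glider_WHOI_2014_we04/original_audio")  # hypothetical path, to be adapted
for wav in sorted(original_folder.glob("*.wav")):
    sample_rate = sf.info(str(wav)).samplerate
    if sample_rate != 60_000:  # most of your files are sampled at 60 kHz
        print(wav.name, sample_rate)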

Gautzilla commented 2 days ago

We reshaped your original audio after removing the four files that weren't sampled at 60 kHz.

We used the same file duration (53s) and a very small sample rate (100 Hz) to speed up the process.

This led to 15131 audio files, which is quite a bit more than the 13379 files contained in your reshaped folder.

After running the patched spectrogram generation, I get the expected number of png files:

Audio files:.............15131
Processed files:.........15131
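
These counts come from a check along these lines (sketch only; the folder names are assumed from the {duration}_{samplerate} convention, here 53_100 for 53 s files resampled to 100 Hz):

from pathlib import Path

dataset_path = Path("/path/to/glider_WHOI_2014_we04")  # hypothetical path to our copy of the dataset
audio_files = list((dataset_path / "data" / "audio" / "53_100").glob("*.wav"))
image_files = list((dataset_path / "processed" / "spectrogram" / "53_100").rglob("*.png"))
print(f"Audio files:{len(audio_files):.>18}")
print(f"Processed files:{len(image_files):.>14}")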

I merged the PR; I'll let you confirm that the bug is resolved on a healthy dataset!