cazaudo opened 1 month ago
I think this issue has been resolved with PR #57.

When you call `generate_spectro`, the spectrogram metadata csv file is now created before the building of the job files:

```python
dataset.prepare_paths()
spectrogram_metadata_path = dataset.save_spectro_metadata(False)
for batch in range(dataset.batch_number):
    ...
```
The path to this metadata csv is passed as an argument to the `qsub_spectrogram_generator_pkg.py` script:

```python
script_args=f"--dataset-path {dataset.path} "
f"--dataset-sr {dataset.dataset_sr} "
f"--batch-ind-min {i_min} "
f"--batch-ind-max {i_max} "
f"--spectrogram-metadata-path {spectrogram_metadata_path} "  # HERE
f"{'--overwrite ' if overwrite else ''}"
f"{'--save-for-LTAS ' if save_welch else ''}"
f"{'--save-matrix ' if save_matrix else ''}",
```
And the `Spectrogram` object in this script is instantiated from the updated metadata csv (line 58):

```python
dataset = Spectrogram.from_csv(dataset_path=args.dataset_path, metadata_csv_path=args.spectrogram_metadata_path)
```

I'll let you confirm that these changes solve your issue before closing it!
I have just tested it again and I don't think the problem has been solved: `batch_size` is derived from `nber_files_to_process` (which is derived from `dataset.list_audio_to_process`, l. 215) independently from `spectrogram_metadata_path`, which is called too late. So even if the dataset path is modified, the number of files to be processed will be equal to the number of original files, not to the number of reshaped ones.
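To make the issue concrete, here is a minimal sketch of the batching logic described above (hypothetical and simplified; names like `nber_files_to_process` follow the comment, not the actual toolkit code). Since the file list still points at the original files, the batch bounds never reflect the reshaped dataset:

```python
# Simplified sketch: batch bounds are computed from the length of
# dataset.list_audio_to_process, so if that list still holds the ORIGINAL
# files, the reshaped files beyond that count are never scheduled.
def batch_bounds(list_audio_to_process, batch_number):
    nber_files_to_process = len(list_audio_to_process)
    batch_size = nber_files_to_process // batch_number
    bounds = []
    for batch in range(batch_number):
        i_min = batch * batch_size
        i_max = (
            i_min + batch_size
            if batch < batch_number - 1
            else nber_files_to_process  # last batch takes the remainder
        )
        bounds.append((i_min, i_max))
    return bounds
```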
Quick fix: `dataset.list_audio_to_process` should be redefined based on `dataset.spectro_duration` and `dataset.dataset_sr` somewhere before it is used in `generate_spectro`.
I'm not sure I understand the problem correctly, because I can't reproduce it.
Here's a short example where I use a small dataset.

> so even if the dataset path is modified

Our workflows seem to differ here: I don't think you have to change the dataset path.
With the following workflow, everything seems to work as intended:
```python
dataset.spectro_duration = 600  # seconds
dataset.dataset_sr = 22_000  # Hz

datetime_begin = "2023-04-05T14:49:06+0000"
datetime_end = "2023-04-05T15:29:06+0000"
```
On dataset.initialize() call, the analysis folder is created as intended:
```
small_dataset
└── data
    └── audio
        ├── 600_22000
        │   ├── 2023_04_05_14_49_06.wav
        │   ├── 2023_04_05_14_59_06.wav
        │   ├── 2023_04_05_15_09_06.wav
        │   ├── 2023_04_05_15_19_06.wav
        │   ├── metadata.csv
        │   └── timestamp.csv
        └── 3600_128000
```
```python
dataset.nfft = 1_024
dataset.window_size = 4_096
dataset.overlap = 10
dataset.concat = True

generate_spectro(
    dataset=dataset,
    path_osmose_dataset=path_osmose_dataset,
    write_datasets_csv_for_aplose=write_datasets_csv_for_aplose,
    overwrite=True,
    save_matrix=False,
    save_welch=False,
    datetime_begin=datetime_begin,
    datetime_end=datetime_end,
)
```
```python
datetime_begin = "2023-04-05T14:49:00+0000"
datetime_end = "2023-04-05T15:00:00+0000"  # changed end datetime

dataset.nfft = 512  # changed nfft
dataset.window_size = 4_096
dataset.overlap = 10
dataset.concat = True

generate_spectro(...)
```
```python
datetime_begin = "2023-04-05T14:49:00+0000"
datetime_end = "2023-04-05T15:49:00+0000"

dataset.spectro_duration = 3600  # seconds
dataset.dataset_sr = 128_000  # Hz
dataset.nfft = 512

generate_spectro(...)
```
The spectrograms are created as intended for the 2 requested generations (the weird names of the spectrograms in `3600_128000` just come from the fact that the original audio files were not renamed):
```
processed
└── spectrogram
    ├── 600_22000
    │   ├── 512_4096_10_linear
    │   │   ├── image
    │   │   │   ├── 2023_04_05_14_49_06_1_0.png
    │   │   │   └── 2023_04_05_14_59_06_1_0.png
    │   │   ├── matrix
    │   │   └── metadata.csv
    │   └── 1024_4096_10_linear
    │       ├── image
    │       │   ├── 2023_04_05_14_49_06_1_0.png
    │       │   ├── 2023_04_05_14_59_06_1_0.png
    │       │   ├── 2023_04_05_15_09_06_1_0.png
    │       │   └── 2023_04_05_15_19_06_1_0.png
    │       ├── matrix
    │       └── metadata.csv
    └── 3600_128000
        └── 512_4096_10_linear
            ├── image
            │   ├── 7189.230405144906_1_0.png
            │   └── 7189.230405154906_1_0.png
            ├── matrix
            └── metadata.csv
```
I just noticed that `7189.230405154906_1_0.png` appears to start later than the specified `datetime_end = "2023-04-05T15:49:00+0000"` timestamp.
I thought this might have been a lead to your bug (maybe it created 2 spectrograms because the previous batch had 2 spectrograms too?), but further tests failed to confirm that.
I've reversed the order of the previous analyses (first a short time period one, then a full time period), and the batch number is correctly increased, as both spectrogram generations correctly cover the requested time periods.
I have created a test dataset on my side too so you can reproduce the problem: you can use the notebook /home/datawork-osmose/spectrogram_generator_issue58.ipynb on the dataset CINMS_C_test, using the osmose kernel (by the way, is it up to date?).
Actually, even when starting the analysis from scratch, i.e. with only the folder of original audio files (which is outside the scope of this issue, but never mind), the workflow does not behave as expected, as you can see below: 18341 audio files reshaped, but only 855 spectrograms generated (i.e. the number of original audio files).

```shell
$ ls /home/datawork-osmose/dataset/DCLDE2015_HF/CINMS_C_test/data/audio/60_32000/wav | wc -l
18341
$ ls /home/datawork-osmose/dataset/DCLDE2015_HF/CINMS_C_test/processed/spectrogram/60_32000/1024_1024_0_linear/image/png | wc -l
855
```
Note also the bug in the `monitor_job(dataset)` cell of this notebook.

The progress bar also miscounts files, e.g. `Audio file preparation : ONGOING ( 20331 / 536 )`.
We're having trouble accessing Datarmor from ENSTA's network. I'll take a look as soon as the connection is back.
The same bug was observed on another dataset; you can reproduce it with /home/datawork-osmose/bug_repro_spectro_generator.ipynb:

```shell
datarmor0 dcazau/osmose-toolkit% ls /home/datawork-osmose/dataset/glider_WHOI_2014_we04/data/audio/53_31000/*wav | wc -l
13379
datarmor0 dcazau/osmose-toolkit% ls /home/datawork-osmose/dataset/glider_WHOI_2014_we04/processed/spectrogram/53_31000/1024_1024_0_linear/image/*png | wc -l
6081
```
@cazaudo The problem should be fixed in the version that you can find in #64.

However, there seems to be a problem with the reshaping done in the `glider_WHOI_2014_we04` dataset: with the updated version of the toolkit, the `53_31000/1024_1024_0_linear` spectrogram generation leads to 12111 spectrograms out of 13379 audio files.

The problem appears to be that only 12111 files are listed in the `timestamp.csv` file. Did you notice any error or warning during the `dataset.initialize()` step?

There are 1268 missing files, which seem to belong to chunks scattered all along the dataset time period:
```python
from pathlib import Path

import pandas as pd

audio_folder = Path(r"/home/datawork-osmose/dataset/glider_WHOI_2014_we04/data/audio/53_31000")
timestamps = pd.read_csv(audio_folder / "timestamp.csv")
files = [p.name for p in audio_folder.glob("*.wav")]
t_files = list(timestamps["filename"])
missing_files = sorted(f for f in files if f not in t_files)
print("\n".join(missing_files[::50]))  # Shows the names of the missing files by steps of 50
"""
2014_12_03_20_24_57.wav
2014_12_10_05_01_51.wav
2014_12_10_06_41_50.wav
2014_12_10_08_21_50.wav
2014_12_10_10_01_50.wav
2014_12_10_11_41_50.wav
2014_12_10_13_21_50.wav
2014_12_10_15_01_50.wav
2014_12_10_16_41_50.wav
2014_12_10_18_21_50.wav
2014_12_10_20_01_50.wav
2014_12_10_21_41_50.wav
2014_12_10_23_21_50.wav
2014_12_11_01_01_50.wav
2014_12_11_05_37_50.wav
2014_12_11_07_17_50.wav
2014_12_11_08_57_50.wav
2014_12_11_10_37_50.wav
2014_12_11_12_17_50.wav
2014_12_11_13_57_50.wav
2014_12_11_15_37_50.wav
2014_12_19_15_21_50.wav
2014_12_19_17_01_50.wav
2014_12_19_18_41_50.wav
2014_12_19_20_21_50.wav
2014_12_19_22_01_50.wav
"""
```
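The scattering of the missing chunks can be checked by parsing the `%Y_%m_%d_%H_%M_%S.wav` naming convention of the reshaped files. A hypothetical sketch (`missing_files` would come from the comparison above; the two names below are just examples from the list):

```python
# Hypothetical sketch: recover the start datetime encoded in each reshaped
# wav filename to see how the missing files spread over the deployment.
from datetime import datetime


def parse_wav_timestamp(name: str) -> datetime:
    """Parse a '%Y_%m_%d_%H_%M_%S.wav' filename into a datetime."""
    return datetime.strptime(name, "%Y_%m_%d_%H_%M_%S.wav")


missing_files = ["2014_12_03_20_24_57.wav", "2014_12_19_22_01_50.wav"]
times = sorted(parse_wav_timestamp(f) for f in missing_files)
print(times[0], "->", times[-1])  # span covered by the missing chunks
```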
EDIT: further analysis of your dataset shows a weird thing in your original data: most files are sampled at a rate of 60 kHz, apart from these 4 files:
| filename | sample rate |
|---|---|
| we04_ext_015_003363_20141211_020457_0994.wav | 19 kHz |
| we04_ext_027_006551_20141219_223657_0993.wav | 23.6 kHz |
| we04_ext_016_003584_20141211_164857_0984.wav | 17.5 kHz |
| we04_ext_004_000764_20141203_204857_0994.wav | 23.5 kHz |
I don't know if that's related, but it might have led to weird behavior during the reshaping.
We reshaped your original audio after removing the four files that weren't sampled at 60 kHz. We used the same file duration (53 s) and a very small sample rate (100 Hz) to speed up the process. This led to 15131 audio files, noticeably more than the 13379 files contained in your reshaped folder.
After running the patched spectrogram generation, I get the expected number of png files:

```
Audio files:.............15131
Processed files:.........15131
```
I merged the PR; I'll let you confirm that the bug is resolved on a healthy dataset!
When the reshaping has already been done (which prints the message as desired: "It seems these spectrogram parameters are already initialized. If it is an error or you want to rerun the initialization, add the force_init argument."), the `list_audio_to_process` attribute of the dataset instance takes the list from the original dataset and NOT from the analysis dataset as desired.

My very dirty fix for now: before calling `generate_spectro`, I define this variable as follows:

```python
import glob
import os

dataset.list_audio_to_process = [
    os.path.basename(x)
    for x in glob.glob('/home/datawork-osmose/dataset/boussole_MERMAID_v2/data/audio/60_32000/*wav')
]
```