Audio-WestlakeU / ATST-SED

This repo includes the official implementations of "Fine-tune the pretrained ATST model for sound event detection".
MIT License

Testing the model with new data #8

Closed magicalvoice closed 2 months ago

magicalvoice commented 6 months ago

Hi,

@SaoYear Thank you for the great work. I am new to the problem of SED. I have fine-tuned the model with my own data, and now I just want to test the final fine-tuned model on test audio files for which I have no ground truth. Is there a script to do that without preparing the .tsv files with onset, offset, event label, etc. in the DESED format?

Basically, how do I use the model on completely unknown input audio? If you could tell me the steps, it would be really helpful.

Thank you so much in advance!

SaoYear commented 6 months ago

Hi,

Thanks for your interest!

If I understand you correctly, what you want is to run the model's inference on some unlabeled audio clips.

This is the same process we use to submit evaluation results for the DCASE challenge, and this functionality is integrated into the DCASE baseline code as well as the ATST-SED code, supported by PyTorch Lightning.

To do this:

  1. You might notice that in the config file (e.g., train/configs/stage1.yaml in this repo) there are only eval_folder and eval_folder_44k, and no eval_tsv. So first, point these two entries to your own paths (see the config sketch after these steps). If your audio is already at 16 kHz, just change eval_folder to your own path. Otherwise, change eval_folder_44k and the script will automatically resample the audio to 16 kHz.

  2. Run the evaluation. For example, to use the fine-tuned model, run:

    python train_stage2.py --gpus 0, --eval_from_checkpoint YOUR_PRETRAINED_CKPT_PATH

    The system will run inference on the data automatically, and the predictions will be stored in your exp folder.
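For reference, here is a hypothetical excerpt of the two entries mentioned in step 1 (the key names follow the description above; the nesting and paths are placeholders, so check your own stage1.yaml):

```yaml
data:
  eval_folder: /path/to/my_eval_audio_16k        # used as-is if your audio is already 16 kHz
  eval_folder_44k: /path/to/my_eval_audio_44k    # resampled to 16 kHz automatically
```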

Hope these help : )

magicalvoice commented 6 months ago

Thank you so much @SaoYear. This really helped, much appreciated!!

magicalvoice commented 5 months ago

Hi @SaoYear,

This might be a silly question, but I am getting the error below when testing in the way you described. I have made sure that all input files are 10 s long and the same size, but I don't know why I am getting this error or how to fix it.

Please help me @SaoYear

```
(dcase2023) empuser@server:~/ATST-SED-Scripts/ATST-SED/train$ python train_stage2.py --gpus 1 --eval_from_checkpoint exp/stage2/version_0/epoch=209-step=23100.ckpt
/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train
> /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/train_stage2.py(505)<module>()
-> configs, args, test_model_state_dict, evaluation = prepare_run()
(Pdb) c
loaded model: exp/stage2/version_0/epoch=209-step=23100.ckpt at epoch: 209
Global seed set to 42
32
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading student from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading teacher from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
Testing DataLoader 0:   0%| | 0/32 [00:00<?, ?it/s]
torch.Size([1, 1, 2505, 128])
torch.Size([1, 128, 626, 1])
shape of x original: torch.Size([1, 1001, 768])
shape of pos original torch.Size([1, 250, 768])
Traceback (most recent call last):
  File "train_stage2.py", line 505, in <module>
    configs, args, test_model_state_dict, evaluation = prepare_run()
  File "train_stage2.py", line 374, in single_run
    trainer.test(desed_training)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in test
    return call._call_and_handle_interrupt(
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
    return self._run_evaluate()
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 368, in test_step
    return self.model.test_step(*args, **kwargs)
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/local/ultra_sed_trainer.py", line 529, in test_step
    strong_preds_student, weak_preds_student = self.detect(sed_feats, atst_feats, self.sed_student)
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/local/ultra_sed_trainer.py", line 241, in detect
    return model(self.scaler(self.take_log(mel_feats)), pretrained_feats)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/CRNN_e2e.py", line 94, in forward
    embeddings = self.atst_frame(pretrain_x)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/engine/empuser/anaconda3/envs/dcase2023/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/atst_model.py", line 18, in forward
    atst_x = self.atst.get_intermediate_layers(
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/audio_transformer.py", line 212, in get_intermediate_layers
    x, _, _, _, _, patch_length = self.prepare_tokens(x, mask_index=None, length=length, mask=False)
  File "/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/desed_task/nnet/atst/audio_transformer.py", line 146, in prepare_tokens
    x = x + pos
RuntimeError: The size of tensor a (1001) must match the size of tensor b (250) at non-singleton dimension 1
Testing DataLoader 0:   0%| | 0/32 [00:00<?, ?it/s]
```

SaoYear commented 5 months ago

Hi, it seems you made some modifications to audio_transformer.py (since your line 146 in audio_transformer.py differs from the line in this repo), and now the lengths of your positional embeddings and patch embeddings are not aligned (one is 1001 and the other is 250).

I will try to explain what happens in the prepare_tokens function, which might help you debug:

  1. We apply linear patching to turn every 4 consecutive frames into one patch embedding, which reduces the temporal resolution from 1001 to 1001 // 4 = 250;
  2. We use the cut mode for the positional embeddings: we take 250 trainable positional embeddings and add them to the patch embeddings (lines 143-144 here).

I would recommend printing the shapes of x and pos right before the addition so you can see which length is wrong; both should be 250.
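For illustration, here is a minimal self-contained sketch of that shape check (the tensors are random stand-ins, not the repo's actual variables):

```python
import torch

# After linear patching, 1001 mel frames become 1001 // 4 = 250 patch embeddings;
# the (cut-mode) positional embeddings must have the same length before x = x + pos.
n_frames, embed_dim = 1001, 768
x = torch.randn(1, n_frames // 4, embed_dim)    # patch embeddings  -> length 250
pos = torch.randn(1, n_frames // 4, embed_dim)  # positional embeddings -> length 250

print("x:", x.shape, "pos:", pos.shape)
assert x.shape[1] == pos.shape[1] == 250
x = x + pos  # a length mismatch here raises the RuntimeError reported above
```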

Angelalilyer commented 4 months ago


I have also encountered this problem. May I ask how long the audio clips in eval_folder should be? My test set clips are 10 s each, but the shapes do not seem to match:


```python
audio, atst_feats, labels, padded_indxs, filenames = batch
print(audio.shape)                       # [1, 441882]
sed_feats = self.mel_spec(audio)         # should be [1, 128, 624]
atst_feats = self.atst_norm(atst_feats)  # should be [1, 64, 500]
print(sed_feats.shape)                   # torch.Size([1, 128, 626])
print(atst_feats.shape)                  # torch.Size([1, 64, 1001])
```

SaoYear commented 4 months ago

The shape of your waveforms is incorrect. You should resample them to 16kHz.

To do so, you could refer to the resample_data_generate_durations function (actually the resample_folder function in local.resample_folder) in the DESED baseline code. It will automatically create a folder of resampled audio for you. Then change eval_folder to the path of that folder, and the evaluation should work (a generic resampling sketch is given below).
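For reference, here is a generic resampling sketch with torchaudio (this is not the repo's resample_folder; the folder names are placeholders):

```python
from pathlib import Path

import torchaudio


def resample_folder_to_16k(in_dir: str, out_dir: str, target_sr: int = 16000) -> None:
    """Resample every .wav in in_dir to target_sr and write it to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(in_dir).glob("*.wav")):
        audio, sr = torchaudio.load(str(wav_path))
        if sr != target_sr:
            audio = torchaudio.functional.resample(audio, sr, target_sr)
        torchaudio.save(str(out / wav_path.name), audio, target_sr)


# e.g. a 10 s clip goes from ~441000 samples at 44.1 kHz to 160000 samples at 16 kHz
resample_folder_to_16k("eval_audio_44k", "eval_audio_16k")
```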

Angelalilyer commented 4 months ago


Hello! Thank you very much for your help! But I still have some questions:

  1. I checked my log output and found that the predicted categories only include the following classes: Alarm_bell_ringing, Blender, Cat, Dishes, Dog, Electric_shaver_toothbrush, Frying, Running_water, Speech, Vacuum_cleaner. The model I used is "stage2_wo_external.ckpt". Is this a normal result?
  2. When I use "stage2_w_external.ckpt", I get the error "RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory". It seems the checkpoint file is damaged and I am unable to load it.
SaoYear commented 4 months ago

  1. Yes, this ATST-SED model is designed for the DESED dataset; these 10 classes are exactly the classes defined by DESED (DCASE challenge task 4). If you want to recognize more classes, such as those in AudioSet Strong, you should refer to the ATST repo here instead of ATST-SED.
  2. The _w_external.ckpt checkpoint is broken, but _wo_external.ckpt performs very similarly, so you can use that one. Actually, we did not pay much attention to fine-tuning the external-data checkpoint; it is only there for fair comparisons with other methods in our paper. (A quick way to check whether a checkpoint file is readable at all is sketched below.)
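As a side note, here is a generic way to check whether a downloaded checkpoint can be deserialized at all (the path is a placeholder); a truncated or corrupted download is what typically raises the "failed finding central directory" error quoted above:

```python
import torch

ckpt_path = "stage2_w_external.ckpt"               # placeholder path to the downloaded checkpoint
state = torch.load(ckpt_path, map_location="cpu")  # raises RuntimeError if the file is corrupted
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])                 # Lightning checkpoints usually contain e.g. 'state_dict'
```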
Angelalilyer commented 3 months ago

Hello! Your reply was very helpful to me. I carefully reviewed the code for ATST-Frame. If I want to obtain frame-level sound event detection for my test dataset, should I change and run this part of the code?

######################## Frame-level downstream tasks ########################
DESED: please see shell/downstream/finetune_dcase
Strongly labelled AudioSet: please see shell/downstream/finetune_as_strong
##############################################################################

The inference code would be "audiossl/audiossl/methods/atstframe/downstream/train_as_strong.py" and the inference model would be "atstframe_base.ckpt". May I ask if my guess is correct? Thanks~~!!

SaoYear commented 3 months ago

is there a way to run the model's inference on audio of different durations? Maybe cutting the audio into chunks before using the model, or changing something within the model to support longer audio?

Yeah, you could refer to what we did in the ATST-RCT system, last paragraph of Section 3.

Quick summary:

  1. use a fixed-length window (say, W seconds) and slide it over the long audio (L seconds);
  2. use a hop length of K seconds;
  3. this gives you N = (L - W) / K + 1 audio clips of W seconds each;
  4. run inference on these N clips;
  5. aggregate the results (you might average, or take the logical OR of, the overlapping frames) and pass them to the median filter; a rough sketch of this procedure is given below.
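A rough, self-contained sketch of that procedure (this is not the ATST-RCT code; the model, the 16 kHz mono waveform, and the output frame rate are illustrative assumptions):

```python
import torch


def windowed_sed_inference(model, audio, sr=16000, win_s=10.0, hop_s=5.0,
                           n_classes=10, frames_per_sec=25):
    """Slide a win_s-second window over `audio` ([..., n_samples]) with a hop of
    hop_s seconds, run the model on each chunk, and average overlapping frames."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    total_frames = int(audio.shape[-1] / sr * frames_per_sec)
    preds = torch.zeros(n_classes, total_frames)
    counts = torch.zeros(total_frames)

    for start in range(0, max(audio.shape[-1] - win, 0) + 1, hop):
        chunk = audio[..., start:start + win]
        with torch.no_grad():
            p = model(chunk.unsqueeze(0))[0]   # assumed output: [n_classes, chunk_frames]
        f0 = int(start / sr * frames_per_sec)
        f1 = min(f0 + p.shape[-1], total_frames)
        preds[:, f0:f1] += p[:, : f1 - f0]
        counts[f0:f1] += 1

    preds /= counts.clamp(min=1)               # average the overlapping frames
    return preds                               # then threshold and median-filter per class
```
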
SaoYear commented 3 months ago

Okay, thank you! Is there an implementation of this already?

You could refer to the ATST-RCT repo; I just uploaded a necessary file.

Please see test_step in the trainer file for this part of the implementation.

SaoYear commented 3 months ago

Yeah, there are three steps:

  1. split the audio into shorter clips (test_step, lines 651-673);
  2. feed them to the model (the forward function);
  3. unify the predictions (the pull_back_preds function in utils.py).

magicalvoice commented 3 months ago

Hi, has anyone written separate code just for inference, i.e., loading the model and trained weights and running it on 10 s audio files to get per-file predictions (maybe with some post-processing too)?

If anyone has done it, please help me with that, how to do it.

Thank you.

SaoYear commented 3 months ago

@martineghiazaryan @magicalvoice I will write a quick inference script

SaoYear commented 3 months ago

Will finish it by this week ;)

On Jul 11, 2024, at 20:28, Martin Yeghiazaryan wrote: @SaoYear hey, any news on the script?


SaoYear commented 3 months ago

Hey guys, sorry for the delay, but I have added an inference file in the latest commit.

You can run it with python -m inference. The path of the waveform to run inference on can be changed inside the code.

If you have any other problem, please let me know.

magicalvoice commented 3 months ago

@SaoYear First of all, thank you so much for your kind help!!

I have another question: how do I interpret the result in inference_result.png? It shows True/False for each chunk based on a threshold, but which class does each row belong to?

Also, I want to clarify a doubt. I read your paper "Fine-tune the pretrained ATST model for sound event detection"; it basically trains and fine-tunes the ATST-Frame model with the help of a CRNN, because DESED is quite a small dataset for fine-tuning ATST-Frame, right? But then, who is the teacher and who is the student in this case? I am getting confused because the results for both stages have a student and a teacher.

I am stuck on a lot of questions. Please help. Thanks in advance!!

SaoYear commented 3 months ago


  1. To decode the class names, I have imported the class_dict dictionary in the inference.py file; each class name is assigned to a row index of the sed_results matrix.
  2. Yes, this work focuses on fine-tuning the pretrained model for the small-scale DESED dataset.
  3. As for your confusion about the student and the teacher: a. the student and teacher in this work are concepts from the MeanTeacher method, a semi-supervised learning method, so "student" and "teacher" refer to the two models in MeanTeacher. You could refer to the original MeanTeacher paper, but basically, the teacher model is just the exponential moving average (EMA) of the student model.

     b. Therefore, whenever we use the MeanTeacher semi-supervised method there is a student and a teacher model. In stage 1 we use MeanTeacher, so there is a student and a teacher, and the stage-1 teacher is the EMA of the stage-1 student. In stage 2 we also use MeanTeacher, so there is again a student and a teacher, and the stage-2 teacher is the EMA of the stage-2 student.

     c. Using MeanTeacher for SED stems from JiaKai's work, the winning system of the 2018 DCASE challenge. Since then, student and teacher models have appeared in most SED systems, because they all use the MeanTeacher method.

     d. Specifically, in this work we find that these semi-supervised methods are not only useful for small models but also helpful for fine-tuning pretrained models. A minimal sketch of the EMA teacher update is given below.
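For clarity, a minimal sketch of the EMA teacher update at the heart of MeanTeacher (illustrative, not the repo's exact implementation):

```python
import torch


@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, ema_decay: float = 0.999) -> None:
    """After each training step, move the teacher's weights toward the student's:
    teacher = ema_decay * teacher + (1 - ema_decay) * student."""
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)
```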

HeChengHui commented 3 months ago

@SaoYear Thank you for your inference code. I tried using my own audio with stage_2_wo_external and got the results shown in the attached image.

Is it supposed to do this? The audio has parts with people talking.

magicalvoice commented 3 months ago

Hi @SaoYear, thank you, I understood. After stage 2 training, i.e., fine-tuning both the CRNN and ATST-Frame, can I use only the CRNN weights separately? Is there a way, and if yes, how?

SaoYear commented 3 months ago


@magicalvoice I never tried that, but I suppose that using only the CRNN part of ATST-SED would not be better than a CRNN trained from scratch.

If you want to do that, you could just comment out the ATST features and the merge-layer MLP, and feed the CNN output to the RNN directly.

The CNN trained in ATST-SED is regarded as compensation for some local features that are ignored by FrameATST, and the RNN in ATST-SED is trained on the fused features from both FrameATST and the CNN. If you use just the CRNN part of the entire model, both the CNN and the RNN would be weakened, and therefore the overall performance would suffer. (A toy illustration of the fusion is sketched below.)
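To make the fusion concrete, here is a toy illustration (not the repo's CRNN_e2e; the layer sizes and module names are made up): CNN features and pretrained frame-level features are concatenated, merged by an MLP, and passed to the RNN, while the CRNN-only variant simply skips the pretrained branch.

```python
import torch
import torch.nn as nn


class ToyFusionSED(nn.Module):
    def __init__(self, n_mels=128, cnn_dim=128, atst_dim=768, hidden=256, n_classes=10, use_atst=True):
        super().__init__()
        self.use_atst = use_atst
        self.cnn = nn.Conv1d(n_mels, cnn_dim, kernel_size=3, padding=1)  # stand-in for the CNN stack
        self.merge_mlp = nn.Linear(cnn_dim + atst_dim, cnn_dim)          # fuses CNN + pretrained features
        self.rnn = nn.GRU(cnn_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel, atst_feats=None):
        x = self.cnn(mel).transpose(1, 2)                  # [B, T, cnn_dim]
        if self.use_atst and atst_feats is not None:
            x = self.merge_mlp(torch.cat([x, atst_feats], dim=-1))
        x, _ = self.rnn(x)
        return torch.sigmoid(self.head(x))                 # frame-level class probabilities
```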

SaoYear commented 3 months ago


Hi @HeChengHui, would you mind posting the wav file? There could be some problem with the inference process.

HeChengHui commented 3 months ago

@SaoYear mixed.zip

The audio is >10 s, but the code seems to handle it by splitting and overlapping.

SaoYear commented 2 months ago

Hi @HeChengHui, thanks for sharing the wav.

The splitting and overlapping are the intended behaviour of the inference script. I have fixed some problems in the original inference code:

  1. the model is now set to evaluation mode (model.eval()) after loading;
  2. the visualization of the SED results is improved; the class labels are now added to the plot, as requested by @magicalvoice.

Now the inference looks fine.

For the audio clip you provided, the SED results look like the attached image.

magicalvoice commented 2 months ago

@SaoYear are you logging the validation loss too? Where can I find it?

SaoYear commented 2 months ago


@magicalvoice Sorry for the late response. The logging of the validation loss is implemented in ultra_sed_trainer.py, lines 477-491.

BTW, you can view it in TensorBoard with:

```
cd YOUR_LOG_DIR
tensorboard --logdir ./ --bind_all
```