Closed magicalvoice closed 2 months ago
Hi,
Thanks for your interest!
If I understand you correctly, you want to use the model to run inference on some unlabeled audio clips.
This is the same process we use to submit evaluation results to the DCASE challenge, and it is integrated in the DCASE baseline code as well as in the ATST-SED code, supported by pytorch-lightning.
To do this:
You might notice that in the config file (e.g., train/configs/stage1.yaml in this repo), there are only eval_folder and eval_folder_44k, and no eval_tsv.
So first, modify these two paths to your own. If your audios are already at 16 kHz, just change eval_folder to your own path. Otherwise, change eval_folder_44k, and the script will automatically resample the audios to 16 kHz.
Run the evaluation. For example, if you want to use the fine-tuned model, run:
python train_stage2.py --gpus 0, --eval_from_checkpoint YOUR_PRETRAINED_CKPT_PATH
The system will run inference on the data automatically, and the predicted results will be stored in your exp folder.
Hope these help : )
Thank you so much @SaoYear. This really helped, much appreciated!!
Hi @SaoYear,
This might be a silly question, but I am getting the error below when testing in the same way you described. Although I have ensured every input file is 10 s in duration and the file sizes are the same, I don't know why I am getting this error or how to fix it.
Please help me @SaoYear
(dcase2023) empuser@server:~/ATST-SED-Scripts/ATST-SED/train$ python train_stage2.py --gpus 1 --eval_from_checkpoint exp/stage2/version_0/epoch=209-step=23100.ckpt
/nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train
> /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/train_stage2.py(505)
-> configs, args, test_model_state_dict, evaluation = prepare_run()
(Pdb) c
loaded model: exp/stage2/version_0/epoch=209-step=23100.ckpt at epoch: 209
Global seed set to 42
32
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading student from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
Loading ATST from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/atst_as2M.ckpt
Loading teacher from: /nfs/engine/empuser/ATST-SED-Scripts/ATST-SED/train/exp/stage1/version_0/epoch=39-step=4400.ckpt
Model loaded
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
Testing DataLoader 0:   0%| | 0/32 [00:00<?, ?it/s]
torch.Size([1, 1, 2505, 128])
torch.Size([1, 128, 626, 1])
shape of x original: torch.Size([1, 1001, 768])
shape of pos original torch.Size([1, 250, 768])
Traceback (most recent call last):
File "train_stage2.py", line 505, in
Hi, it seems like you made some modifications to audio_transformer.py (since your line 146 in audio_transformer.py differs from the line in this repo). As a result, the lengths of your positional embeddings and patch embeddings are not aligned (one is 1001 and the other is 250). I will try to explain what happens in the function prepare_tokens, which might help you debug:
- We apply a linear patching that transforms each run of 4 consecutive frames into one patch embedding; this reduces the temporal resolution from 1001 to 1001 // 4 = 250;
- We use "cut" mode for the positional embeddings: take 250-length trainable positional embeddings and add them to the patch embeddings (lines 143-144 here).
I would recommend printing the shapes of x and pos before the addition, so you can tell whose length is wrong; both should be 250.
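The two steps above can be sketched with dummy tensors (shapes only; patch_proj, max_len, and the exact patching here are illustrative stand-ins, not the repo's actual prepare_tokens code):

```python
import torch

# Dummy shapes mirroring the discussion: 1001 input frames, 768-dim features.
B, T, D = 1, 1001, 768
x = torch.randn(B, T, D)

# Step 1: linear patching over every 4 consecutive frames -> 1001 // 4 = 250 patches.
patch_size = 4
n_patches = T // patch_size                       # 250
x = x[:, : n_patches * patch_size, :]             # drop the 1 leftover frame
patches = x.reshape(B, n_patches, patch_size * D)
patch_proj = torch.nn.Linear(patch_size * D, D)   # illustrative projection
tokens = patch_proj(patches)                      # (1, 250, 768)

# Step 2: "cut" mode -> slice the trainable positional embeddings to the token length.
max_len = 1000                                    # assumed maximum length
pos_embed = torch.nn.Parameter(torch.zeros(1, max_len, D))
pos = pos_embed[:, : tokens.shape[1], :]          # (1, 250, 768)

# Both lengths must match before the addition; a 1001-vs-250 mismatch fails here.
out = tokens + pos
print(out.shape)  # torch.Size([1, 250, 768])
```

Printing tokens.shape and pos.shape right before the addition tells you which side is off.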
I have also encountered this problem. May I ask how many seconds long the audio in eval_folder should be? My test set is 10 s, but it seems the shapes cannot match:
audio, atst_feats, labels, padded_indxs, filenames = batch
print(audio.shape)                       # torch.Size([1, 441882])
sed_feats = self.mel_spec(audio)         # should be [1, 128, 624]
atst_feats = self.atst_norm(atst_feats)  # should be [1, 64, 500]
print(sed_feats.shape)                   # torch.Size([1, 128, 626])
print(atst_feats.shape)                  # torch.Size([1, 64, 1001])
The shape of your waveforms is incorrect. You should resample them to 16 kHz.
To do so, you can refer to the resample_data_generate_durations function (actually the resample_folder func in local.resample_folder) in the DESED baseline code. This function will automatically create a folder for you. Then change eval_folder to the path of the created folder, and the evaluation should work.
Hello! Thank you very much for your help! But I still have some questions:
- Yes, this ATST-SED model is designed for DESED dataset. These 10 classes are exactly the classes defined by the DESED dataset (DCASE challenge task 4). If you want to recognize more classes such as AudioSet Strong, you should refer to ATST repo here, instead of ATST-SED.
- The _external.ckpt is broken, but the _wo_external.ckpt performs very similarly; you can use the _wo_external.ckpt one. Actually, we did not pay much attention to fine-tuning the _external.ckpt; this checkpoint is just used for fair comparisons with other methods in our paper.
Hello! Your reply was very helpful to me. I carefully reviewed the code for ATST-Frame. If I want to obtain frame-level sound event detection for my test dataset, should I change and run this part of the code?
######################## Frame-level downstream tasks
DESED: please see shell/downstream/finetune_dcase
Strongly labelled AudioSet: please see shell/downstream/finetune_as_strong
#########################
The inference code is audiossl/audiossl/methods/atstframe/downstream/train_as_strong.py and the inference model is atstframe_base.ckpt. May I ask if my guess is correct? Thanks~~!!
Is there a way to run model inference on different durations of audio? Maybe cutting the audio into frames before using the model, or changing something within the model to support longer audio?
Yeah, you could refer to what we've done in the ATST-RCT system, last paragraph of section 3.
Quick summary: cut the L-second audio into N = (L - W) / K + 1 overlapping W-second clips with hop K; you get N W-second audio clips by this process, run the model on each, and merge the predictions back.

Okay, thank you! Is there an implementation of this already?
You could refer to the ATST-RCT repo; I just uploaded a necessary file.
Please see the test_step in the trainer file for this part of the implementation.
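A rough sketch of that chunk-and-merge scheme (function names, and the window/hop values, are illustrative, not the actual test_step / pull_back_preds implementation):

```python
import torch

def chunk_audio(wav, win_len, hop_len):
    """Split a long waveform (1, L) into N = (L - W) // K + 1 overlapping
    W-sample chunks with hop K, as described for the ATST-RCT system."""
    L = wav.shape[-1]
    n_chunks = (L - win_len) // hop_len + 1
    return torch.stack(
        [wav[..., i * hop_len : i * hop_len + win_len] for i in range(n_chunks)],
        dim=0,
    )

def merge_chunk_preds(chunk_preds, hop_frames, total_frames):
    """Average overlapping frame-level predictions back onto the full
    timeline (a rough analogue of pull_back_preds in utils.py;
    the real implementation may differ)."""
    n_chunks, win_frames, n_classes = chunk_preds.shape
    out = torch.zeros(total_frames, n_classes)
    count = torch.zeros(total_frames, 1)
    for i in range(n_chunks):
        s = i * hop_frames
        out[s : s + win_frames] += chunk_preds[i]
        count[s : s + win_frames] += 1
    return out / count.clamp(min=1)

# Example: 30 s audio at 16 kHz, 10 s windows, 5 s hop -> (480000-160000)//80000+1 = 5 chunks.
chunks = chunk_audio(torch.randn(1, 480000), 160000, 80000)
print(chunks.shape)  # torch.Size([5, 1, 160000])
```

Each chunk is then fed to the model, and the per-chunk frame predictions are averaged wherever windows overlap.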
Yeah, there are three steps: chunk the audio, run inference on each chunk, and pull the overlapped predictions back onto the original timeline (see the pull_back_preds function in utils.py).

Hi, has anyone written separate code for just inference, like loading the model and trained weights and running it on 10 s audios to get per-file predictions (maybe with some post-processing too)?
If anyone has done it, please help me with how to do it.
Thank you.
@martineghiazaryan @magicalvoice I will write a quick inference script; will finish it by this week ; )

On Jul 11, 2024, at 20:28, Martin Yeghiazaryan @.***> wrote: @SaoYear hey, any news on the script?
Hey guys, sorry for the delay, but I have added an inference file in the latest commit.
You can run the inference file with:
python -m inference
The path of the waveform to run inference on can be changed inside the code.
If you have any other problem, please let me know.
@SaoYear First of all, thank you so much for your kind help!!
I have another question: how do I interpret this result (inference_result.png)? It shows True/False for each chunk based on a threshold, but which class does each row belong to?
Also, I want to clarify a doubt. I read your paper "Fine-tune the pretrained ATST model for sound event detection"; it basically trains and fine-tunes the ATST-Frame model with the help of a CRNN, because DESED is quite small for fine-tuning ATST-Frame. But then who is the teacher and who is the student in this case? I am getting confused because the results for both stages have a student and a teacher.
I am stuck with a lot of questions. Please help, thanks in advance!!
The class names are listed in the inference.py file, and each class name is assigned to a row index of the sed_results matrix.
As for your confusion about the student and teacher:
a. The student and teacher in this work are concepts from the MeanTeacher method, a semi-supervised method; student and teacher refer to the two models in that method. You could refer to the original MeanTeacher work, but basically, the teacher model is just the exponential moving average (EMA) of the student model.
b. Therefore, whenever we use the MeanTeacher semi-supervised method, there is a student and a teacher model. In stage 1 we use MeanTeacher, so there are a student and a teacher, and the teacher in stage 1 is the EMA of the student in stage 1. In stage 2 we also use MeanTeacher, so there are again a student and a teacher, and the teacher in stage 2 is the EMA of the student in stage 2.
c. Using MeanTeacher for SED stems from JiaKai's work, a winning system in the 2018 DCASE challenge. Since then, student and teacher models have appeared in SED systems again and again, because they all use the MeanTeacher method.
d. Specifically in this work, we find that the previous semi-supervised methods are not only useful for small models but also helpful for fine-tuning pretrained models.
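The teacher-as-EMA-of-the-student relationship can be sketched in PyTorch (a minimal illustration with a toy Linear model, not the repo's actual trainer code):

```python
import copy
import torch

def update_teacher(student, teacher, ema=0.999):
    """MeanTeacher update: the teacher's weights track the exponential
    moving average (EMA) of the student's; the teacher is never trained
    by gradient descent itself."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema).add_(s_p, alpha=1.0 - ema)

student = torch.nn.Linear(4, 2)
teacher = copy.deepcopy(student)   # teacher starts as a copy of the student
# ... an optimizer step on the student would happen here each iteration ...
update_teacher(student, teacher)   # then the teacher tracks it via EMA
```

The same update is applied independently in stage 1 and stage 2: each stage has its own student and its own EMA teacher.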
@SaoYear Thank you for your inference code. I tried using my own audio with stage_2_wo_external and I got the following results.
Is it supposed to do this? The audio has parts of people talking.
Hi @SaoYear, thank you, I understood. After stage 2 training, i.e. fine-tuning both the CRNN and ATST-Frame, can I use only the CRNN weights separately? Is there a way? If yes, how?
@magicalvoice I never tried that. But I suppose that using only the CRNN part of ATST-SED would not be better than a CRNN trained from scratch.
If you want to do it, you could just comment out the ATST features and the merge-layer MLP, and feed the CNN output to the RNN directly.
The CNN trained in ATST-SED is regarded as compensation for some local features that are ignored by FrameATST, and the RNN in ATST-SED is trained to learn the fused features from both FrameATST and the CNN. If you use just the CRNN part of the entire model, the performance of both the CNN and the RNN would be weakened, and therefore the overall performance would be weakened.
Hi @HeChengHui, would you mind posting the wav file? There could be some problems in the inference process.
@SaoYear mixed.zip
The audio is >10 s, but the code seems to handle it by splitting and overlapping.
Hi @HeChengHui , thanks for sharing the wav.
The splitting and overlapping are intended behavior of the inference. I have fixed some problems in the original inference code, including setting the model to evaluation mode (model.eval()) after it is loaded. Now the inference looks fine.
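The model.eval() fix matters because layers like dropout (and batch-norm running statistics) behave differently in train mode; a minimal PyTorch illustration with a toy model:

```python
import torch

torch.manual_seed(0)
# Toy model: a checkpoint loaded into a module left in train mode (the default).
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(p=0.5))
x = torch.randn(1, 8)

model.train()                 # dropout active: repeated forward passes typically differ
y1, y2 = model(x), model(x)

model.eval()                  # dropout disabled: inference becomes deterministic
y3, y4 = model(x), model(x)
print(torch.equal(y3, y4))  # True
```

Forgetting eval() after load_state_dict leaves dropout on, which randomly zeroes activations and corrupts SED predictions.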
According to the audio clip you provided, the SED results look like:
@SaoYear are you logging the validation loss too? Where can I find it?
@magicalvoice Sorry for the late response. The logging of the validation loss is implemented in ultra_sed_trainer.py, lines 477-491.
BTW, you could view it on the tensorboard, using the command:
cd YOUR_LOG_DIR
tensorboard --logdir ./ --bind_all
Hi,
@SaoYear Thank you for the great work. I am new to the problem of SED. I have fine-tuned with my own data; now I just want to test the final fine-tuned model on test audio files for which I have no ground truth. Is there a script to do that without preparing the .tsv files with onset, offset, event label, etc. in the DESED data format?
Basically, how do I use the model on completely unknown input audio? If you could tell me the steps, it would be really helpful.
Thank you so much in advance!