XinhaoMei / DCASE2021_task6_v2

Code for CVSSP submission to DCASE 2021 Task 6

missing pickle file #7

Open Compil3-sudo opened 1 year ago

Compil3-sudo commented 1 year ago

Hi, I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'data/Clotho/pickles/456/train_keywords_dict_pred_5.p'

What could cause this? There is no train_keywords_dict_pred_x.p file. Is it missing, or did I do something wrong?

Compil3-sudo commented 1 year ago

Also, I would like to ask how the pickle dictionaries and the words_list files are generated. Is there a way to generate captions for audio files that do not have these pickle files? For example, would it be possible to use this model with different datasets to generate predicted captions, without a list of words or keywords?

XinhaoMei commented 1 year ago

> Hi, I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'data/Clotho/pickles/456/train_keywords_dict_pred_5.p'
>
> What could cause this? There is no train_keywords_dict_pred_x.p file. Is it missing, or did I do something wrong?

Hi, thanks for your interest.

First, the pickle files for keywords are under data/Clotho/pickles/456/. Files without 'pred' in their names contain the ground-truth keywords for the three sets, while files with 'pred' contain predicted keywords; you can use the predicted ones directly to train an audio captioning system. You may need to correct the file names in the trainer script.
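As a minimal sketch of how such a keyword pickle might be read: the snippet below writes a tiny stand-in file and loads it back. The dict structure (audio file name mapped to a list of keywords) and the sample contents are assumptions for illustration, not the repo's actual format.

```python
import pickle
from pathlib import Path

# Hypothetical stand-in for data/Clotho/pickles/456/train_keywords_dict_pred_5.p;
# written here so the snippet runs end to end.
path = Path("train_keywords_dict_pred_5.p")

# Assumed structure: {audio file name: list of keywords}.
sample = {"rain_on_roof.wav": ["rain", "roof", "water"]}
with path.open("wb") as f:
    pickle.dump(sample, f)

# Load the keyword dictionary back, as a trainer script would.
with path.open("rb") as f:
    keywords = pickle.load(f)

for clip, kws in keywords.items():
    print(clip, "->", kws)
```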

> Also, I would like to ask how the pickle dictionaries and the words_list files are generated. Is there a way to generate captions for audio files that do not have these pickle files? For example, would it be possible to use this model with different datasets to generate predicted captions, without a list of words or keywords?

Second, you don't need to use keywords: set the keywords option to false in the settings file, and the model will expect only an audio clip as input. The words_list.p file is the vocabulary, which is required for decoding.

Compil3-sudo commented 1 year ago

Thank you for the explanation. So from what I understand, it's not possible to decode the predicted captions without a vocabulary file, right? And I assume the words_list.p file is generated from the captions, so the system can't predict captions for a new dataset that has no captions?

XinhaoMei commented 1 year ago

We use the training data to build the vocabulary, which contains all possible words the model can generate. At each time step, we get a logit vector from the model, apply softmax to obtain a probability distribution over the vocabulary, and then sample a word from that distribution.
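The decoding step described above can be sketched as follows. The vocabulary and the logit values are made up for illustration; the real vocabulary comes from words_list.p, and the real logits come from the trained model.

```python
import math
import random

# Toy vocabulary standing in for words_list.p (built from training captions).
vocabulary = ["a", "dog", "barks", "rain", "falls", "<eos>"]

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A made-up logit vector for one decoding time step.
logits = [0.1, 2.5, 0.3, -1.0, 0.0, 0.2]
probs = softmax(logits)

# Sample the next word according to the distribution over the vocabulary.
word = random.choices(vocabulary, weights=probs, k=1)[0]
print(word)
```

This is why the model can only emit words it saw during training: whatever is sampled must be an entry of the vocabulary list.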

When you have new audio files, the model can still generate captions for them; the words are simply sampled from that fixed vocabulary.