GeWu-Lab / bias_in_AVS

Official repository for "Unveiling and Mitigating Bias in Audio Visual Segmentation" in ACM MM 2024

About the pkl file #4

Closed YenanLiu closed 3 days ago

YenanLiu commented 4 days ago

Dear authors,

Thank you for your reply! So should I understand that your work cannot be reproduced, even if you release part of your pre-trained model, because you cannot release the audio pkl file? You say the code is easy, but you do not want to release the intermediate results. Even if I can run the code by setting all queries active, it only involves the image modality.

PeiwenSun2000 commented 3 days ago

1. audio pkl

So should I understand that your work cannot be reproduced, even if you release part of your pre-trained model, because you cannot release the audio pkl file? You say the code is easy, but you do not want to release the intermediate results.

This sentence is a bit hard to parse, but I will do my best to help you. I'm not certain which pkl file you're referring to. Did you mean the npy file mentioned here? Apologies for any confusion. For the audio feature extraction with VGGish described in Section 4.1 of the paper, please refer to this link; it is only about three lines of code.
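For readers following along, here is a minimal sketch of VGGish feature extraction using the publicly available torchvggish hub model; the file names are placeholders, and the exact preprocessing used in the paper may differ.

```python
# Minimal sketch: extract 128-d VGGish embeddings for one audio clip.
# Assumes the torchvggish hub model (harritaylor/torchvggish); paths are placeholders.
import torch
import numpy as np

model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    embeddings = model.forward("audio.wav")  # roughly (num_seconds, 128)

np.save("audio_vggish.npy", embeddings.cpu().numpy())
```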

If there is any confusion about the feature extraction, please refer to our previous AVS works:

- https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation
- https://github.com/GeWu-Lab/Stepping-Stones
- https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference

If you mean the classification and clustering, setting all queries as active will allow the entire project to run, which should help in understanding this work.

2. Questions about image modality.

Even if I can run the code by setting all queries active, it only involves the image modality.

The active queries are defined by audio clustering and audio classification, as shown in Eq. (1) of the paper. They are designed to address audio priming bias and do not involve the image modality.
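To make the idea concrete, below is an illustrative sketch (not the authors' exact Eq. (1)) of deriving an active-query mask from audio classification scores; `audio_logits`, `num_queries`, and `top_k` are hypothetical names, and `all_active=True` corresponds to the "set all queries active" fallback discussed above.

```python
# Illustrative sketch only, not the exact Eq. (1) from the paper:
# derive an active-query mask from audio classification scores.
import torch

def active_query_mask(audio_logits: torch.Tensor, num_queries: int,
                      top_k: int = 3, all_active: bool = False) -> torch.Tensor:
    """Return a (num_queries,) boolean mask of active queries.

    audio_logits: (num_classes,) classification scores for one audio clip,
    assuming each class maps to one query slot (num_classes == num_queries).
    all_active=True reproduces the "set all queries active" fallback.
    """
    if all_active:
        return torch.ones(num_queries, dtype=torch.bool)
    probs = audio_logits.softmax(dim=-1)
    mask = torch.zeros(num_queries, dtype=torch.bool)
    mask[probs.topk(top_k).indices] = True
    return mask
```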

YenanLiu commented 3 days ago

Thank you for your reply! Since my last question was closed by the author, I am opening a new issue. It is not about the audio features extracted with VGGish. Line 105 in the code here suggests that we need to train a classification model to obtain the pth file for each audio clip. The author gives a link with details about BEATs classification; however, it is difficult to ensure that we get the same classification results, since the training details are missing. In my experience, audio classification results are not always reliable, especially for in-the-wild data. In the paper, the author suggests that the classification results are not trivial for the final segmentation performance. I think it would be best to release the training code, or simply the pth file. That way, we could use the released pre-trained model to reproduce the superior performance reported in your paper.

PeiwenSun2000 commented 3 days ago

Thanks for your suggestion.

Yes, indeed. Classification is not always good, but it still provides valuable prior information to the model.

I did not specifically design the classification network; as far as I remember, it is a simple network with BEATs plus several linear layers. I have changed institutions since May, but I will try to retrieve these files for the benefit of the community.
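For peers who want to attempt this before those files are retrieved, here is a minimal sketch under the assumption stated above (a frozen BEATs encoder followed by a few linear layers). The loading calls follow the microsoft/unilm BEATs repository; the checkpoint path, number of classes, hidden size, and pooling choice are placeholders, not the authors' exact training setup.

```python
# Sketch of a simple audio classifier: frozen BEATs encoder + a small linear head.
# Assumes the BEATs code/checkpoint from microsoft/unilm; paths and sizes are placeholders.
import torch
import torch.nn as nn
from BEATs import BEATs, BEATsConfig  # from the microsoft/unilm BEATs repository

class AudioClassifier(nn.Module):
    def __init__(self, beats_ckpt: str, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        ckpt = torch.load(beats_ckpt, map_location="cpu")
        cfg = BEATsConfig(ckpt["cfg"])
        self.encoder = BEATs(cfg)
        self.encoder.load_state_dict(ckpt["model"])
        for p in self.encoder.parameters():   # keep the pretrained encoder frozen
            p.requires_grad = False
        self.head = nn.Sequential(            # "several linear layers"
            nn.Linear(cfg.encoder_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, waveform_16khz: torch.Tensor) -> torch.Tensor:
        # waveform_16khz: (batch, num_samples) raw audio at 16 kHz
        feats, _ = self.encoder.extract_features(waveform_16khz)
        return self.head(feats.mean(dim=1))   # mean-pool over time, then classify
```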

That said, peers are still encouraged to at least try it on their own and fine-tune on top of our model. I do not think the classification performance is critical enough to affect the overall performance, as shown in the ablation in the paper.

Many thanks for your understanding.