LoieSun / Auto-ACD

code for A Large-scale Dataset for Audio-Language Representation Learning
Creative Commons Zero v1.0 Universal

caption quality #4

Open alexanderwerning opened 2 weeks ago

alexanderwerning commented 2 weeks ago

Hi, looking at the captions, I noticed a few things. Some captions contain raw probabilities: searching for "0." or "(" finds numbers in brackets (together about 6% of captions), "probability" appears in about 2.5%, and "label" in 4.6%. These proportions are relative to the full training data. As you probably still have the full data used to generate the captions, maybe you could regenerate these captions and release them as a v2 or something?
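For anyone who wants to reproduce this kind of check, here is a minimal sketch of how such proportions could be counted. The pattern names, the regular expressions, and the example captions are assumptions made for illustration; the exact expressions behind the percentages above may differ.

```python
import re

# Hypothetical patterns mirroring the checks described above; the exact
# expressions used for the reported percentages are not known.
PATTERNS = {
    "raw_number_or_bracket": re.compile(r"0\.|\("),
    "probability": re.compile(r"probability", re.IGNORECASE),
    "label": re.compile(r"label", re.IGNORECASE),
}


def artifact_proportions(captions):
    """Return the fraction of captions that match each pattern."""
    total = len(captions)
    if total == 0:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for caption in captions if pattern.search(caption)) / total
        for name, pattern in PATTERNS.items()
    }


# Made-up example captions, not taken from the dataset.
example_captions = [
    "A dog barks (0.87) in a park.",
    "Rain falls on a roof, creating a steady patter.",
    "The label suggests music with probability 0.6.",
    "A car engine idles on a quiet street.",
]
proportions = artifact_proportions(example_captions)
```

Running the same function over the full caption file would give numbers comparable to the ones quoted, assuming the patterns match what was actually searched for.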

A lot of captions (33%) contain the word "creating", which separates a literal description from a more high-level one. Have you tested how this influences the model's ability to learn a higher-level audio understanding?

I am still in the process of downloading the audio data for the dataset, so I have not yet been able to test this by training a model myself.

What do you think? Thanks!

LoieSun commented 1 week ago

Hi, thank you for your interest. We will provide the pipeline code soon, and there are currently no plans to develop v2.

To reduce the impact on the model, we augmented the captions during training by applying a 25% random mask to the words: https://github.com/LoieSun/Auto-ACD/blob/08a2fd9dd3e4f2e81bc9cadff727b7aff1945d6e/laion_clap/training/data.py#L612
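As a rough illustration of what a 25% random word mask could look like, here is a minimal sketch. It is not the code from the linked `data.py`: the function name `random_word_mask`, its parameters, and the choice to drop masked words (rather than replace them with a mask token) are all assumptions.

```python
import random


def random_word_mask(caption, mask_ratio=0.25, rng=None):
    """Drop each word independently with probability mask_ratio.

    Hypothetical helper for illustration; the repository's own
    augmentation may select or replace words differently.
    """
    if rng is None:
        rng = random
    words = caption.split()
    kept = [word for word in words if rng.random() >= mask_ratio]
    # Keep at least one word so the caption is never empty.
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


# Made-up caption, not taken from the dataset.
caption = "Rain falls on a roof, creating a steady patter"
masked = random_word_mask(caption, mask_ratio=0.25, rng=random.Random(0))
```

Because each word is dropped independently, any single trigger word such as "creating" or "probability" is absent from roughly a quarter of the views of a caption, which is presumably how the augmentation softens the model's reliance on it.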