NVIDIA / audio-flamingo

PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities.
MIT License

Results mismatch when evaluating on NSynth #8

Closed jasonppy closed 2 months ago

jasonppy commented 2 months ago

Hi Zhifeng,

I observed a significant mismatch in the numbers when evaluating audio-flamingo on the NSynth test set.

 
| Task | Metric | Prior SotA | AF reported | AF reproduced |
|---|---|---|---|---|
| NS instrument CLS | Acc | 78.8 | 77.1 | 48.3 |
| NS quality CLS | F1 | 46.3 | 66.7 | 0 |
| NS source CLS | Acc | 60.1 | 78.7 | 47.7 |

Below are the prefix, prompt, example ground truth, and example model output for each task:
 
| Task | Prefix | Prompt | Example gt | Example model output |
|---|---|---|---|---|
| NS instrument | The task is instrument classification. | what is the instrument of this music.\nOPTIONS:\n - organ.\n - mallet.\n - brass.\n - vocal.\n - keyboard.\n - reed.\n - flute.\n - guitar.\n - bass.\n - synth_lead.\n - string | bass | bass (NOTE: sometimes the model will not follow the instruction and outputs things like “wind instrument and woodwind instrument”) |
| NS quality | The task is quality classification. | what are the qualities of the music. | dark | the music is a combination of volume, frequency, and timbre. (NOTE: the model always outputs a sentence, and therefore scores 0 for F1) |
| NS source | The task is source classification | What is the source of this music.\nOPTIONS:\n - synthetic.\n - electronic.\n - acoustic | synthetic | acoustic (NOTE: sometimes the model puts a period at the end, like “acoustic.”; my parsing code removes it) |

Do you spot any issues?

Thanks for your time!

zhifengkongnv commented 2 months ago

The prefix is "The task is music information retrieval", and the prompt is "this music note is". The model will print "produced by ...", following the template in Table 1 of the Pengi paper.

jasonppy commented 2 months ago

Thanks! After changing the prefix to "The task is music information retrieval" and the prompt to "this music note is", the model still does not produce the instrument, source, or qualities in the expected format. Some example outputs:

GT: source: synthetic, instrument: bass, model output: the music is a combination of volume, frequency, and timbre.
GT: source: electronic, instrument: keyboard, model output: the music is described as dynamic and full-bodied.

zhifengkongnv commented 2 months ago

This looks strange. I just tested the model outputs and it can follow the instructions.

{'name': 'NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by bass, pitch 20, velocity 127, source synthetic, and having qualities like bright, distortion, long release'

{'name': 'NSynth/nsynth-test/audio/keyboard_electronic_098-023-050.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by keyboard, pitch 22, velocity 75, source electronic, and having qualities like long release'
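
For anyone reproducing this, the file names above already carry most of the ground truth, since NSynth names follow the pattern family_source_id-pitch-velocity. Below is a minimal sketch of reading the labels back out and building a query in the dict format shown above. The helper names `ground_truth_from_name` and `make_query` are made up for illustration and are not part of this repo. Qualities are not encoded in the file name, so they would have to come from the dataset metadata.

```python
import os

# The three NSynth source labels that appear in this thread.
SOURCES = {"acoustic", "electronic", "synthetic"}


def ground_truth_from_name(path):
    """Read labels from an NSynth file name (hypothetical helper).

    Example: 'bass_synthetic_009-017-025.wav' is
    family 'bass', source 'synthetic', pitch 17, velocity 25.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    # Split from the right so the last two '-' fields are pitch and velocity.
    instrument_str, pitch, velocity = stem.rsplit("-", 2)
    # Split from the right again so families like 'synth_lead' stay intact.
    family, source, _instrument_id = instrument_str.rsplit("_", 2)
    if source not in SOURCES:
        raise ValueError(f"unexpected source {source!r} in {path!r}")
    return {
        "instrument": family,
        "source": source,
        "pitch": int(pitch),
        "velocity": int(velocity),
    }


def make_query(path):
    """Build a query dict with the prefix and prompt quoted in this thread."""
    return {
        "name": path,
        "prefix": "The task is music information retrieval.",
        "prompt": "this music note is",
    }


if __name__ == "__main__":
    name = "NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav"
    print(make_query(name))
    print(ground_truth_from_name(name))
```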

jasonppy commented 2 months ago

Apologies, I had been using the wrong prompt for this one. When evaluating the output, do you extract the results with a regular expression, or by checking whether the correct answer appears in the sentence?
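
To make the difference between the two options concrete, here is a rough sketch, not the evaluation code used for the paper. It assumes the "produced by ..." template quoted earlier in the thread. The function names are hypothetical, and the free-form sentence is a made-up example of an off-template output.

```python
import re

# Instrument options taken from the prompt in the first post.
INSTRUMENTS = ["organ", "mallet", "brass", "vocal", "keyboard", "reed",
               "flute", "guitar", "bass", "synth_lead", "string"]


def labels_found_by_substring(output):
    """Return every instrument label that occurs anywhere in the output."""
    text = output.lower()
    return [label for label in INSTRUMENTS if label in text]


def instrument_by_regex(output):
    """Return the text between 'produced by' and the next comma, or None."""
    match = re.search(r"produced by\s+([^,]+)", output.lower())
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    templated = ("produced by bass, pitch 20, velocity 127, source synthetic, "
                 "and having qualities like bright, distortion, long release")
    free_form = "the sound of a bass guitar"  # made-up off-template output

    for output in (templated, free_form):
        print(labels_found_by_substring(output), instrument_by_regex(output))
```

On the templated output both approaches agree. On the free-form sentence the substring check matches two labels at once, while the regex returns None, so the sample can be counted as a failure to follow the template rather than as an ambiguous hit.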

zhifengkongnv commented 2 months ago

Here's the parsing code fyi

    def parse_output(output):
        # Example outputs:
        # "is produced by keyboard, pitch 102, velocity 100, source acoustic, and having qualities like percussive, reverb"
        # "is produced by acoustic mallet, pitch 27, velocity 25 and having qualities like percussive"
        # Text after `keyword` up to the next ', ', lowercased; None if the keyword is absent.
        get_single = lambda keyword: output.split(keyword)[-1].split(', ')[0].strip().lower().replace('-', ' ') if keyword in output else None
        instrument = get_single('produced by')
        # The source may be fused into the instrument ("acoustic mallet"); skip this check if nothing was parsed.
        if instrument is not None and instrument.split(' ')[0] in ['acoustic', 'electronic', 'synthetic']:
            source = instrument.split(' ')[0]
            instrument = ' '.join(instrument.split(' ')[1:])
        else:
            source = get_single('source')

        get_single2 = lambda keyword: output.split(keyword)[-1].split(' ')[0].strip().lower().replace(',', '') if keyword in output else None
        pitch = get_single2('pitch ')
        velocity = get_single2('velocity ')

        get_multiple = lambda keyword: output.split(keyword)[-1].strip().lower().replace('-', ' ').split(', ') if keyword in output else None
        qualities = get_multiple('and having qualities like')

        return {
            'instrument': instrument, 
            'pitch': pitch, 
            'velocity': velocity, 
            'source': source, 
            'qualities': qualities, 
        }
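
Once each output is parsed into a dict like the one returned above, scoring could look roughly like the sketch below. The thread does not say which F1 variant is used for the quality task, so the micro-averaged F1 here is an assumption, and `accuracy` and `micro_f1` are hypothetical helper names. The gold labels in the demo are made up.

```python
def accuracy(preds, golds):
    """Fraction of exact matches; a missing prediction (None) counts as wrong."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def micro_f1(pred_lists, gold_lists):
    """Micro-averaged F1 over multi-label predictions (assumed variant)."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_lists, gold_lists):
        # A parser may return None when the keyword is absent from the output.
        pred_set, gold_set = set(pred or []), set(gold)
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Instruments parsed from the two outputs above vs. made-up gold labels.
    print(accuracy(["bass", "keyboard"], ["bass", "guitar"]))
    # Parsed qualities vs. made-up gold qualities.
    print(micro_f1([["bright", "distortion", "long release"]],
                   [["bright", "long release"]]))
    # A free-form sentence yields no parsed qualities, so F1 is 0.
    print(micro_f1([None], [["dark"]]))
```

The last call shows why the first attempt in this thread scored 0 on the quality task: when the output is a free-form sentence, nothing is parsed as a quality, so there are no true positives.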