Results mismatch when evaluating on NSynth

jasonppy commented 2 months ago

Hi Zhifeng,

I observed a significant mismatch in the numbers when evaluating Audio Flamingo on the NSynth test set.

| Dataset | Task | Metric | Prior SotA | AF reported | AF reproduced |
|---|---|---|---|---|---|
| NS instrument | CLS | Acc | 78.8 | 77.1 | 48.3 |
| NS quality | CLS | F1 | 46.3 | 66.7 | 0 |
| NS source | CLS | Acc | 60.1 | 78.7 | 47.7 |

Below are the prefix, prompt, example ground truth, and example model output for each task:

| Dataset | Prefix | Prompt | Example GT | Example model output |
|---|---|---|---|---|
| NS instrument | The task is instrument classification. | what is the instrument of this music.\nOPTIONS:\n - organ.\n - mallet.\n - brass.\n - vocal.\n - keyboard.\n - reed.\n - flute.\n - guitar.\n - bass.\n - synth_lead.\n - string | bass | bass (NOTE: sometimes the model does not follow the instruction and outputs things like “wind instrument and woodwind instrument”) |
| NS quality | The task is quality classification. | what are the qualities of the music. | dark | the music is a combination of volume, frequency, and timbre. (NOTE: the model always outputs a sentence, and therefore scores 0 for F1) |
| NS source | The task is source classification | What is the source of this music.\nOPTIONS:\n - synthetic.\n - electronic.\n - acoustic | synthetic | acoustic (NOTE: sometimes the model puts a period at the end, like “acoustic.”; my parsing code removes it) |
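
The parsing described in the notes above (dropping a trailing period, then comparing against the option list) could look roughly like the sketch below. The function names are hypothetical and this is not the evaluation code used by either party in this thread; it only assumes that a prediction counts as correct when it exactly equals the ground truth after normalization.

```python
import string

# Option list copied from the source-classification prompt above.
SOURCE_OPTIONS = ["synthetic", "electronic", "acoustic"]


def normalize_prediction(text):
    """Lowercase the text and strip surrounding whitespace and punctuation."""
    return text.strip().lower().strip(string.punctuation + " ")


def match_option(text, options):
    """Return the normalized prediction if it is one of the options, else None."""
    pred = normalize_prediction(text)
    return pred if pred in options else None


def accuracy(predictions, ground_truths, options):
    """Fraction of examples whose matched option equals the ground truth.

    Off-list answers (free-form sentences) map to None and count as wrong.
    """
    correct = sum(
        match_option(p, options) == g for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```

With this kind of exact matching, a free-form answer such as “wind instrument and woodwind instrument” is scored as incorrect rather than partially matched.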

zhifengkongnv commented 2 months ago

The prefix is "The task is music information retrieval", and the prompt is "this music note is". The model will then print "produced by ...", following the template in Table 1 of the Pengi paper.

jasonppy commented 2 months ago

Thanks! After changing the prefix to "The task is music information retrieval" and the prompt to "this music note is", the model still does not produce the instrument, source, or qualities in the expected format. Some example outputs:

GT: source: synthetic, instrument: bass, model output: the music is a combination of volume, frequency, and timbre.

GT: source: electronic, instrument: keyboard, model output: the music is described as dynamic and full-bodied.

zhifengkongnv commented 2 months ago

This looks strange. I just tested the model outputs and it can follow the instructions.

{'name': 'NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by bass, pitch 20, velocity 127, source synthetic, and having qualities like bright, distortion, long release'

{'name': 'NSynth/nsynth-test/audio/keyboard_electronic_098-023-050.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by keyboard, pitch 22, velocity 75, source electronic, and having qualities like long release'
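
For scoring, the ground-truth instrument and source can be read off the file name. The sketch below assumes that the name follows the pattern `<instrument>_<source>_<numeric fields>.wav`, which is inferred from the two paths quoted above rather than confirmed in this thread; `labels_from_filename` is a hypothetical helper.

```python
import os


def labels_from_filename(path):
    """Extract instrument and source from a name like 'bass_synthetic_009-017-025.wav'."""
    note = os.path.splitext(os.path.basename(path))[0]
    # Split from the right, because some instrument names contain an
    # underscore themselves (e.g. 'synth_lead' in the option list).
    family_and_source, _numeric_fields = note.rsplit("_", 1)
    instrument, source = family_and_source.rsplit("_", 1)
    return {"instrument": instrument, "source": source}


print(labels_from_filename("NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav"))
```

Splitting from the right keeps multi-word instrument names such as `synth_lead` intact.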

jasonppy commented 2 months ago

Apologies, I've been using the wrong prompt for this one. When evaluating the output, do you extract the results with a regular expression or by checking whether the correct answer appears in the sentence?

zhifengkongnv commented 2 months ago

Here's the parsing code, FYI:

    def parse_output(output):
        # Example outputs:
        # "is produced by keyboard, pitch 102, velocity 100, source acoustic, and having qualities like percussive, reverb"
        # "is produced by acoustic mallet, pitch 27, velocity 25 and having qualities like percussive"

        # Text after the last occurrence of `keyword`, up to the next ', '; None if the keyword is absent.
        get_single = lambda keyword: output.split(keyword)[-1].split(', ')[0].strip().lower().replace('-', ' ') if keyword in output else None
        instrument = get_single('produced by')
        # The source is sometimes fused with the instrument, e.g. "acoustic mallet".
        # The None check keeps off-template outputs from raising AttributeError.
        if instrument is not None and instrument.split(' ')[0] in ['acoustic', 'electronic', 'synthetic']:
            source = instrument.split(' ')[0]
            instrument = ' '.join(instrument.split(' ')[1:])
        else:
            source = get_single('source')

        # First whitespace-delimited token after `keyword`, with commas removed.
        get_single2 = lambda keyword: output.split(keyword)[-1].split(' ')[0].strip().lower().replace(',', '') if keyword in output else None
        pitch = get_single2('pitch ')
        velocity = get_single2('velocity ')

        # Everything after `keyword`, split into a list on ', '.
        get_multiple = lambda keyword: output.split(keyword)[-1].strip().lower().replace('-', ' ').split(', ') if keyword in output else None
        qualities = get_multiple('and having qualities like')

        return {
            'instrument': instrument,
            'pitch': pitch,
            'velocity': velocity,
            'source': source,
            'qualities': qualities,
        }
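
For comparison, the same template can also be parsed with regular expressions, the other option raised in the question above. This is a hypothetical alternative sketch, not the code used to produce the reported numbers; `parse_output_regex` and its patterns are assumptions based only on the example outputs in this thread.

```python
import re

SOURCES = ("acoustic", "electronic", "synthetic")


def parse_output_regex(output):
    """Regex-based sketch; any field not found in the output stays None."""
    text = output.lower()
    fields = {"instrument": None, "pitch": None, "velocity": None,
              "source": None, "qualities": None}

    m = re.search(r"produced by ([^,]+)", text)
    if m:
        instrument = m.group(1).strip().replace("-", " ")
        # Handle a source fused with the instrument, e.g. "acoustic mallet".
        first, _, rest = instrument.partition(" ")
        if first in SOURCES and rest:
            fields["source"], instrument = first, rest
        fields["instrument"] = instrument

    if fields["source"] is None:
        m = re.search(r"source (\w+)", text)
        if m:
            fields["source"] = m.group(1)

    for key in ("pitch", "velocity"):
        m = re.search(rf"{key} (\d+)", text)
        if m:
            fields[key] = m.group(1)

    m = re.search(r"and having qualities like (.+)", text)
    if m:
        fields["qualities"] = [q.strip().replace("-", " ")
                               for q in m.group(1).split(",")]
    return fields


print(parse_output_regex(
    "produced by bass, pitch 20, velocity 127, source synthetic, "
    "and having qualities like bright, distortion, long release"
))
```

An output that does not follow the template, such as "the music is a combination of volume, frequency, and timbre.", leaves every field as `None`, so it can be counted as a miss instead of raising an error.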

NVIDIA / audio-flamingo

Results mismatch when evaluating on NSynth #8