jasonppy closed this issue 2 months ago.
Do you spot any issues?
Thanks for your time!
The prefix is "The task is music information retrieval", and the prompt is "this music note is". The model then prints "produced by ...", following the template in Table 1 of the Pengi paper.
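For reference, the answer format the model is expected to produce can be written down as a small formatter. This is a sketch: the name `render_target` is hypothetical, and the field order is inferred from the Audio Flamingo outputs quoted in this thread rather than copied from the Pengi paper.

```python
def render_target(instrument, pitch, velocity, source, qualities):
    """Render NSynth labels into the "produced by ..." answer template.

    Hypothetical helper: the field order mirrors the model outputs quoted
    in this thread, not an official specification.
    """
    text = f"produced by {instrument}, pitch {pitch}, velocity {velocity}, source {source}"
    # Qualities are optional; only append the clause when there are any.
    if qualities:
        text += ", and having qualities like " + ", ".join(qualities)
    return text


print(render_target("keyboard", 22, 75, "electronic", ["long release"]))
# produced by keyboard, pitch 22, velocity 75, source electronic, and having qualities like long release
```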
Thanks! Having changed the prefix to "The task is music information retrieval" and the prompt to "this music note is", the model still does not produce the instrument, source, or qualities in the expected format. Some example outputs:
GT: source: synthetic, instrument: bass; model output: "the music is a combination of volume, frequency, and timbre."
GT: source: electronic, instrument: keyboard; model output: "the music is described as dynamic and full-bodied."
This looks strange. I just tested the model, and its outputs do follow the instructions:
{'name': 'NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by bass, pitch 20, velocity 127, source synthetic, and having qualities like bright, distortion, long release'
{'name': 'NSynth/nsynth-test/audio/keyboard_electronic_098-023-050.wav', 'prefix': 'The task is music information retrieval.', 'prompt': 'this music note is'}
Audio Flamingo: 'produced by keyboard, pitch 22, velocity 75, source electronic, and having qualities like long release'
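To compare these outputs with ground truth, the labels can be recovered from the file names themselves. The sketch below assumes the standard NSynth note naming, `<family>_<source>_<instrument id>-<pitch>-<velocity>` (e.g. `bass_synthetic_009-017-025`); the function name `labels_from_nsynth_path` is hypothetical. It splits from the right so that families containing an underscore, such as `synth_lead`, stay intact.

```python
import os


def labels_from_nsynth_path(path):
    """Recover ground-truth labels from an NSynth file name.

    Assumes the NSynth naming convention
    <family>_<source>_<instrument id>-<pitch>-<velocity>.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    # "bass_synthetic_009-017-025" -> "bass_synthetic_009", "017", "025"
    instrument_str, pitch, velocity = stem.rsplit("-", 2)
    # Split from the right so "synth_lead_synthetic_000" keeps "synth_lead".
    family, source, _instrument_id = instrument_str.rsplit("_", 2)
    return {
        "instrument": family,
        "pitch": int(pitch),
        "velocity": int(velocity),
        "source": source,
    }


print(labels_from_nsynth_path("NSynth/nsynth-test/audio/bass_synthetic_009-017-025.wav"))
# {'instrument': 'bass', 'pitch': 17, 'velocity': 25, 'source': 'synthetic'}
```

Under that convention, the two files above encode pitch 17 / velocity 25 and pitch 23 / velocity 50, while the quoted outputs report pitch 20 / velocity 127 and pitch 22 / velocity 75.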
Apologies, I had been using the wrong prompt for this one. When evaluating the outputs, do you extract the results with a regular expression, or by checking whether the correct answer appears in the sentence?
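The two strategies in this question behave quite differently, which a short sketch makes concrete. The pattern and both function names are hypothetical and only cover the "produced by ..., pitch ..., velocity ..." prefix of the template seen in this thread. The strict regex rejects off-template answers outright, while the lenient containment check accepts any sentence that mentions the answer, which is unreliable for numbers (the answer "2" is contained in "pitch 20").

```python
import re

# Hypothetical pattern for the "produced by ..." template seen in this thread.
PATTERN = re.compile(
    r"produced by (?P<instrument>[^,]+), pitch (?P<pitch>\d+), velocity (?P<velocity>\d+)"
)


def extract_with_regex(output):
    """Strict: return parsed fields, or None if the output is off-template."""
    match = PATTERN.search(output.lower())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pitch"] = int(fields["pitch"])
    fields["velocity"] = int(fields["velocity"])
    return fields


def answer_in_sentence(output, answer):
    """Lenient: True if the ground-truth answer appears anywhere in the output."""
    return answer.lower() in output.lower()


print(extract_with_regex("produced by bass, pitch 20, velocity 127, source synthetic"))
# {'instrument': 'bass', 'pitch': 20, 'velocity': 127}
print(extract_with_regex("the music is described as dynamic and full-bodied."))
# None
print(answer_in_sentence("produced by bass, pitch 20", "2"))
# True  (a false positive for a numeric field)
```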
Here's the parsing code, FYI:
def parse_output(output):
    # Example outputs:
    # "is produced by keyboard, pitch 102, velocity 100, source acoustic, and having qualities like percussive, reverb"
    # "is produced by acoustic mallet, pitch 27, velocity 25 and having qualities like percussive"

    def get_single(keyword):
        # Text after the last occurrence of `keyword`, up to the next ", ".
        if keyword not in output:
            return None
        return output.split(keyword)[-1].split(', ')[0].strip().lower().replace('-', ' ')

    def get_single2(keyword):
        # First space-delimited token after `keyword`, with commas removed.
        if keyword not in output:
            return None
        return output.split(keyword)[-1].split(' ')[0].strip().lower().replace(',', '')

    def get_multiple(keyword):
        # Everything after `keyword`, split into a list on ", ".
        if keyword not in output:
            return None
        return output.split(keyword)[-1].strip().lower().replace('-', ' ').split(', ')

    instrument = get_single('produced by')
    # The source may be folded into the instrument name, e.g. "acoustic mallet".
    # Guard against None so off-template outputs do not raise AttributeError.
    if instrument is not None and instrument.split(' ')[0] in ['acoustic', 'electronic', 'synthetic']:
        source = instrument.split(' ')[0]
        instrument = ' '.join(instrument.split(' ')[1:])
    else:
        source = get_single('source')

    pitch = get_single2('pitch ')
    velocity = get_single2('velocity ')
    qualities = get_multiple('and having qualities like')
    return {
        'instrument': instrument,
        'pitch': pitch,
        'velocity': velocity,
        'source': source,
        'qualities': qualities,
    }
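Once outputs are parsed into dicts with the keys `parse_output` returns, a field-level score separates categorical fields from the numeric ones, where mismatches show up most clearly. This is a sketch under my own choice of metrics (exact-match accuracy for instrument and source, mean absolute error for pitch and velocity), not the evaluation protocol used for Audio Flamingo; the function name `score` is hypothetical. The example pairs the two outputs quoted earlier with the labels encoded in their NSynth file names.

```python
def score(predictions, references):
    """Field-level scores for parsed model outputs.

    Each prediction is a dict shaped like parse_output's result (pitch and
    velocity as strings, or None); each reference holds ground-truth labels
    with integer pitch and velocity.
    """
    n = len(references)
    correct = {"instrument": 0, "source": 0}
    abs_err = {"pitch": [], "velocity": []}
    for pred, ref in zip(predictions, references):
        # Exact string match; parse_output already lowercases its fields.
        for key in correct:
            if pred.get(key) == ref[key]:
                correct[key] += 1
        # Only score numbers that could actually be parsed.
        for key in abs_err:
            value = pred.get(key)
            if value is not None and str(value).isdigit():
                abs_err[key].append(abs(int(value) - ref[key]))
    result = {f"{key}_acc": count / n for key, count in correct.items()}
    for key, errs in abs_err.items():
        result[f"{key}_mae"] = sum(errs) / len(errs) if errs else None
    return result


preds = [
    {"instrument": "bass", "pitch": "20", "velocity": "127", "source": "synthetic"},
    {"instrument": "keyboard", "pitch": "22", "velocity": "75", "source": "electronic"},
]
refs = [
    {"instrument": "bass", "pitch": 17, "velocity": 25, "source": "synthetic"},
    {"instrument": "keyboard", "pitch": 23, "velocity": 50, "source": "electronic"},
]
print(score(preds, refs))
# {'instrument_acc': 1.0, 'source_acc': 1.0, 'pitch_mae': 2.0, 'velocity_mae': 63.5}
```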
Hi Zhifeng,
I observed a significant mismatch in the numbers when evaluating Audio Flamingo on the NSynth test set.