NeuroBench / neurobench

Benchmark harness and baseline results for the NeuroBench algorithm track.
https://neurobench.readthedocs.io
Apache License 2.0

MSWC pre-training task is very easy #190

Open V0XNIHILI opened 4 months ago

V0XNIHILI commented 4 months ago

I tried my hand at pre-training a few of the models I use on the MSWC base training set. I found that it was really easy to overfit (100% train accuracy) on this data (even with random time shifts):
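For reference, the random time shifts I used are along these lines (a minimal numpy sketch, not the exact training code; the function name and padding choice are my own):

```python
import numpy as np

def random_time_shift(waveform: np.ndarray, max_shift: int, rng=None) -> np.ndarray:
    """Shift a 1-D waveform by a random number of samples, zero-padding the gap."""
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(waveform)
    if shift > 0:
        shifted[shift:] = waveform[:-shift]   # shift right, zeros at the start
    elif shift < 0:
        shifted[:shift] = waveform[-shift:]   # shift left, zeros at the end
    else:
        shifted[:] = waveform
    return shifted
```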

[image: training/validation accuracy curves]

This model has 126k parameters and performs on par with the M5 model from the paper. I think with a bit more tuning I could get 98% validation accuracy with this model, meaning that for bigger models, 99-100% validation accuracy should be within close reach...

What are your thoughts?

jasonlyik commented 4 months ago

@Maxtimer97 Pinging Maxime, what do you think about the difficulty of the pre-train subset?

@V0XNIHILI Good point, the pre-train task is perhaps too easy, given that all of the words are distinct and have many syllables. On the image datasets used for FSCIL, pre-train accuracy is usually around 80%. It remains true, though, that maintaining high accuracy through the continual learning portion is unsolved, and that we monitor model complexity via network size and operations as well as accuracy.

Do you have any other ideas for how we can select a subset of words from the MSWC set?

Maxtimer97 commented 4 months ago

Yes, this is something we were already discussing before. One point is that the task is already a bit harder if you take an online recurrent model instead of a CNN: I get up to 95% test accuracy with a GRU with a similar number of MACs. Then there is the question of how to explain the dataset construction, which for now takes the longest words. And while keyword classification is an easy problem nowadays, we may want to push for exploration with very compact models (and of course there is the FSCIL part, which is still challenging).

Alternatively, we could make it harder by selecting the shortest words, but then we move away from promoting the processing of temporal information.

For me this is OK because the target is not to convert the deep learning community to our dataset, but rather the neuromorphic and tinyML communities. Considering the average models people play with right now (local plasticity, bio-plausible neurons, super sparse spiking, etc.), I think they will still find a challenge in this task.

Maxtimer97 commented 4 months ago

If we want to push for a real challenging temporal task with continual learning I would be up to brainstorm on that, but then we need to think of really long sequences and what transformers and state space models can do on such a task :)

V0XNIHILI commented 4 months ago

Selecting words differently could help a lot, I guess (for example, only selecting 'short' keywords): on 10+2-class Google Speech Commands, the model from my comment only gets 95% validation accuracy, and even 5-15M-parameter transformers only reach 98.x% accuracy.

For a harder task away from speech, I do not have a concrete proposal right now. Maybe it would be interesting to consult Neurobench's industry collaborators to see what they think could be a good task?

For non real-life sequential tasks, building on top of miniImageNet could be an option.

Maxtimer97 commented 4 months ago

OK, my remaining concern was that shorter keywords make the task less temporal, but I guess we are not testing long-range memory anyway.

So if you just want to move to shorter keywords @V0XNIHILI, I guess you could go ahead, generate the new dataset, and see how the baseline models perform. I can help if something doesn't work out of the box with the current models (actually, I noticed LIF models may work even better than adLIF for the prototype approach, so that would be something to try for the shorter keywords).
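For readers following along: the prototype approach mentioned here is nearest-class-mean classification over model embeddings, as commonly used in FSCIL. A minimal sketch (the helper names and the Euclidean metric are my own assumptions, not the harness API):

```python
import numpy as np

def build_prototypes(features: np.ndarray, labels: np.ndarray) -> dict:
    """Average the feature vectors of each class into a single prototype."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(feature: np.ndarray, prototypes: dict) -> int:
    """Assign the class whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(feature - prototypes[c]))
```

New classes can then be added in the continual learning sessions by computing prototypes from their few-shot samples, without retraining the backbone.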

The remaining point then is just what it means for the paper @jasonlyik, can we just resubmit an adapted version with the shorter keywords and updated results?

jasonlyik commented 4 months ago

@Maxtimer97 In terms of the paper, we are still waiting for reviews. Early on, before too many people are using the dataset, is the best time to change, if we want to change.

V0XNIHILI commented 4 months ago

So, how should we proceed then? For selecting shorter keywords, I think I will be able to do this. Then: what is the timeline/deadline for such a change?

jasonlyik commented 4 months ago

We should have until the final submission version is due, I will let you know when that may be. For now, could you look into shorter keywords within the next two or three weeks?

jasonlyik commented 3 months ago

@V0XNIHILI The reviews have returned. None were specifically concerned with the high starting accuracy on the MSWC task, but it is still something we can upgrade. We are planning to return the revision within the next 6-10 weeks. Are you still looking into the shorter keywords dataset changes?

V0XNIHILI commented 3 months ago

Okay, that sounds good. Yes I'd still like to do it but I haven't had the chance yet. I'll try to do it before the end of this month. If I can't manage timewise, I'll update you ASAP.

jasonlyik commented 2 months ago

Hey @V0XNIHILI, do you have any updates on this?

V0XNIHILI commented 2 months ago

Hey @jasonlyik, I have a small update. To stay true to the multilingual aspect of the original dataset, I kept five European languages (20 words each, 100 words total) and selected the shortest words (in characters) with more than 1000 samples each, removing all words with identical spelling. I also replaced Catalan with Spanish in the base set and Spanish with Turkish in the continual learning set, since Spanish and Catalan are quite similar to each other in terms of pronunciation/intonation.

I guess what remains then is to retrain on this set and see what the performance impact is?

import json

path = 'metadata.json'
with open(path, 'r') as f:
    data = json.load(f)

# Drop the 'filenames' entry of each language to make the dict lighter
for key in data:
    if 'filenames' in data[key]:
        del data[key]['filenames']

def get_words(data, langs, count, min_samples, non_overlap_words=None):
    selected = {}
    all_words = list(non_overlap_words or [])

    for lang in langs:
        wc = data[lang]['wordcounts']

        # Only keep words with more than min_samples samples
        candidates = [word for word, n in wc.items() if n > min_samples]

        # Sort by length so the shortest words come first
        candidates.sort(key=len)

        # Skip words already selected for another language (or passed in)
        candidates = [word for word in candidates if word not in all_words]

        if len(candidates) < count:
            raise ValueError('Not enough words for language: ' + lang)

        selected[lang] = candidates[:count]
        all_words.extend(selected[lang])

    return selected

base_langs = ['en', 'de', 'fr', 'es', 'it']
base_words = get_words(data, base_langs, 20, 1000)
all_base_words = [word for lang in base_words for word in base_words[lang]]

cont_words = get_words(data, ['fa', 'tr', 'ru', 'cy', 'it', 'eu', 'pl', 'eo', 'pt', 'nl'], 10, 500, all_base_words)

This yields:

>>> base_words

{'en': ['has',
  'the',
  'big',
  'car',
  'sat',
  'for',
  'age',
  'saw',
  'cut',
  'eye',
  'yet',
  'box',
  'hey',
  'six',
  'lie',
  'got',
  'boy',
  'son',
  'job',
  'set'],
 'de': ['gab',
  'mir',
  'sie',
  'tun',
  'ihn',
  'mit',
  'ich',
  'ihr',
  'oft',
  'tag',
  'und',
  'gar',
  'ist',
  'ort',
  'für',
  'aus',
  'ihm',
  'vor',
  'gut',
  'wer'],
 'fr': ['rue',
  'six',
  'lui',
  'non',
  'bas',
  'fin',
  'les',
  'oui',
  'peu',
  'moi',
  'dit',
  'été',
  'qui',
  'mes',
  'des',
  'une',
  'dès',
  'bon',
  'mon',
  'ici'],
 'es': ['así',
  'uno',
  'era',
  'día',
  'año',
  'han',
  'por',
  'hay',
  'con',
  'ese',
  'san',
  'una',
  'del',
  'muy',
  'las',
  'fue',
  'que',
  'sus',
  'vez',
  'los'],
 'it': ['era',
  'sua',
  'che',
  'due',
  'per',
  'tra',
  'sul',
  'più',
  'suo',
  'nel',
  'dei',
  'gli',
  'dal',
  'loro',
  'sono',
  'come',
  'anni',
  'dopo',
  'alla',
  'aveva']}
>>> cont_words

{'fa': ['صدا', 'دهم', 'یاد', 'کمک', 'خون', 'قبل', 'های', 'این', 'هیچ', 'کنه'],
 'tr': ['yüz',
  'ise',
  'iki',
  'çok',
  'beş',
  'var',
  'bin',
  'bir',
  'için',
  'ancak'],
 'ru': ['нам', 'эта', 'как', 'эти', 'его', 'мне', 'эту', 'все', 'уже', 'нет'],
 'cy': ['awr', 'mwy', 'dim', 'ble', 'nhw', 'gan', 'eto', 'chi', 'ond', 'fod'],
 'it': ['cui', 'nei', 'mai', 'sua', 'può', 'poi', 'due', 'sia', 'tre', 'tra'],
 'eu': ['bat', 'eta', 'edo', 'dut', 'ere', 'dio', 'lan', 'zer', 'oso', 'den'],
 'pl': ['jak', 'dla', 'gdy', 'też', 'był', 'ale', 'cię', 'czy', 'pod', 'nad'],
 'eo': ['sen', 'min', 'unu', 'kaj', 'tio', 'laŭ', 'oni', 'ĝin', 'ili', 'tro'],
 'pt': ['foi', 'mas', 'são', 'não', 'com', 'tem', 'seu', 'ela', 'ele', 'meu'],
 'nl': ['wat', 'bij', 'een', 'aan', 'dat', 'het', 'van', 'uit', 'dan', 'was']}
jasonlyik commented 2 months ago

Looks good. I hope the shortest words aren't too difficult, especially with many of them being single-syllable; also, because MSWC uses automated alignment, the recordings may not be very clean.

I'd guess we should shoot for something around the difficulty of the common FSCIL benchmarks, which looks to be around 75%-85% on base classes / session 0 (see a recent FSCIL paper).

I'll be finalizing the revision at the start of next month, if you could have any modification by the end of this month that would be great!

V0XNIHILI commented 2 months ago

Hey Jason, to give you an update: today I will schedule training on the above shortest words, but I will also try words with maximum length 5, 7 and 9. I can hopefully post the results of that tonight CET time.
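The maximum-length variants only need a small tweak to the selection logic from my earlier script; a hedged sketch (the function name is hypothetical, and `word_counts` stands for one language's `{word: count}` dict):

```python
def select_by_max_length(word_counts: dict, max_len: int, count: int, min_samples: int) -> list:
    """Keep words of at most max_len characters with enough samples, shortest first."""
    eligible = [w for w, n in word_counts.items() if n > min_samples and len(w) <= max_len]
    eligible.sort(key=len)
    if len(eligible) < count:
        raise ValueError('Not enough words for this maximum length')
    return eligible[:count]
```

Running this with `max_len` set to 5, 7, and 9 would then produce the three candidate subsets.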

V0XNIHILI commented 1 month ago

Validation accuracy progression:

[image: validation accuracy progression]

Final training accuracy:

[image: final training accuracy]

@jasonlyik sorry for my late reply, hopefully the results can still be useful...

Legend:

Trained with the same settings as the pre-training done for the paper (same network, same lr, etc.).

I used the same languages as before (base_langs = ['en', 'de', 'fr', 'es', 'it'] and cont_langs = ['fa', 'tr', 'ru', 'cy', 'it', 'eu', 'pl', 'eo', 'pt', 'nl']), although I didn't try the continual learning yet. I can try to run that as well.

The 3-character subset is definitely the hardest, but it aligns well with the goal of 75-85% validation accuracy on base classes. Furthermore, shorter words are also more realistic as keywords (e.g. "now", "yes", "go") compared to longer words (e.g. "canada", "international", "development").

Let me know what you think and how we should proceed!

jasonlyik commented 1 month ago

Looks good Douwe, can we see how the prototypes work for the continual learning part as well?

At this point I'm not sure we have enough time to push this new change as a major revision into the article, and I'd like to see more tests as well. The forward plans with the harness are to allow for many benchmarks to be included, so that one could find them all in one location and do whatever tests they would like. So, I think this more challenging keyword subset would be nice to add as something like an MSWC-short dataloader, which would enrich the FSCIL benchmark as a whole.

V0XNIHILI commented 1 month ago

[images: continual learning accuracy per session]

@jasonlyik here are the graphs for the continual learning part. Seems like exactly 5 characters strikes a good trade-off. What do you think? Also, I quite like the idea of MSWC-short!

Maxtimer97 commented 1 month ago

Hey Douwe, sorry for not reacting earlier, but this looks very nice!

Are these results for the CNN or the SNN? And do you see a big drop when converting the models to prototypes?

Then I agree, it would be nice to have it as an alternative version of the task called MSWC-short! And for best impact, we should also fix the sample selection ahead of time and create a new dataset for HuggingFace :)

And maybe there would be space in the paper's supplementary materials to describe it quickly?

Anyway thank you for the work!

V0XNIHILI commented 1 month ago

On the CNN, the M5!

@jasonlyik is there anything I can do for the supplementary material of the paper with this data?

jasonlyik commented 1 month ago

Great to see that this moves more in line with standard FSCIL tasks!

I have added these couple of sentences to the Methods section under the Keyword FSCIL task section:

Using shorter keyword subsets can lead to greater challenge in both base language training and continual class learning. The full list of chosen words for the task presented in this paper (longest words), as well as other potential subsets of short words for the same languages, can be found with dataset documentation through the harness.

@V0XNIHILI Could you package the script/data and send in a PR so that we can add it to the harness as well as auto-download from HuggingFace?

V0XNIHILI commented 1 month ago

Super, thanks a lot @jasonlyik! Yep, will do this!!