CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Benchmarking #16

Open · kylebgorman opened 1 year ago

kylebgorman commented 1 year ago

We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks, for this.

Here is a list of shared tasks (and related papers) from which we can pull data:

The benchmark itself consists of two tables.

A single script should compute all KPI statistics and dump them out as a TSV. This table should include:
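As a rough sketch of what such a script might look like (the KPI list hasn't been settled; exact-match accuracy and the last-column TSV layout below are illustrative assumptions only):

```python
#!/usr/bin/env python
"""Computes KPI statistics and dumps them out as a TSV.

Sketch only: assumes gold and hypothesis files are TSVs whose last
column holds the target string, and uses exact-match accuracy as a
stand-in for the full KPI list.
"""

import argparse
import csv


def _targets(path: str) -> list[str]:
    with open(path) as source:
        return [row[-1] for row in csv.reader(source, delimiter="\t") if row]


def main(args: argparse.Namespace) -> None:
    gold = _targets(args.gold)
    hypo = _targets(args.hypo)
    assert len(gold) == len(hypo), "gold/hypothesis length mismatch"
    accuracy = sum(g == h for g, h in zip(gold, hypo)) / len(gold)
    with open(args.output, "w") as sink:
        writer = csv.writer(sink, delimiter="\t")
        writer.writerow(["dataset", "size", "accuracy"])
        writer.writerow([args.dataset, len(gold), f"{accuracy:.4f}"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--gold", required=True, help="path to gold TSV")
    parser.add_argument("--hypo", required=True, help="path to hypothesis TSV")
    parser.add_argument("--output", required=True, help="path for output TSV")
    main(parser.parse_args())
```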

While one could imagine a single script that performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense to have multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can then aggregate the non-ragged portions of all the JSON study reports into a single TSV. This table should include:

Then, a separate script is used to aggregate the non-ragged portions of the extant study observations.
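A minimal sketch of that aggregator, assuming each study script dumps a flat JSON object into a reports/ directory (the directory name, and defining "non-ragged" as scalar fields shared by every report, are my assumptions rather than settled design):

```python
#!/usr/bin/env python
"""Aggregates the non-ragged portions of JSON study reports into a TSV.

Sketch only: treats a field as non-ragged if it is scalar-valued and
present in every report; ragged fields (per-epoch curves, per-example
errors, etc.) are simply skipped.
"""

import csv
import glob
import json
import sys


def _is_scalar(value) -> bool:
    return isinstance(value, (str, int, float, bool))


def main() -> None:
    reports = []
    for path in sorted(glob.glob("reports/*.json")):
        with open(path) as source:
            reports.append(json.load(source))
    if not reports:
        sys.exit("No study reports found under reports/")
    # Keeps only the keys that are scalar-valued in every report.
    keys = sorted(
        key
        for key in reports[0]
        if all(key in report and _is_scalar(report[key]) for report in reports)
    )
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerow(keys)
    for report in reports:
        writer.writerow([report[key] for key in keys])


if __name__ == "__main__":
    main()
```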

Studies should include:

Putting this all together should make it easy for us to win relevant shared tasks. ;)

This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.

bonham79 commented 1 year ago

Had this thought the other night: what about the Google normalization tasks for English and Russian? (Not that we don't have enough already...)

kylebgorman commented 1 year ago

Our way of doing that (e.g., in Zhang et al. 2019 and earlier papers) was far more constrained than generalized sequence-to-sequence learning, so I think we'd basically have to implement an alternative "task", possibly with multiple layers of prediction, and that seems like a big lift to me.