CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Benchmarking #16

Open · kylebgorman opened 1 year ago

kylebgorman commented 1 year ago

We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks, for this.

Here is a list of shared tasks (and related papers) from which we can pull data:

The benchmark itself consists of two tables.

A single script should compute all KPI statistics and dump them out as a TSV. This table should include:
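As a rough sketch of what such a script might look like (the KPI list hasn't been settled; exact-match accuracy and the last-column TSV layout below are illustrative assumptions only):

```python
#!/usr/bin/env python
"""Computes KPI statistics and dumps them out as a TSV.

Sketch only: assumes gold and hypothesis files are TSVs whose last
column holds the target string, and uses exact-match accuracy as a
stand-in for the full KPI list.
"""

import argparse
import csv


def _targets(path: str) -> list[str]:
    with open(path) as source:
        return [row[-1] for row in csv.reader(source, delimiter="\t") if row]


def main(args: argparse.Namespace) -> None:
    gold = _targets(args.gold)
    hypo = _targets(args.hypo)
    assert len(gold) == len(hypo), "gold/hypothesis length mismatch"
    accuracy = sum(g == h for g, h in zip(gold, hypo)) / len(gold)
    with open(args.output, "w") as sink:
        writer = csv.writer(sink, delimiter="\t")
        writer.writerow(["dataset", "size", "accuracy"])
        writer.writerow([args.dataset, len(gold), f"{accuracy:.4f}"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--gold", required=True, help="path to gold TSV")
    parser.add_argument("--hypo", required=True, help="path to hypothesis TSV")
    parser.add_argument("--output", required=True, help="path for output TSV")
    main(parser.parse_args())
```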

While one could imagine a single script that performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense to have multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can then aggregate the non-ragged portions of all the JSON study reports into a single TSV. This table should include:

Then, a separate script is used to aggregate the non-ragged portions of the extant study observations.
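A minimal sketch of that aggregator, assuming each study script dumps a flat JSON object into a reports/ directory (the directory name, and defining "non-ragged" as scalar fields shared by every report, are my assumptions rather than settled design):

```python
#!/usr/bin/env python
"""Aggregates the non-ragged portions of JSON study reports into a TSV.

Sketch only: treats a field as non-ragged if it is scalar-valued and
present in every report; ragged fields (per-epoch curves, per-example
errors, etc.) are simply skipped.
"""

import csv
import glob
import json
import sys


def _is_scalar(value) -> bool:
    return isinstance(value, (str, int, float, bool))


def main() -> None:
    reports = []
    for path in sorted(glob.glob("reports/*.json")):
        with open(path) as source:
            reports.append(json.load(source))
    if not reports:
        sys.exit("No study reports found under reports/")
    # Keeps only the keys that are scalar-valued in every report.
    keys = sorted(
        key
        for key in reports[0]
        if all(key in report and _is_scalar(report[key]) for report in reports)
    )
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerow(keys)
    for report in reports:
        writer.writerow([report[key] for key in keys])


if __name__ == "__main__":
    main()
```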

Studies should include:

Putting this all together should make it easy for us to win relevant shared tasks. ;)

This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.

bonham79 commented 1 year ago

Had this thought the other night: what about the Google normalization tasks for English and Russian? (Not that we don't have enough already...)

kylebgorman commented 1 year ago

Our way of doing that (e.g., in Zhang et al. 2019 and earlier papers) was far more constrained than generalized sequence-to-sequence learning, so I think we'd basically have to implement an alternative "task", possibly with multiple layers of prediction, and that seems like a big lift to me.