GEM-benchmark / GEM-metrics

Automatic metrics for GEM tasks
https://gem-benchmark.com
MIT License

Czech restaurant (and possibly DART) BLEU seems suspiciously low #27

Open · tuetschek opened this issue 3 years ago

tuetschek commented 3 years ago

BLEU of 2-3% is very low – maybe there's a problem with the references and/or the outputs (wrong order etc.)?

jordiclive commented 2 years ago

@tuetschek Yes, the BLEU for DART should be >45%, so it might be a good idea to update the v1 paper, since this is a bug.

I've been looking at DART, E2E Clean, and WEBNLG+ 2020, as I am familiar with them. I get close-to-zero scores when using a general submission. I'm not sure what the exact issue is, since I do pass in the keys.

One thing I noticed is that the E2E Clean test and validation sets contain repeated identical inputs, so there are 4693 examples rather than 1847 examples with multiple references each, which would affect the metric calculation.
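To illustrate the effect (a toy sketch with sacrebleu, not GEM-metrics code): scoring each flattened row against a single reference deflates corpus BLEU compared to scoring once against the grouped multi-reference set:

```python
import sacrebleu

# Toy data, not GEM data: one hypothesis with two grouped references.
# In sacrebleu, refs[k][i] is the k-th reference for the i-th hypothesis.
hyps = ["the cat sat on the mat"]
refs_grouped = [["the cat sat on the mat"], ["a cat is sitting on the mat"]]
print(sacrebleu.corpus_bleu(hyps, refs_grouped).score)  # exact match -> 100.0

# Flattened view: the same hypothesis duplicated, each row paired with a
# single reference, analogous to the 4693-row test set.
hyps_flat = ["the cat sat on the mat", "the cat sat on the mat"]
refs_flat = [["the cat sat on the mat", "a cat is sitting on the mat"]]
print(sacrebleu.corpus_bleu(hyps_flat, refs_flat).score)  # noticeably lower
```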

tuetschek commented 2 years ago

@jordiclive That's weird... yeah 4693 sounds like the multiple references are ignored? Can you please send me the submission file you're trying this with?

jordiclive commented 2 years ago

@tuetschek I think for E2E it's down to the datasets loading script. If you call datasets.load_dataset('GEM/e2e_nlg'), the data comes back flattened (1 reference per example). So we'd have to edit https://huggingface.co/datasets/GEM/e2e_nlg/blob/main/e2e_nlg.py ...
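You can see it directly (field names are assumed from the GEM schema; check the loader if they differ):

```python
import datasets
from collections import defaultdict

# The test split loads flattened: one reference per row.
e2e = datasets.load_dataset("GEM/e2e_nlg", split="test")
print(len(e2e))  # 4693 rows, 1 reference each

# Grouping rows by their input recovers the multi-reference view.
grouped = defaultdict(list)
for ex in e2e:
    grouped[ex["meaning_representation"]].append(ex["target"])
print(len(grouped))  # 1847 distinct inputs
```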

I know it is your dataset :) and as I used it for a paper, I'm happy to try and fix it. If you agree, I think it should load as 1847 examples with a varying number of references each? It might be easier to store the CSVs like that. For example, I know BigScience is trying to use E2E Clean, but each of the 4693 examples is scored independently against a single reference.

Thanks! As for the submission script, I would be very interested in why it doesn't work for these datasets. I will email you the general submission script I submitted recently, along with the separate reference files; scoring works as expected (including for DART) if you pass those in together with the predictions.

tuetschek commented 2 years ago

@jordiclive I'm sorry for the slow response – I got buried under teaching and admin, and now I'm trying to catch up a little. Are you still able to have a look at the HF loader for E2E? It would be great if you could fix it! I'll finally have a look at your email with the submission script.

tuetschek commented 2 years ago

@jordiclive OK, it looks like the problem is different for each set.

I'm not exactly sure how to solve DART & WebNLG – @sebastianGehrmann, do you know why DART is currently not used at all? And what do you propose we do with WebNLG – do we start looking at the references key, or do we fix it on the HF side as well? I don't see the point of having both target and references in this way; it's just confusing, and you should always want to use multi-ref 🤔.

jordiclive commented 2 years ago

@tuetschek No worries. Thanks for investigating!! lol quite coincidental.

jordiclive commented 2 years ago

@tuetschek I changed the E2E loading script here to use multi-references. @sebastianGehrmann can we merge this?

I investigated target and references: for each loading script, you have to define the types of the dataset 'features' once for all splits (there could be a way to define separate feature types for each subset). That is probably why both target and references were introduced: for training you want target to be a str, and references a List[str]. I actually see this in several of the GEM datasets. For training, the references are not meant to be used (empty lists), and for testing and validation, the target is not meant to be used.
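A minimal sketch of what I mean (field names are illustrative, modelled on the E2E-style schema, not the exact loader code):

```python
import datasets

# One feature schema is shared by all splits, so both the single-ref
# `target` and the multi-ref `references` fields must exist everywhere,
# even when one of them is left empty for a given split.
features = datasets.Features({
    "gem_id": datasets.Value("string"),
    "meaning_representation": datasets.Value("string"),
    "target": datasets.Value("string"),                         # filled for train
    "references": datasets.Sequence(datasets.Value("string")),  # filled for val/test
})
```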

I followed this format for E2E. Therefore, I think we should write some GEM code that starts looking at the references key – otherwise we would have to change lots of datasets on the HF side and break backward compatibility.

jordiclive commented 2 years ago

@tuetschek I made a patch that seems to solve the issue by looking for the references key and using references when available (#97). Hopefully that will also fix other datasets that use the references key, e.g. asset_turk, as well as DART and WebNLG.
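Roughly, the patch does something like this (a simplified sketch, not the exact GEM-metrics code):

```python
def get_references(example):
    """Prefer the multi-reference `references` field when it is non-empty;
    otherwise fall back to the single `target` (simplified sketch)."""
    refs = example.get("references") or []
    if refs:
        return refs
    target = example.get("target")
    return [target] if target else []
```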

For E2E, the problem stems from how the dataset is stored; if the fix here is merged, then the E2E test set in the references repo also needs to be changed: with this

tuetschek commented 2 years ago

@jordiclive Thanks for fixing everything! Will you get notified when/if the E2E fix gets merged? Can I support the merge somehow 😀?

jordiclive commented 2 years ago

@tuetschek I don't think I will – I'm not sure how to create MRs for these 😅. It's two separate HF dataset repos: GEM/e2e_nlg for the E2E fix, and references for changing the reference dataset.

tuetschek commented 2 years ago

@jordiclive I can see a "New Pull Request" button on the "Community" tab for both – is that the place? I'm not sure how you can turn your branch into a PR, but I guess it's possible using the command-line PR workflow?
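For reference, something like this might work with huggingface_hub (a sketch – I haven't tried it, and the exact arguments are an assumption):

```python
from huggingface_hub import HfApi

api = HfApi()
# Upload the locally edited loading script and open a Hub PR instead of
# pushing directly to the dataset repo.
api.upload_file(
    path_or_fileobj="e2e_nlg.py",   # locally edited copy of the loader
    path_in_repo="e2e_nlg.py",
    repo_id="GEM/e2e_nlg",
    repo_type="dataset",
    create_pr=True,                 # the PR then shows up under "Community"
)
```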

jordiclive commented 2 years ago

Thanks! I was struggling to change one file in references without downloading every single dataset. Hopefully we can close this issue once https://huggingface.co/datasets/GEM/references/discussions/1 is resolved.

tuetschek commented 2 years ago

@jordiclive That's great, thanks! I guess I'll still have to have a look at Czech restaurant, but I hope to do that in the next few days 🙂

tuetschek commented 2 years ago

OK, so far it looks like CS Restaurants also has problems with some files, so I opened another issue there (when the files are OK, the scores now seem fine).