GenBench / genbench_cbt_2023

The official GenBench Collaborative Benchmarking Task repository 2023 (Archived)

[Task Submission] Cross Lingual Consistency (`cross_lingual_consistency`) #39

Closed Betswish closed 10 months ago

Betswish commented 11 months ago

[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from their training corpora. However, large variations are observed in the extent to which this knowledge generalizes across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA benchmark (Kassner et al., 2021) to construct a new benchmark, BMLAMA, in which the instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently of accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages but does not improve cross-lingual consistency. All code and data will be released on GitHub.
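As a rough illustration of the metric idea (not the repository's exact implementation), the sketch below computes a ranking-based consistency score between two languages: for each query, the candidate answers are ranked by model probability per language, and the top-j set overlaps are averaged with softmax weights that favour agreement on the top ranks. The function name and candidate data are hypothetical.

```python
import math

def rankc(rankings_l1, rankings_l2):
    """Sketch of a Ranking-based Consistency (RankC) score between two
    languages. Each argument is a list of ranked candidate lists (most
    probable candidate first), one per query; the two lists are assumed
    to be parallel and of equal length."""
    assert len(rankings_l1) == len(rankings_l2)
    total = 0.0
    for r1, r2 in zip(rankings_l1, rankings_l2):
        n = len(r1)
        # Softmax over reversed rank positions: w_1 > w_2 > ... > w_n,
        # so consistency on the top-ranked candidates counts most.
        weights = [math.exp(n - j) for j in range(1, n + 1)]
        z = sum(weights)
        score = 0.0
        for j in range(1, n + 1):
            # P@j: fraction of the top-j candidates shared by both languages.
            overlap = len(set(r1[:j]) & set(r2[:j])) / j
            score += (weights[j - 1] / z) * overlap
        total += score
    return total / len(rankings_l1)

# Hypothetical example: two queries, three ranked candidates each.
en = [["Paris", "Lyon", "Nice"], ["Rome", "Milan", "Turin"]]
es = [["Paris", "Nice", "Lyon"], ["Milan", "Rome", "Turin"]]
print(f"RankC(en, es) = {rankc(en, es):.3f}")
```

Note that the score depends only on how the candidate rankings agree across languages, not on whether the top candidate is factually correct, which is what lets RankC measure consistency independently of accuracy.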

Authors

Usage

Our evaluation function cannot be run in the default way, since we assess the factual knowledge generated by multilingual PLMs with our proposed RankC metric. Our task can be described in three steps: probing each language's factual knowledge with the balanced BMLAMA queries, ranking the candidate answers by the PLM's probabilities, and computing RankC over each language pair.
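For orientation, loading the task through the standard genbench-cbt entry points (`load_task`, `get_datasets_raw`) looks roughly like the sketch below; the `en`/`es` split keys reflect how we organise the raw data and are shown as an assumption for illustration, not the framework default.

```python
from genbench import load_task

# Load the task by its ID from the genbench-cbt task collection.
task = load_task("cross_lingual_consistency")

# Unlike the default single "test" split, this task keeps one raw
# dataset per language (keys shown here are an assumption).
datasets_raw = task.get_datasets_raw()
print(datasets_raw.keys())  # e.g. dict_keys(['en', 'es'])
```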

Implementation

Checklist

kazemnejad commented 10 months ago

@Betswish Please check the tests. It seems they're failing.

Betswish commented 10 months ago

@kazemnejad The required packages and usage examples are all updated now. The code passes `make fix-style` and `make check-quality` locally. But when running `genbench-cli test-task --id cross_lingual_consistency`, it shows the error below, because we split the raw dataset by language (`en` and `es`) rather than into a single `test` split.

```
>       assert set(task_sets) == set(datasets_raw.keys())
E       AssertionError: assert {'test'} == {'en', 'es'}
E         Extra items in the left set:
E         'test'
E         Extra items in the right set:
E         'en'
E         'es'
E         Use -v to get more diff

tests/test_task.py:79: AssertionError
```
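For readers following along, the assertion compares the split names declared in the task config against the keys returned by the task's raw-data loader; a stripped-down reconstruction of the mismatch (not the actual test code, values hypothetical) is:

```python
# The test derives the expected split names from the task config, while
# this task's get_datasets_raw() keys its output by language instead.
task_sets = {"test"}                              # splits declared in the config
datasets_raw = {"en": object(), "es": object()}   # raw data keyed by language
assert set(task_sets) == set(datasets_raw.keys()) # fails: {'test'} != {'en', 'es'}
```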