Betswish closed this 10 months ago
@Betswish Please check the tests. It seems they're failing.
@kazemnejad The required packages and usage examples are all updated now. The code passed make fix-style and make check-quality locally.
But when running genbench-cli test-task --id cross_lingual_consistency, it shows the error below because we split the raw dataset into two languages.
> assert set(task_sets) == set(datasets_raw.keys())
E AssertionError: assert {'test'} == {'en', 'es'}
E Extra items in the left set:
E 'test'
E Extra items in the right set:
E 'en'
E 'es'
E Use -v to get more diff
tests/test_task.py:79: AssertionError
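The assertion comes from the test expecting the keys of the raw datasets to match the split names declared in the task config (here only 'test'), while the raw data is keyed by language ('en', 'es'). A minimal sketch of one way to reconcile the two, assuming the HuggingFace datasets library and hypothetical per-language JSONL files (the file and column names below are illustrative, not the task's real layout), is to merge the language subsets into a single 'test' split and keep the language as a column:

```python
# Hypothetical reconciliation: merge per-language raw data into one "test"
# split so that datasets_raw.keys() matches the task's declared split names.
from datasets import load_dataset, concatenate_datasets, DatasetDict

def build_raw_datasets():
    per_language = []
    for lang in ("en", "es"):
        # Assumed: one JSONL file of probing queries per language.
        ds = load_dataset("json", data_files=f"BMLAMA_{lang}.jsonl", split="train")
        # Preserve the language as a column instead of a top-level split key.
        per_language.append(ds.add_column("language", [lang] * len(ds)))
    return DatasetDict({"test": concatenate_datasets(per_language)})
```

Keeping the language as a column means the test's split check passes while the per-language grouping stays recoverable for the evaluation.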
[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we re-split the existing mLAMA dataset (Kassner et al., 2021) to construct a new benchmark, BMLAMA, in which the instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently of accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. All code and data will be released on GitHub.
Authors
j.qi@rug.nl
raquel.fernandez@uva.nl
a.bisazza@rug.nl
Usage
Our evaluation function needs to be run differently from the default pipeline, since we focus on assessing the factual knowledge generated by multilingual PLMs with our proposed RankC metric. Our task can be described in three steps:
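Independently of those steps, here is a rough sketch of the kind of cross-lingual consistency scoring that the RankC evaluation is built around. It is a deliberate simplification (top-1 agreement between two languages over a shared query set), not the paper's actual RankC formula, and all names and data structures are illustrative assumptions:

```python
# Illustrative simplification, not the official RankC implementation:
# for each query, rank the shared candidate set by model score in each
# language and measure how often the top-ranked candidates agree.
from typing import Dict, List

def top1_consistency(
    scores_l1: Dict[str, List[float]],  # query id -> candidate scores in language 1
    scores_l2: Dict[str, List[float]],  # query id -> candidate scores in language 2
) -> float:
    """Fraction of shared queries whose top-scored candidate matches across languages."""
    shared = set(scores_l1) & set(scores_l2)
    if not shared:
        return 0.0
    agree = sum(
        max(range(len(scores_l1[q])), key=scores_l1[q].__getitem__)
        == max(range(len(scores_l2[q])), key=scores_l2[q].__getitem__)
        for q in shared
    )
    return agree / len(shared)
```

As the name suggests, the actual RankC metric takes the full candidate rankings into account rather than only the top prediction.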
Implementation
Checklist: The task was verified with the genbench-cli test-task tool.