Betswish closed this 10 months ago
@Betswish Please check the tests. It seems they're failing.
@kazemnejad The required packages and usage examples are all updated now. The code passed make fix-style and make check-quality locally.
But when running genbench-cli test-task --id cross_lingual_consistency, it shows the error below because we split the raw dataset into two languages.
> assert set(task_sets) == set(datasets_raw.keys())
E AssertionError: assert {'test'} == {'en', 'es'}
E Extra items in the left set:
E 'test'
E Extra items in the right set:
E 'en'
E 'es'
E Use -v to get more diff
tests/test_task.py:79: AssertionError
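The assertion comes from the test expecting the keys of the raw datasets to match the split names declared in the task config (here only 'test'), while the raw data is keyed by language ('en', 'es'). A minimal sketch of one way to reconcile the two, assuming the HuggingFace datasets library and hypothetical per-language JSONL files (the file and column names below are illustrative, not the task's real layout), is to merge the language subsets into a single 'test' split and keep the language as a column:

```python
# Hypothetical reconciliation: merge per-language raw data into one "test"
# split so that datasets_raw.keys() matches the task's declared split names.
from datasets import load_dataset, concatenate_datasets, DatasetDict

def build_raw_datasets():
    per_language = []
    for lang in ("en", "es"):
        # Assumed: one JSONL file of probing queries per language.
        ds = load_dataset("json", data_files=f"BMLAMA_{lang}.jsonl", split="train")
        # Preserve the language as a column instead of a top-level split key.
        per_language.append(ds.add_column("language", [lang] * len(ds)))
    return DatasetDict({"test": concatenate_datasets(per_language)})
```

Keeping the language as a column means the test's split check passes while the per-language grouping stays recoverable for the evaluation.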
[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we re-split the existing mLAMA dataset (Kassner et al., 2021) to construct a new benchmark, BMLAMA, in which the instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently of accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. All code and data will be released on GitHub.
Authors
j.qi@rug.nl
raquel.fernandez@uva.nl
a.bisazza@rug.nl
Usage
Our evaluation function needs to be run differently from the default pipeline, since we focus on assessing the factual knowledge generated by multilingual PLMs with our proposed RankC metric. Our task can be described in three steps:
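Independently of those steps, here is a rough sketch of the kind of cross-lingual consistency scoring that the RankC evaluation is built around. It is a deliberate simplification (top-1 agreement between two languages over a shared query set), not the paper's actual RankC formula, and all names and data structures are illustrative assumptions:

```python
# Illustrative simplification, not the official RankC implementation:
# for each query, rank the shared candidate set by model score in each
# language and measure how often the top-ranked candidates agree.
from typing import Dict, List

def top1_consistency(
    scores_l1: Dict[str, List[float]],  # query id -> candidate scores in language 1
    scores_l2: Dict[str, List[float]],  # query id -> candidate scores in language 2
) -> float:
    """Fraction of shared queries whose top-scored candidate matches across languages."""
    shared = set(scores_l1) & set(scores_l2)
    if not shared:
        return 0.0
    agree = sum(
        max(range(len(scores_l1[q])), key=scores_l1[q].__getitem__)
        == max(range(len(scores_l2[q])), key=scores_l2[q].__getitem__)
        for q in shared
    )
    return agree / len(shared)
```

As the name suggests, the actual RankC metric takes the full candidate rankings into account rather than only the top prediction.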
Implementation
Checklist: The task was verified with the genbench-cli test-task tool.