boun-tabi / NLI-TR

🇹🇷 Click here for Turkish.

📜 NLI-TR

Natural Language Inference in Turkish (NLI-TR) is a set of two large-scale datasets obtained by translating the foundational NLI corpora (SNLI and MultiNLI) with Amazon Translate. The English sentences can be retrieved from the original corpora through the identifier key they share with the translations (pairID).

The characteristics of the datasets can be reviewed here, and the details of the NLI task are covered in the lecture videos of CS224U.

📜 SNLI-TR

SNLI-TR 1.0 and SNLI-TR 1.1 (~44MB, zip) are Turkish neural machine translations (NMT) of the original SNLI 1.0 (~100MB, zip). The only difference between versions 1.0 and 1.1 is that the latter includes an additional key field (translation_annotations) containing evaluations of the translations for some of the examples.
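Because only some records in version 1.1 carry the translation_annotations key, it can be useful to separate the evaluated examples from the rest. The following is a minimal sketch, assuming the data is read as JSON Lines (one JSON object per line). The helper name `split_by_annotation` and the two sample records are made up for illustration and are not part of the dataset or its tooling.

```python
import json


def split_by_annotation(lines):
    """Split JSON Lines records into (annotated, unannotated) lists."""
    annotated, unannotated = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Only evaluated examples are expected to have this key.
        if "translation_annotations" in record:
            annotated.append(record)
        else:
            unannotated.append(record)
    return annotated, unannotated


# Hypothetical sample input; real records contain more fields.
sample_lines = [
    '{"pairID": "a1", "gold_label": "neutral"}',
    '{"pairID": "b2", "gold_label": "entailment", '
    '"translation_annotations": {"translation_scores": [5, 4]}}',
]

annotated, unannotated = split_by_annotation(sample_lines)
print(len(annotated), "annotated,", len(unannotated), "unannotated")
```

The same function should work on an open file handle, since iterating over a text file also yields one line at a time.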

An example from SNLI:

{
        "annotator_labels": [
            "neutral"
        ],
        "captionID": "4688994030.jpg#3",
        "gold_label": "neutral",
        "pairID": "4688994030.jpg#3r1n",
        "sentence1": "A medical worker wearing a mask in the hospital.",
        "sentence1_binary_parse": "( ( ( A ( medical worker ) ) ( wearing ( ( a mask ) ( in ( the hospital ) ) ) ) ) . )",
        "sentence1_parse": "(ROOT (NP (NP (DT A) (JJ medical) (NN worker)) (VP (VBG wearing) (NP (NP (DT a) (NN mask)) (PP (IN in) (NP (DT the) (NN hospital))))) (. .)))",
        "sentence2": "A woman is in the hosptial working.",
        "sentence2_binary_parse": "( ( A woman ) ( ( is ( in ( the ( hosptial working ) ) ) ) . ) )",
        "sentence2_parse": "(ROOT (S (NP (DT A) (NN woman)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ hosptial) (NN working)))) (. .)))"
}

The corresponding Turkish translation in SNLI-TR:

{
        "annotator_labels": [
            "neutral"
        ],
        "captionID": "4688994030.jpg#3",
        "gold_label": "neutral",
        "pairID": "4688994030.jpg#3r1n",
        "sentence1": "Hastanede maske takan bir sağlık görevlisi.",
        "sentence2": "Hastanede çalışan bir kadın var."
}
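Since the Turkish record keeps the pairID of its English source, the two can be aligned with a simple dictionary lookup. The sketch below assumes both corpora have already been loaded into lists of dictionaries; the helper names `index_by_pair_id` and `align` are hypothetical, and the records are abridged from the example above.

```python
def index_by_pair_id(records):
    """Map each record's pairID to the record itself."""
    return {record["pairID"]: record for record in records}


def align(turkish_records, english_records):
    """Yield (turkish, english) record pairs that share a pairID."""
    english_index = index_by_pair_id(english_records)
    for tr in turkish_records:
        en = english_index.get(tr["pairID"])
        if en is not None:  # skip records with no English counterpart
            yield tr, en


# Abridged records taken from the SNLI / SNLI-TR example.
english_records = [{
    "pairID": "4688994030.jpg#3r1n",
    "gold_label": "neutral",
    "sentence1": "A medical worker wearing a mask in the hospital.",
    "sentence2": "A woman is in the hosptial working.",
}]
turkish_records = [{
    "pairID": "4688994030.jpg#3r1n",
    "gold_label": "neutral",
    "sentence1": "Hastanede maske takan bir sağlık görevlisi.",
    "sentence2": "Hastanede çalışan bir kadın var.",
}]

pairs = list(align(turkish_records, english_records))
for tr, en in pairs:
    print(en["sentence1"], "->", tr["sentence1"])
```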

🏷 SNLI-TR License

SNLI-TR is licensed under the same terms as SNLI, namely the Creative Commons Attribution-ShareAlike 4.0 International License.


📜 MultiNLI-TR

MultiNLI-TR 1.0 and MultiNLI-TR 1.1 (~72MB, zip) are Turkish neural machine translations (NMT) of the original MultiNLI 1.0 corpus (~216MB, zip). The only difference between versions 1.0 and 1.1 is that the latter includes an additional key field (translation_annotations) containing evaluations of the translations for some of the examples.

An example from MultiNLI:

{
        "annotator_labels": [
            "neutral"
        ],
        "genre": "telephone",
        "gold_label": "neutral",
        "pairID": "48889n",
        "promptID": "48889",
        "sentence1": "and uh then you'd be willing to give up your job to stay home and with or stay with the children",
        "sentence1_binary_parse": "( and ( uh ( then ( you ( 'd ( be ( willing ( to ( ( give up ) ( your ( job ( to ( ( ( stay ( ( home and ) with ) ) or ) ( stay ( with ( the children ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )",
        "sentence1_parse": "(ROOT (FRAG (CC and) (NP (NP (NNP uh)) (SBAR (S (ADVP (RB then)) (NP (PRP you)) (VP (MD 'd) (VP (VB be) (ADJP (JJ willing) (S (VP (TO to) (VP (VB give) (PRT (RP up)) (NP (PRP$ your) (NN job) (S (VP (TO to) (VP (VP (VB stay) (UCP (ADVP (RB home)) (CC and) (PP (IN with)))) (CC or) (VP (VB stay) (PP (IN with) (NP (DT the) (NNS children)))))))))))))))))))",
        "sentence2": "Is your dream to stay at home?",
        "sentence2_binary_parse": "( ( ( Is your ) ( dream ( to ( stay ( at home ) ) ) ) ) ? )",
        "sentence2_parse": "(ROOT (SQ (VBZ Is) (NP (PRP$ your)) (NP (NP (NN dream)) (S (VP (TO to) (VP (VB stay) (PP (IN at) (NP (NN home))))))) (. ?)))"
}

The corresponding Turkish translation in MultiNLI-TR:


{
        "annotator_labels": [
            "neutral"
        ],
        "genre": "telephone",
        "gold_label": "neutral",
        "pairID": "48889n",
        "promptID": "48889",
        "sentence1": "Ve o zaman evde kalmak ya da çocuklarla kalmak için işinden vazgeçersin.",
        "sentence2": "Hayaliniz evde kalmak mı?"
}

🏷 MultiNLI-TR License

MultiNLI-TR is licensed under the same terms as MultiNLI, which are described in the MultiNLI paper.

✔ Annotations

Annotated examples carry their annotations inside the translation_annotations key field, as shown in the following example from SNLI-TR.

{
        "annotator_labels": [
            "entailment"
        ],
        "captionID": "6925887658.jpg#3",
        "gold_label": "entailment",
        "pairID": "6925887658.jpg#3r1e",
        "sentence1": "İki futbolcu topu almak için yarışıyor.",
        "sentence2": "İki futbolcu bir futbol maçında oynuyor",
        "translation_annotations": {
            "annotator_ids": [
                1,2,5,8,10
            ],
            "annotator_labels": [
                "entailment","entailment","entailment","entailment","entailment"
            ],
            "translation_scores": [
                5,5,4,5,5
            ],
            "gold_label": "entailment"
        }
}

The descriptions of the keys inside translation_annotations are listed below.

The corresponding pair in the SNLI dataset is shown here for comparison:

{
   "annotator_labels":[
      "entailment"
   ],
   "captionID":"6925887658.jpg#3",
   "gold_label":"entailment",
   "pairID":"6925887658.jpg#3r1e",
   "sentence1":"Two soccer players are vying for the ball.",
   "sentence1_binary_parse":"( ( Two ( soccer players ) ) ( ( are ( vying ( for ( the ball ) ) ) ) . ) )",
   "sentence1_parse":"(ROOT (S (NP (CD Two) (NN soccer) (NNS players)) (VP (VBP are) (VP (VBG vying) (PP (IN for) (NP (DT the) (NN ball))))) (. .)))",
   "sentence2":"Two soccer players are playing in a soccer match",
   "sentence2_binary_parse":"( ( Two ( soccer players ) ) ( are ( playing ( in ( a ( soccer match ) ) ) ) ) )",
   "sentence2_parse":"(ROOT (S (NP (CD Two) (NN soccer) (NNS players)) (VP (VBP are) (VP (VBG playing) (PP (IN in) (NP (DT a) (NN soccer) (NN match)))))))"
} 
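The fields inside translation_annotations lend themselves to simple per-example summaries. The sketch below is only an illustration: it assumes translation_scores are numeric quality ratings and compares the most frequent annotator label with the record's top-level gold_label. The helper name `summarize_annotations` is hypothetical, and nothing here claims to reproduce how the gold labels were actually derived.

```python
from collections import Counter
from statistics import mean


def summarize_annotations(record):
    """Summarize the translation_annotations of one record, if present."""
    ann = record.get("translation_annotations")
    if ann is None:
        return None  # example was not evaluated
    label_counts = Counter(ann["annotator_labels"])
    most_common_label, votes = label_counts.most_common(1)[0]
    return {
        "num_annotators": len(ann["annotator_ids"]),
        "mean_translation_score": mean(ann["translation_scores"]),
        "most_common_label": most_common_label,
        "votes_for_most_common": votes,
        # Assumption: the top-level gold_label is the label of the source pair.
        "matches_top_level_label": most_common_label == record["gold_label"],
    }


# Record abridged from the SNLI-TR example above.
record = {
    "gold_label": "entailment",
    "pairID": "6925887658.jpg#3r1e",
    "translation_annotations": {
        "annotator_ids": [1, 2, 5, 8, 10],
        "annotator_labels": ["entailment"] * 5,
        "translation_scores": [5, 5, 4, 5, 5],
        "gold_label": "entailment",
    },
}

summary = summarize_annotations(record)
print(summary)
```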

📚 Resources

📖 Download NLI-TR

🔗 Links

🤗 HuggingFace datasets

from datasets import load_dataset

snli_tr_dataset = load_dataset('nli_tr', 'snli_tr')
multinli_tr_dataset = load_dataset('nli_tr', 'multinli_tr')

🔬 Reproducibility

You can find all code, models and samples of the input data here. Please feel free to reach out to us if you have any specific questions.

✒ Citation

Emrah Budur, Rıza Özçelik, Tunga Güngör and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [pdf] [bib]

@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}z\c{c}elik, R{\i}za and
      G\"{u}ng\"{o}r, Tunga and
      Potts, Christopher",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}

❤ Acknowledgment

This research was supported by the AWS Cloud Credits for Research Program (formerly AWS Research Grants). We thank Alara Dirik, Almira Bağlar, Berfu Buyüköz, Berna Erden, Fatih Mehmet Güler, Gökçe Uludoğan, Gözde Aslantaş, Havva Yüksel, Melih Barsbey, Melike Esma İlter, Murat Karademir, Ramazan Pala, Selen Parlar, Tuğçe Ulutuğ, and Utku Yavuz for their annotation support and vital contributions. We are also grateful to Stefan Schweter and Kemal Oflazer for sharing the dataset that BERTurk was trained on; to Omar Khattab, Dallas Card, Yiwei Luo, and many other distinguished researchers from the Stanford NLP Lab for their valuable advice and discussion; and to the anonymous reviewers for their insightful comments and feedback.

📧 Contact

Please feel free to contact Emrah Budur and Rıza Özçelik with any questions, comments, or feedback.

📢 Announcements

🎮 (2020-11-18) Word Contest! A surprise 🎁 is waiting for the lucky winner, who will be chosen from among those replying to or retweeting this tweet with the life-saving words inside our GitHub page 👆 until 23:59 Nov 20, 2020 (UTC)!

💬 (2020-11-18) Gather.town session of our paper will be held in Room B (Session 5B) on November 18, 2020 between 18:00-20:00 (UTC) at EMNLP 2020. Looking forward to your questions on our paper!

🎯 (2020-06-01) A new version of our work, titled "Data and Representation for Turkish Natural Language Inference", was submitted to EMNLP 2020.

🐣 (2020-09-14) Our paper has been accepted as a long paper at EMNLP 2020 🙂 Grateful to everyone who supported our work along the way 🤗