
Disambiguate is a tool for training and using state-of-the-art neural WSD models: https://arxiv.org/abs/1905.05677

disambiguate: Neural Word Sense Disambiguation Toolkit

This repository contains a set of easy-to-use tools for training, evaluating and using neural WSD models.

This is the implementation used in the article Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation, written by Loïc Vial, Benjamin Lecouteux and Didier Schwab.


Dependencies

To install Python, Java and Maven, you can use the package manager of your distribution (apt-get, pacman...).
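For example, on a Debian-based system (package names may differ on other distributions):

```bash
sudo apt-get install python3 default-jdk maven
```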

To install PyTorch, please follow the instructions on https://pytorch.org/.

To install AllenNLP (necessary if using ELMo), please follow the instructions on https://github.com/allenai/allennlp.

To install HuggingFace's pytorch-pretrained-BERT (necessary if using BERT), please follow the instructions on https://github.com/huggingface/pytorch-pretrained-BERT.

:metal: New :metal: To install HuggingFace's transformers (necessary if using any other language model supported by the transformers library, which also includes BERT), please follow the instructions on https://github.com/huggingface/transformers.
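For reference, the corresponding pip package names (assuming a standard pip setup; check each project's page for version constraints):

```bash
pip install torch                    # or use the installation selector on https://pytorch.org
pip install allennlp                 # only needed if using ELMo
pip install pytorch-pretrained-bert  # only needed if using BERT via the legacy package
pip install transformers             # BERT and the other supported language models
```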

To install UFSAC, clone the UFSAC repository and install its Java library with Maven, as sketched below.
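A minimal sketch of the install (assuming UFSAC's standard Maven layout; the UFSAC README is authoritative):

```bash
# Clone UFSAC and install its Java library into the local Maven repository
git clone https://github.com/getalp/UFSAC.git
cd UFSAC/java
mvn install
```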

Compilation

Once the dependencies are installed, please run ./java/compile.sh to compile the Java code.

Sense mappings

We provide the two sense mappings used in our paper as standalone files in the directory sense_mappings.

The files consist of 117659 lines (one line per synset): the left-hand ID is the original synset ID, and the right-hand ID is the ID of the associated group of synsets.

The file hypernyms_mapping.txt results from the sense compression method through hypernyms. The exact algorithm that was used is located in the method getSenseCompressionThroughHypernymsClusters() of the file java/src/main/java/getalp/wsd/utils/WordnetUtils.java.

The file all_relations_mapping.txt results from the sense compression method through all relations. The exact algorithm that was used is located in the method getSenseCompressionThroughAllRelationsClusters() of the file java/src/main/java/getalp/wsd/utils/WordnetUtils.java.
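For example, assuming the two-column whitespace-separated format described above, you can count how many distinct sense groups a mapping keeps:

```bash
# Count the distinct group IDs (second column) in the hypernyms mapping
awk '{print $2}' sense_mappings/hypernyms_mapping.txt | sort -u | wc -l
```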

Using pre-trained models

We are currently providing our best models trained on SemCor and the WordNet Gloss Corpus (WNGC), using BERT embeddings, with the vocabulary compression through the hypernymy/hyponymy relationships applied, as described in our article.

| Model | URL |
| --- | --- |
| SemCor + WNGC, hypernyms, single | https://zenodo.org/record/3759385 |
| SemCor + WNGC, hypernyms, ensemble | https://zenodo.org/record/3759301 |
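For example, to download and extract the single-model data (the archive name below is a hypothetical placeholder; take the actual file name from the Zenodo record):

```bash
# ARCHIVE is a hypothetical placeholder: use the file name listed on the Zenodo record
ARCHIVE=data_wsd.tar.gz
wget "https://zenodo.org/record/3759385/files/$ARCHIVE"
tar xf "$ARCHIVE"
DATADIR="$PWD/data_wsd"   # the extracted folder, referenced as $DATADIR below
```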

Once the data are downloaded and extracted, you can use the following commands (replace $DATADIR with the path of the appropriate folder):

Disambiguating raw text
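A minimal sketch (assuming the repository's decode.sh script and its --data_path/--weights arguments; the scripts also accept optional arguments not shown here):

```bash
# Reads tokenized text on stdin and writes sense-annotated text on stdout
# (argument names are assumptions; check the script for the actual interface)
echo "The mouse ran up the clock" | ./decode.sh --data_path "$DATADIR" --weights "$DATADIR"/model_weights_wsd0
```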

Evaluating a model
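Similarly for evaluation (assuming an evaluate.sh script with an additional --corpus argument taking UFSAC files; the corpus file name below is illustrative):

```bash
# Evaluates the given weights on one or more UFSAC evaluation corpora
# (script and argument names are assumptions; check the script for the actual interface)
./evaluate.sh --data_path "$DATADIR" --weights "$DATADIR"/model_weights_wsd0 \
              --corpus raganato_semeval2007.xml
```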


UFSAC corpora are available in the UFSAC repository. If you want to reproduce our results, please download UFSAC 2.1, and you will find the SemCor (file semcor.xml), the WordNet Gloss Tagged corpus (file wngt.xml) and all the SemEval/SensEval evaluation corpora that we used (files raganato_*.xml).

Training new WSD models

Preparing data

Call the ./prepare_data.sh script with your training corpora and sense mapping; a sketched invocation follows.
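The argument names below are assumptions (the script itself documents the actual interface):

```bash
# Builds the model configuration and vocabularies under --data_path
# from a UFSAC training corpus and a sense mapping (argument names are assumptions)
./prepare_data.sh --data_path ./data_wsd \
                  --train semcor.xml \
                  --sense_compression_file sense_mappings/hypernyms_mapping.txt
```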

Training a model (or an ensemble of models)

Call the ./train.sh script, pointing it at the prepared data; a sketched invocation follows.
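Again, the argument names are assumptions; check ./train.sh for the actual interface:

```bash
# Trains a single model; a larger ensemble count would train an ensemble of models
# (argument names are assumptions)
./train.sh --data_path ./data_wsd --model_path ./model_wsd --ensemble_count 1
```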

Citation

If you want to reference our paper, please use the following BibTeX snippet:


```bibtex
@InProceedings{vial-etal-2019-sense,
  author      = {Vial, Lo{\"i}c and Lecouteux, Benjamin and Schwab, Didier},
  title       = {{Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation}},
  booktitle   = {{Proceedings of the 10th Global Wordnet Conference}},
  year        = {2019},
  address     = {Wroc{\l}aw, Poland},
}
```