This repository accompanies the paper *Fast and unsupervised methods for multilingual cognate clustering* by Rama, Wahle, Sofroniev, and Jäger. It contains both the data and the source code used in the paper's experiments.
The code is developed using Python 3.5, but later versions should also work as long as the dependencies are satisfied.
```sh
# clone this repository
git clone https://github.com/evolaemp/online_cognacy_ident
cd online_cognacy_ident

# you do not need to create a virtual environment if you know what you are
# doing; remember that the code is written in Python 3
virtualenv meta/venv
source meta/venv/bin/activate

# install the dependencies
# it is important to use the versions specified in the requirements file
pip install -r requirements.txt

# check the unit tests
python -m unittest discover online_cognacy_ident
```
If you run into difficulties, please make sure you have tried the setup in a fresh virtual environment before opening an issue.
Installing `python-igraph` on Windows may not work via `pip install python-igraph`. Please use the appropriate Windows binaries from Christoph Gohlke's site.
```sh
# activate the virtual env if it is not already active
source meta/venv/bin/activate

# ensure the reproducibility of the results
export PYTHONHASHSEED=42

# use train.py to train models
python train.py --help

# use run.py to apply trained models on datasets
python run.py --help

# use eval.py to evaluate the algorithms' output
python eval.py --help
```
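A typical end-to-end session might look like the sketch below. The algorithm name (`pmi`), the model path, and the argument order are illustrative assumptions, not the scripts' documented interface; consult each script's `--help` output for the actual arguments.

```sh
# hypothetical end-to-end run; the algorithm name, model path, and
# argument order are assumptions, check each script's --help output
export PYTHONHASHSEED=42
mkdir -p output
python train.py pmi datasets/mayan.csv mayan.model
python run.py mayan.model datasets/mayan.csv > output/mayan.csv
python eval.py datasets/mayan.csv output/mayan.csv
```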
A dataset should be in CSV format. You can specify the CSV dialect using the `--dialect-input` option; possible values are `excel`, `excel-tab`, and `unix`. If this option is omitted, the script will try to guess the dialect by looking at the file extension.
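The three dialect names are the ones registered by Python's built-in `csv` module. Here is a minimal sketch of the extension-based guessing described above; the `guess_csv_dialect` helper and its extension mapping are assumptions for illustration, not the repository's actual code:

```python
import csv
import os

def guess_csv_dialect(path):
    """Guess a csv dialect from the file extension (illustrative only)."""
    extension = os.path.splitext(path)[1].lower()
    if extension in ('.tsv', '.tab'):
        return 'excel-tab'
    return 'excel'  # default for .csv and unknown extensions

# 'excel', 'excel-tab', and 'unix' are all registered by the csv module
with open('datasets/mayan.csv', newline='') as f:
    rows = list(csv.reader(f, dialect=guess_csv_dialect('datasets/mayan.csv')))
```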
A dataset should have a header with at least the following columns: `doculect` or `language`, `concept` or `gloss`, and `asjp` or `transcription`. Column name detection is case-insensitive. If there are two or more words tied to a single gloss in a given doculect, all but the first are ignored.
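To make the expected format concrete, here is a rough sketch of a reader that honours these rules, including case-insensitive column detection and the first-word-per-gloss policy. The function and constant names are hypothetical, not the repository's API:

```python
import csv

# accepted (case-insensitive) header names for each logical column
COLUMN_ALIASES = {
    'doculect': ('doculect', 'language'),
    'concept': ('concept', 'gloss'),
    'asjp': ('asjp', 'transcription'),
}

def read_dataset(path, dialect='excel'):
    """Yield (doculect, concept, transcription) triples, keeping only
    the first word per (doculect, concept) pair (illustrative only)."""
    seen = set()
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f, dialect=dialect)
        header = [cell.strip().lower() for cell in next(reader)]
        index = {key: next(header.index(name) for name in names if name in header)
                 for key, names in COLUMN_ALIASES.items()}
        for row in reader:
            pair = (row[index['doculect']], row[index['concept']])
            if pair in seen:
                continue  # all but the first word per gloss are ignored
            seen.add(pair)
            yield pair + (row[index['asjp']],)
```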
The datasets used in the paper's experiments can be found in the `datasets` directory.
| dataset | language families | transcription | source |
|---|---|---|---|
| `abvd` | Austronesian | ipa | Greenhill et al., 2008 |
| `afrasian` | Afro-Asiatic | asjp | Militarev, 2000 |
| `bai` | Sino-Tibetan | ipa | Wang, 2006 |
| `chinese_1964` | Sino-Tibetan | ipa | Běijīng Dàxué, 1964 |
| `chinese_2004` | Sino-Tibetan | ipa | Hóu, 2004 |
| `huon` | Trans-New Guinea | asjp | McElhanon, 1967 |
| `ielex` | Indo-European | ipa | Dunn, 2012 |
| `japanese` | Japonic | ipa | Hattori, 1973 |
| `kadai` | Tai-Kadai | asjp | Peiros, 1998 |
| `kamasau` | Torricelli | asjp | Sanders, 1980 |
| `lolo_burmese` | Sino-Tibetan | asjp | Peiros, 1998 |
| `mayan` | Mayan | asjp | Brown, 2008 |
| `miao_yao` | Hmong-Mien | asjp | Peiros, 1998 |
| `mixe_zoque` | Mixe-Zoque | asjp | Cysouw et al., 2006 |
| `mon_khmer` | Austroasiatic | asjp | Peiros, 1998 |
| `ob_ugrian` | Uralic | ipa | Zhivlov, 2011 |
| `tujia` | Sino-Tibetan | ipa | Starostin, 2013 |
Please note that you should use the `--ipa` flag when running the algorithms on any IPA-transcribed dataset, including the ones found in the `datasets` dir.
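For example (the positional arguments are again assumptions; see `python run.py --help` for the real interface):

```sh
# ielex is IPA-transcribed, so it needs the --ipa flag;
# the positional arguments are assumptions, check run.py --help
python run.py --ipa some.model datasets/ielex.csv
```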
If you have the fish shell installed, you can invoke `./run_all.fish`, which runs both algorithms on all the datasets, saves the results in the `output` dir, and prints the evaluation scores to stdout.
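If fish is not available, a rough bash stand-in might look like the loop below. The script arguments, model naming, and output naming are all assumptions; adapt them from `run_all.fish` itself, and remember the `--ipa` flag for the IPA-transcribed datasets.

```sh
# hypothetical bash stand-in for run_all.fish; the script arguments and
# file naming are assumptions, adapt them from the fish script itself
export PYTHONHASHSEED=42
mkdir -p output
for dataset in datasets/*.csv; do
    name=$(basename "$dataset" .csv)
    python run.py "$name.model" "$dataset" > "output/$name.csv"
    python eval.py "$dataset" "output/$name.csv"
done
```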
The datasets are published under a Creative Commons Attribution-ShareAlike 4.0 International License. The source code is published under the MIT License (see the `LICENSE` file).