CristinaGHolgado / old-french-lemmatization

Methods to lemmatize Old French using different tools

Old French lemmatization

This repository includes a set of scripts for the lemmatization of medieval French using four different tools: TreeTagger, LGeRM, UDPipe (CRAN R package) and NLP Pie. Two different systems for combining the lemmatizations produced by these tools are provided as well. The goal is to assess the lemmatization and to analyze the advantages (and limits) these tools may offer, especially for each part-of-speech and for unknown words. A couple of test files are provided (the first sentence of each text). The work contained in this repository was carried out within the framework of a master's internship (M2 Linguistics-NLP at Strasbourg University) at the ATILF laboratory, as part of the ANR project Profiterole (PRocessing Old French Instrumented TExts for the Representation Of Language Evolution, ANR-16-CE38-0010). PROFITEROLE is a project financed by the French National Research Agency (ANR) focused on the syntactic aspects of Old French.

Additional information

Project repository: https://gitlab.huma-num.fr/lemmatisation-fro/bfm-lem
TALN 2021 article: https://gitlab.huma-num.fr/lemmatisation-fro/bfm-lem/-/blob/master/doc/taln-recital2021.pdf

Data source and corpus features

The annotated texts used for training and evaluation are part of the BFMGOLDLEM corpus and gather a total of 431 144 tagged and lemmatized forms. This corpus is part of the BFM (Base de Français Médiéval), an online database of medieval French texts covering the period from the 9th to the 15th century, whose total number of word occurrences amounts to 4.7 million.

The corpus used for this project is composed of two sources, of which the predominant one belongs to a single author (Chrétien de Troyes). It is thus a large corpus, but not a very diversified one (a single author, a single manuscript, a single genre). It has its own reference system of lemmas, which correspond for the most part to the entries of the Tobler-Lommatzsch (TL) dictionary, which favors older forms. The rest of the corpus has been lemmatized in the framework of the BFM and is much more diversified. Files use the CONLL-U format and include tokenized inflected forms, morphological labels based on Cattex 2009, and lemmas (Holgado, Lavrentev, and Constant 2021).

External libraries

pandas (1.0.1)
scikit-learn (0.23.0)
numpy (1.18.1)
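
The pinned versions can be installed in one step (an illustrative command; the repository's exact environment setup is not specified here):

```
pip install pandas==1.0.1 scikit-learn==0.23.0 numpy==1.18.1
```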

Lemma standardization

Before training and tagging, the source lemmas (DECT lemmas) were converted into DMF lemmas using the FROLEX lexicon (a table of equivalences between lemmas), according to the following procedure:

DECT lemmas are converted to DMF lemmas.
→ If no equivalence is available in the lexicon, they are converted to TL lemmas.
→ Failing that, they are converted to GDF or, ultimately, to BFM lemmas.

Lemmas that could not be converted are saved to an external file. The previous versions of converted lemmas are kept in the last column of the CONLL-U file, which contains the source lemma information (e.g. XmlId=w_CharretteKu_1|LemmaSrc=DECT if not converted (text source and id, lexicon source), or XmlId=w_CharretteKu_7|LemmaSrc=DMF|LemmaDECT=voloir if converted (text source and id, updated lexicon source, previous lemma)).
In the standardized corpus, 424 836 lemmas are thus DMF (98.54%), 4 512 DECT (1.04%), 965 BFM (0.22%), 801 TL (0.18%) and 30 GDF (0.01%).
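
A minimal sketch of this fallback chain, assuming FROLEX has been loaded as a dict mapping each DECT lemma to its available equivalents per target lexicon (the key names and data layout are illustrative, not the repository's actual code):

```python
# Fallback order described above: DMF first, then TL, then GDF, then BFM.
FALLBACK_ORDER = ("DMF", "TL", "GDF", "BFM")

def standardize(dect_lemma, frolex, unconverted):
    """Return (lemma, source_lexicon) for one DECT lemma.

    `frolex` maps a DECT lemma to a dict of equivalents, e.g.
    {"voloir1": {"DMF": "voloir"}} (hypothetical entries).
    Lemmas with no equivalence at all are collected in `unconverted`,
    which corresponds to the external file mentioned above.
    """
    equivalents = frolex.get(dect_lemma, {})
    for lexicon in FALLBACK_ORDER:
        if lexicon in equivalents:
            return equivalents[lexicon], lexicon
    unconverted.append(dect_lemma)
    return dect_lemma, "DECT"  # kept as-is, hence LemmaSrc=DECT in the MISC column

# Example: one lemma with a DMF equivalent, one with none.
frolex = {"voloir1": {"DMF": "voloir"}}
missing = []
print(standardize("voloir1", frolex, missing))  # ('voloir', 'DMF')
print(standardize("xyz", frolex, missing))      # ('xyz', 'DECT')
```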

Folder descriptions

a) The Normalisation_lemmes/ folder includes all files from this step:

files

folders

b) The Models/ folder includes the models generated by each tool for each test.

Preprocessing the corpus for training & tagging

After lemma standardization, the corpus files are divided into 10 test & train splits and pre-processed. More detailed information about the procedure and text features is available at the following site: Protocole de tests and section 3.2.
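
A minimal sketch of such a 10-fold split using scikit-learn (listed above); the actual splitting unit and procedure are defined in the test protocol linked above, so the sentence-level units below are only illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder units to split (e.g. sentence ids); the real protocol
# ("Protocole de tests") defines the actual splitting unit.
units = np.array([f"sent_{i}" for i in range(100)])

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(units)):
    train, test = units[train_idx], units[test_idx]
    # Here each fold's train and test sets would be written to disk.
    print(f"fold {fold}: {len(train)} train / {len(test)} test")
```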

Train data

A different preprocessing step is applied for each tool, depending on its input data specifications; each tool requires its own training data structure (see the sketch below for one example).
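
As an illustration for one tool, here is a minimal sketch converting a CONLL-U file into TreeTagger-style training input, assuming the documented train-tree-tagger inputs (a tagged training file of form/tag lines and a fullform lexicon of form and tag-lemma pairs); the repository's actual preprocessing scripts may differ:

```python
def conllu_rows(path):
    """Yield (form, tag, lemma) triples from a CONLL-U file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # CONLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            yield cols[1], cols[4], cols[2]  # form, Cattex tag (XPOS), lemma

def to_treetagger(conllu_path, train_path, lexicon_path):
    """Write a form/tag training file and a form -> tag lemma lexicon."""
    lexicon = {}
    with open(train_path, "w", encoding="utf-8") as train:
        for form, tag, lemma in conllu_rows(conllu_path):
            train.write(f"{form}\t{tag}\n")
            lexicon.setdefault(form, set()).add((tag, lemma))
    with open(lexicon_path, "w", encoding="utf-8") as lex:
        for form, entries in sorted(lexicon.items()):
            pairs = "\t".join(f"{t} {l}" for t, l in sorted(entries))
            lex.write(f"{form}\t{pairs}\n")
```

UDPipe, by contrast, trains directly on CONLL-U files, so no such conversion should be needed for it.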

Test data

Since the corpus files are already tokenized, the tokenization step is skipped. For illustrative purposes, the structure is the following:
1 Cil cil PRON PROdem PronType=Dem ... XmlId=w_CligesKu_1|LemmaSrc=DMF|LemmaDECT=cel
... ... ... ... ... ... ... ...
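
For tagging, the already-tokenized forms can be extracted from a test file in the above structure, one token per line (an illustrative helper, not the repository's script):

```python
def extract_forms(conllu_path, out_path):
    """Write the FORM column of a CONLL-U file, one token per line,
    keeping a blank line between sentences."""
    with open(conllu_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("#"):
                continue
            if not line:
                out.write("\n")  # sentence boundary
                continue
            out.write(line.split("\t")[1] + "\n")
```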

Training & Annotation