LR-POR / MorphoBr

Resources for morphological analysis of Portuguese
Apache License 2.0
24 stars 4 forks source link
brazilian-portuguese finite-state-automata finite-state-morphology finite-state-transducer finite-state-transducers morphological-analyser morphological-analysis morphology natural-language-processing syntactic-parsing

+TITLE: MorphoBr - resources for morphological analysis of Portuguese

+AUTHOR: Leonel F. de Alencar, Alexandre Rademaker, Bruno Cuconato

Unless otherwise stated, the present resources are derived from the dictionaries of the:

Entries are separated by class, each in its own directory. Entries are divided in small files, so that they can be previewed on GitHub and processed in less powerful computers. The repository structure is meant for development: if you would like a copy of the data, you should go to our [[https://github.com/LFG-PTBR/MorphoBr/releases][releases page]].

Code (and its documentation) employed in the resource's development can be found in the =tools/= directory.

The tagset used in the above projects was converted to a more mnemonical one which generally follows the notational conventions of the descriptive linguistic and finite-state morphology literatures, see the =TAGSET= file.

In order to use MorphoBr resources for morphological analysis in the context of syntactic parsing (e.g. with PorGram), they must be compiled into finite-state transducers. The easier way is to load and compile it with [[http://fomafst.github.io][Foma]]:

+begin_src bash

% ./compile.sh
% foma -f compile.foma
% echo "fortinho" | flookup -i morphobr.bin
fortinho    forte+N+DIM+M+SG
fortinho    forte+A+DIM+M+SG

% echo "comprei-o" | flookup -i morphobr.bin
comprei-o   comprar+V+ele.ACC.3.M.SG+PRF+1+SG

+end_src

For that, you need to:

  1. change the MAX_STACK value in the int_stack.c file in Foma source code to at least 9097152. Compile Foma.

See https://github.com/mhulden/foma/issues/146 and https://github.com/mhulden/foma/issues/130

Python modules as well as XFST and bash scripts for performing this task are available in the =tools/= folder. Inadequacies and errors found in the source dictionaries were corrected using these tools. See the respective incode documentation for a detailed description of the changes made to the source entries.

See our [[https://github.com/LFG-PTBR/MorphoBr/releases][releases page]].

See https://github.com/LFG-PTBR/MorphoBr/wiki#publicações. To cite this resource, please use http://www.periodicos.letras.ufmg.br/index.php/textolivre/article/view/14294 (first in the publications list).