LR-POR / MorphoBr

Resources for morphological analysis of Portuguese
Apache License 2.0

compiling a finite-state transducer from the dict files #130

Closed leoalenc closed 1 year ago

leoalenc commented 1 year ago

The goal of this issue is to update the script tools/fst/BuildSpacedText.py to Python 3. The script converts two-column files with the dict extension into spaced-text files, so that a lexical transducer can be compiled from the MorphoBr files. At present the script does not work with Python 3.
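
As an illustration of the conversion this script performs, here is a minimal Python 3 sketch. It is not the repository's BuildSpacedText.py. It assumes each dict line has the layout `form<TAB>lemma+TAG+TAG...`, that every `+TAG` should become one multi-character symbol, and that a spaced-text entry puts the analysis (upper side) on the first line and the surface form (lower side) on the second, with a blank line between entries. The function names are hypothetical.

```python
import unicodedata


def dict_line_to_spaced_text(line: str) -> str:
    """Turn one 'form<TAB>lemma+TAG+...' line into a two-line spaced-text entry."""
    # Normalize to NFC so that an accented letter stays a single symbol.
    line = unicodedata.normalize("NFC", line.rstrip("\n"))
    form, analysis = line.split("\t", 1)
    lemma, *tags = analysis.split("+")
    # Upper side: the lemma letter by letter, then each tag as one symbol.
    upper = list(lemma) + ["+" + tag for tag in tags]
    # Lower side: the inflected form letter by letter.
    lower = list(form)
    return " ".join(upper) + "\n" + " ".join(lower) + "\n"


def convert(lines) -> str:
    """Convert many dict lines, separating entries with a blank line."""
    return "\n".join(dict_line_to_spaced_text(l) for l in lines if l.strip())


if __name__ == "__main__":
    print(convert(["aviãozinho\tavião+N+DIM+M+SG\n"]), end="")
```

Entries whose analysis does not follow the assumed `lemma+TAG` layout would need extra handling that this sketch does not attempt.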

leoalenc commented 1 year ago

@arademaker, to compile the whole Morpho into a lexical transducer without running into an error from exceeding the maximum amount of memory, run, for example:

~/MorphoBr/nouns$ BuildFomaFSTFromPairs.sh nouns-*.dict

381.2 kB. 11248 states, 24305 arcs, 63096 paths.
Writing to file tmp257530/nouns-a.dict.stxt.fst.
183.5 kB. 5535 states, 11644 arcs, 24197 paths.
Writing to file tmp257530/nouns-b.dict.stxt.fst.
403.4 kB. 11783 states, 25719 arcs, 63416 paths.
Writing to file tmp257530/nouns-c.dict.stxt.fst.
179.7 kB. 5955 states, 11412 arcs, 26226 paths.
Writing to file tmp257530/nouns-d.dict.stxt.fst.
247.0 kB. 7386 states, 15720 arcs, 41435 paths.
[...]
Writing to file tmp257530/nouns-y.dict.stxt.fst.
39.2 kB. 1310 states, 2430 arcs, 4182 paths.
Writing to file tmp257530/nouns-z.dict.stxt.fst.
Value of i nouns-z.dict.stxt
Value of i nouns-z.dict.stxt
381.2 kB. 11248 states, 24305 arcs, 63096 paths.
183.5 kB. 5535 states, 11644 arcs, 24197 paths.
403.4 kB. 11783 states, 25719 arcs, 63416 paths.
179.7 kB. 5955 states, 11412 arcs, 26226 paths.
247.0 kB. 7386 states, 15720 arcs, 41435 paths.
153.9 kB. 4617 states, 9755 arcs, 22215 paths.
139.5 kB. 4087 states, 8837 arcs, 20421 paths.
111.8 kB. 3939 states, 7066 arcs, 14254 paths.
134.2 kB. 4426 states, 8502 arcs, 20833 paths.
55.3 kB. 2016 states, 3452 arcs, 4605 paths.
7.1 kB. 258 states, 373 arcs, 275 paths.
131.9 kB. 4035 states, 8348 arcs, 17560 paths.
260.5 kB. 7989 states, 16577 arcs, 37385 paths.
78.5 kB. 2668 states, 4933 arcs, 9281 paths.
80.5 kB. 2661 states, 5060 arcs, 10570 paths.
329.7 kB. 10195 states, 21002 arcs, 48631 paths.
45.7 kB. 1584 states, 2835 arcs, 4547 paths.
160.9 kB. 4718 states, 10205 arcs, 24591 paths.
229.1 kB. 7262 states, 14567 arcs, 31634 paths.
203.3 kB. 5973 states, 12919 arcs, 28959 paths.
42.5 kB. 1462 states, 2638 arcs, 4780 paths.
93.2 kB. 2950 states, 5873 arcs, 11627 paths.
9.1 kB. 368 states, 510 arcs, 444 paths.
28.5 kB. 955 states, 1744 arcs, 2787 paths.
3.8 kB. 136 states, 184 arcs, 130 paths.
39.2 kB. 1310 states, 2430 arcs, 4182 paths.
40.8 kB. 1375 states, 2525 arcs, unknown number of paths.
[...]
Writing to file foma257530.fst.
Writing AT&T file: foma257530.att
Reading AT&T file: foma257530.att
2.7 MB. 71023 states, 174706 arcs, 538081 paths.

The script generates the transducer both in binary format and in the plain-text AT&T (att) format.

~/MorphoBr/nouns$ foma

Foma, version 0.9.18alpha (svn r241) Copyright © 2008-2015 Mans Hulden [...]

Type "help" to list all commands available.
Type "help <topic>" or help "<operator>" for further help.

foma[0]: read att foma257530.att
Reading AT&T file: foma257530.att
2.7 MB. 71023 states, 174706 arcs, 538081 paths.

foma[1]: up aviãozinho

avião+N+DIM+M+SG

foma[1]:

The script compiles several smaller transducers and, at the end, takes their union.
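
The compile-then-union strategy can be sketched as a small Python function that assembles a Foma script. This is an illustrative reconstruction, not the contents of BuildFomaFSTFromPairs.sh. The Foma commands it emits (`read spaced-text`, `save stack`, `clear stack`, `load stack`, `union net`, `write att`) are written as I understand the Foma interface and should be checked against `help` in your Foma version. The function name and the file-naming scheme are hypothetical.

```python
from pathlib import Path


def build_foma_script(stxt_files, tmp_dir, out_stem):
    """Return the text of a Foma script: compile each file, then union them."""
    lines = []
    fst_files = []
    # Phase 1: compile every spaced-text file on its own, save the result,
    # and clear the stack so only one small network is in memory at a time.
    for stxt in stxt_files:
        fst = Path(tmp_dir) / (Path(stxt).name + ".fst")
        lines += [f"read spaced-text {stxt}", f"save stack {fst}", "clear stack"]
        fst_files.append(fst)
    # Phase 2: reload the saved networks one by one, taking the union after
    # each load so the stack never holds more than two networks.
    for index, fst in enumerate(fst_files):
        lines.append(f"load stack {fst}")
        if index > 0:
            lines.append("union net")
    # Phase 3: write the final network in binary and AT&T text formats.
    lines += [f"save stack {out_stem}.fst", f"write att {out_stem}.att"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_foma_script(["nouns-a.dict.stxt", "nouns-b.dict.stxt"],
                            "tmp", "nouns"), end="")
```

Issuing `union net` after each load keeps the sketch correct whether that command combines only the top two networks or the whole stack. The generated text could then be saved to a file and passed to Foma as a script.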

arademaker commented 1 year ago

The scripts do not work for me, probably because they are not very robust with respect to directory references and file names. Is there any special reason for splitting the process into the steps below? Can I merge everything into a single script?

BuildFomaFSTFromPairs.sh, which calls

  1. BuildSpacedTextFromFiles.sh, which calls BuildSpacedText.py
  2. BuildFomaFSTFromSpacedText.sh

leoalenc commented 1 year ago

@arademaker, https://github.com/LR-POR/MorphoBr/commit/11dfc159be6753e8c8b85de2578bb5f88a1a6321 should fix the problems you ran into. I ran this test:

$ BuildFomaFSTFromPairs.sh verbs-[ef].dict
163.5 kB. 3432 states, 10356 arcs, 272067 paths.
Writing to file tmp370446/verbs-e.dict.stxt.fst.
55.7 kB. 1310 states, 3458 arcs, 68054 paths.
Writing to file tmp370446/verbs-f.dict.stxt.fst.
Value of i verbs-f.dict.stxt
163.5 kB. 3432 states, 10356 arcs, 272067 paths.
55.7 kB. 1310 states, 3458 arcs, 68054 paths.
198.9 kB. 4147 states, 12622 arcs, unknown number of paths.
Writing to file foma370446.fst.
Writing AT&T file: foma370446.att
Reading AT&T file: foma370446.att
198.9 kB. 4147 states, 12622 arcs, 340121 paths.
$ wc -l !$
wc -l verbs-[ef].dict
  272067 verbs-e.dict
   68054 verbs-f.dict
  340121 total
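
In this test the path count of the final transducer (340121) matches the total number of lines reported by `wc -l`. A check of that kind could be automated along the following lines. This is a sketch with hypothetical function names. It assumes Foma's summary lines have the shape shown in the log above, and that every dict line is a distinct form/analysis pair, since duplicate lines would make the two numbers differ.

```python
import re

# Matches summary lines such as "198.9 kB. 4147 states, 12622 arcs, 340121 paths."
_SUMMARY = re.compile(r"(\d+) states, (\d+) arcs, (\d+) paths")


def paths_from_summary(line):
    """Return the path count from a Foma summary line, or None if not given."""
    match = _SUMMARY.search(line)
    # Lines saying "unknown number of paths" do not match and yield None.
    return int(match.group(3)) if match else None


def count_entries(dict_paths):
    """Count the lines in the given dict files."""
    total = 0
    for path in dict_paths:
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for _ in handle)
    return total


if __name__ == "__main__":
    final = "198.9 kB. 4147 states, 12622 arcs, 340121 paths."
    # Line counts taken from the wc -l output above.
    print(paths_from_summary(final) == 272067 + 68054)
```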

As for simplification: yes, that can be done. I have already eliminated one of the scripts.

leoalenc commented 1 year ago

@arademaker, from a logical point of view, compiling the lexical entries into a finite-state transducer, as these scripts do, involves the following steps:

  1. Converting the lexical entries to spaced-text format. This is done by the Python script.
  2. Compiling several smaller transducers from the spaced-text files, followed by the union of those transducers. This is done by a bash script, which assembles a Foma script and then runs it.

This second step could of course also have been done in Python, but I found it easier in bash.

arademaker commented 1 year ago

The XFST license restricted its use to non-commercial applications. FOMA is licensed under the Apache License, version 2, which is more flexible.

The Haskell code in the code folder replaces the BuildSpacedText.py script. It reads the dict files and produces the spaced-text format as described above.

The process described above and the scripts mentioned are used to compile an FST for each file and combine them into a final one, working around the limited default STACK size in the FOMA source code. In the README.org file (root folder) and here, I reported that I was able to recompile FOMA from its source with an increased STACK size. Using that build, I could compile the final FST of the complete Morpho-BR. Please take a look at README.org for more details.

I admit that the current compile.sh is less robust and parameterized than the bash and Python scripts removed from the repository. But it is also more straightforward to understand.

The Haskell code could have been a script, but I started a library instead. This code can be expanded to incorporate more functionality and to integrate other scripts in the scripts tools folder. I still need to document how to compile the Haskell code, but its use is exemplified in compile.sh.

In the next release, we can also consider attaching the FOMA binary to facilitate the use of the resource.

Commit 8613d96 closes this issue.