apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Implement a tool to extract segments (morphemes) from .dix files #88

Open ftyers opened 4 years ago

ftyers commented 4 years ago

In addition to #78, it would be great to have a tool (let's call it lt-segment) that calculates a segment vocabulary from a .dix file. E.g.

...
<pardef n="cat__n">
<e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
<e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="m/ouse__n">
<e><p><l>ouse</l><r>ouse<s n="n"/><s n="sg"/></r></p></e>
<e><p><l>ice</l><r>ouse<s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="happ/y__adj">
<e><p><l>y</l><r>y<s n="adj"/></r></p></e>
<e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
<e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e>
</pardef>

<e><i>cat</i><par n="cat__n"/></e>
<e><i>bat</i><par n="cat__n"/></e>
<e><i>happ</i><par n="happ/y__adj"/></e>
<e><i>eas</i><par n="happ/y__adj"/></e>
<e><i>m</i><par n="m/ouse__n"/></e>
<e><i>l</i><par n="m/ouse__n"/></e>

This would produce something like:

cat bat happ eas m l @s @ouse @ice @y @ier @iest

It would also be good to output the frequency of each segment.
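
A rough sketch of the extraction described above, in Python rather than as a real lttoolbox tool. Everything here is an assumption for illustration: the function names `left_text` and `extract_segments` are hypothetical, the sample wraps a shortened version of the example in `<dictionary>`, `<pardefs>` and `<section>` so that it parses as a complete document, and "frequency" is read simply as the number of times a segment occurs in the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened version of the example, wrapped so it parses as one XML document.
SAMPLE = """
<dictionary>
  <pardefs>
    <pardef n="cat__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
    <pardef n="m/ouse__n">
      <e><p><l>ouse</l><r>ouse<s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>ice</l><r>ouse<s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e><i>cat</i><par n="cat__n"/></e>
    <e><i>bat</i><par n="cat__n"/></e>
    <e><i>m</i><par n="m/ouse__n"/></e>
  </section>
</dictionary>
"""


def left_text(entry):
    """Surface-side text of an <e>: <i> text plus the <l> text of each <p>."""
    parts = []
    for child in entry:
        if child.tag == "i":
            parts.append("".join(child.itertext()))
        elif child.tag == "p":
            left = child.find("l")
            if left is not None:
                parts.append("".join(left.itertext()))
    return "".join(parts)


def extract_segments(dix_text):
    """Return a Counter of segments: stems first, then '@'-prefixed suffixes."""
    root = ET.fromstring(dix_text)
    # Remember which <e> elements live inside a <pardef>.
    in_pardef = {id(e) for p in root.iter("pardef") for e in p.iter("e")}
    counts = Counter()
    # Stems: entries outside any paradigm definition.
    for entry in root.iter("e"):
        if id(entry) not in in_pardef:
            seg = left_text(entry)
            if seg:
                counts[seg] += 1
    # Suffixes: non-empty left sides of paradigm entries, marked with '@'.
    for pardef in root.iter("pardef"):
        for entry in pardef.iter("e"):
            seg = left_text(entry)
            if seg:
                counts["@" + seg] += 1
    return counts


if __name__ == "__main__":
    print(" ".join(extract_segments(SAMPLE)))
```

Run on the shortened sample, this prints `cat bat m @s @ouse @ice`. The sketch ignores nested `<par>` references inside paradigms, `<re>` entries and restrictions such as `r="LR"`, all of which a real tool would need to handle.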

ftyers commented 4 years ago

Hmm, with the addition of morpheme boundaries (#89), this should probably just calculate the segments from the <m/> boundaries. It could also have a "heuristic" mode that adds boundaries based on paradigm breaks.
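
A minimal sketch of the boundary-based mode, under a loud assumption: that a boundary is written as an empty `<m/>` element inside the surface side (`<l>` or `<i>`) of an entry. The issue does not spell out where `<m/>` may occur, and `split_at_boundaries` is a hypothetical helper name, not part of lttoolbox.

```python
import xml.etree.ElementTree as ET


def split_at_boundaries(elem, boundary_tag="m"):
    """Split the text content of elem into segments at each boundary element."""
    segments = []
    current = elem.text or ""
    for child in elem:
        if child.tag == boundary_tag:
            # A boundary closes the current segment and starts a new one.
            segments.append(current)
            current = ""
        else:
            # Any other child (e.g. <b/>) contributes its own text, if any.
            current += "".join(child.itertext())
        # Text following the child belongs to the segment now being built.
        current += child.tail or ""
    segments.append(current)
    # Drop empty segments, e.g. from an empty <l></l>.
    return [s for s in segments if s]


if __name__ == "__main__":
    print(split_at_boundaries(ET.fromstring("<l>cat<m/>s</l>")))
```

For the hypothetical input `<l>cat<m/>s</l>` this yields `['cat', 's']`. A left side without any `<m/>` comes back as a single segment, which is the case the "heuristic" mode would then have to split on paradigm breaks instead.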