Language analysis improvements

valdisvi commented 7 years ago

Language analysis and spelling decisions could be improved by introducing the following new features:

[ ] extend the verb follows/noun follows marks to more/arbitrary flags, which can then be used to define different pronunciation rules for homonyms
[x] J statement as precondition, to allow choosing pronunciation from preceding word.
[x] J statement should support letter groups, e.g. (JL01, as marking letters. This could help with handling names of numbers as different words (#83).
[ ] possibility to go back to the start of the rules and redo the analysis (e.g. issue #121), not only after removing prefixes/suffixes. Could be a performance drain if used improperly.
[x] #489 .replace rule with extended trace
[ ] .replace rule after looking in ..._list files
[ ] replace rule extended to replace not only single characters, but groups of characters
[ ] also probably replace using matching rules
[ ] _list extended to mark arbitrary defined word types (e.g. $units #115) and by comparing only root part of the word (i.e. partial match without pre/suffixes). See issue #263 for details.
[ ] output (prosody data) extended to mark syllables in more/arbitrarily defined ways for different pronunciations (e.g. high/low pitch for Chinese etc.)
[x] Fix issue #196 Word end mark _ doesn't work properly with ~ character group.
[ ] Common rule for the stress decision, applied before or after the specific spelling decision for a word is made. E.g. to put stress on the penultimate syllable in Italian (#80) as a common rule.
[ ] Improved support for word boundaries in ..._list files. See, for example, the workaround for German ja.

rhdunn commented 7 years ago

Using .replace to expand to multiple letters is working for me (i.e. the replace rules in en_rules). Are there specific cases that are not working?

valdisvi commented 7 years ago

Yes, because in compiledict.c the bytes are packed into an integer by the utf8_in function, and then only these 4 bytes are written with Write4Bytes. That produces a wrong result if there are too many "meaningful" bytes in the from or to part of a replacement. So a universal .replace implementation needs to replace an arbitrary number of from bytes with an arbitrary number of to bytes. To test it, just add a rule, e.g.

.replace
 æ    are
 are  usi
 ša   ra
//etc. with even more bytes in from or to part of replacement
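
To make the size limit concrete, here is a minimal Python sketch of the general idea. It is not the compiledict.c code: all function names (pack_fixed4, unpack_fixed4, pack_rules, unpack_rules) and the length-prefixed layout are hypothetical and only illustrate why a fixed 4-byte field cannot hold an arbitrary from/to string, and what a variable-length encoding could look like instead.

```python
# Illustrative sketch only; names and layout are assumptions, not espeak-ng code.

def pack_fixed4(text: str) -> bytes:
    """Store a string in a fixed 4-byte field; bytes beyond the 4th are lost."""
    raw = text.encode("utf-8")
    return raw[:4].ljust(4, b"\x00")


def unpack_fixed4(field: bytes) -> str:
    """Read the string back; a truncated UTF-8 sequence becomes U+FFFD."""
    return field.rstrip(b"\x00").decode("utf-8", errors="replace")


def pack_rules(rules):
    """Hypothetical variable-length layout: 1 length byte, then the UTF-8 bytes,
    for the 'from' part and then the 'to' part of every rule."""
    out = bytearray()
    for src, dst in rules:
        for part in (src, dst):
            raw = part.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("part does not fit a 1-byte length prefix")
            out.append(len(raw))
            out += raw
    return bytes(out)


def unpack_rules(buf: bytes):
    """Inverse of pack_rules: returns a list of (from, to) tuples."""
    rules = []
    pos = 0
    while pos < len(buf):
        pair = []
        for _ in range(2):
            n = buf[pos]
            pair.append(buf[pos + 1:pos + 1 + n].decode("utf-8"))
            pos += 1 + n
        rules.append(tuple(pair))
    return rules


if __name__ == "__main__":
    # "šaš" is 5 bytes in UTF-8, so it does not survive the 4-byte field.
    for word in ("are", "ša", "šaš"):
        print(word, len(word.encode("utf-8")), unpack_fixed4(pack_fixed4(word)))
    rules = [("æ", "are"), ("are", "usi"), ("ša", "ra"), ("šaš", "longer text")]
    print(unpack_rules(pack_rules(rules)) == rules)
```

Note that each part of the three rules above still fits into 4 UTF-8 bytes (æ and š take 2 bytes each), so the problem only shows up with longer parts such as the 5-byte "šaš" used in the sketch.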

espeak-ng / espeak-ng

Language analysis improvements #199