apertium / organisation

Second point of contact for all things Apertium
https://apertium.org/

Wide ranging problem in regexes in lexc #25

Closed: ftyers closed this issue 3 years ago

ftyers commented 3 years ago

Some potentially confusing regex syntax in lexc files is causing errors such as: Error: Invalid dictionary (hint: the left side of an entry is empty)

See, for example, apertium-tat commit c5f7e7660096575f20640cd563885dab038f881d. The following entries across repositories show the same pattern:

./apertium-krc/apertium-krc.krc.lexc:<(a | å | b | c | d | e | f | g | h | i | j | k | l | m | n | o | ø | p | q | r | s | t | u | v | w | x | y | z)+> NP-UNK ;
./apertium-chv/apertium-chv.chv.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z)+> NP-UNK ;
./apertium-kir/apertium-kir.kir.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z)+> NP-UNK ;
./apertium-bak/apertium-bak.bak.lexc:<(M | D | C | L | X | V | I)+> NUM-ROMAN ; ! ""
./apertium-bak/apertium-bak.bak.lexc:<( %* | %# | © | %+ | • | %& | = )+> PUNCT ; ! Use/MT
./apertium-bak/apertium-bak.bak.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z)+> NP-UNK ;
./apertium-kaz/dev/guesser/guesser.lexc:<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы | ъ | э | ю | я)+> N1 ;
./apertium-kaz/dev/guesser/guesser.lexc:<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы | ъ | э | ю | я)+> A1 ; 
./apertium-kaz/dev/guesser/guesser.lexc:<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы | ъ | э | ю | я)+> V-TV ;
./apertium-kaz/dev/guesser/guesser.lexc:<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы | ъ | э | ю | я)+> V-IV ;  
./apertium-kaz/apertium-kaz.kaz.lexc:!<(а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н | ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы | ъ | э | ю | я)+> N1 ; 
./apertium-kaz/apertium-kaz.kaz.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z)+> NP-UNK ;
./apertium-kaz/apertium-kaz.kaz.lexc:<( а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н |
./apertium-tyv/apertium-tyv.tyv.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | ä | ö | ü | ï | š | č | ı | î | á | ŋ | ç | ş | ğ | é | í | ó | ú | ð | â | ā | ã | ǔ | ū | ō )+> NP-UNK ;
./apertium-bua/apertium-bua.bua.lexc:<([%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]+) %- ([%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]+)>  DIGITLEX ; ! Use/Circ
./apertium-uig/apertium-uig.uig.lexc:<(a|á|à|ă|â|ǎ|ä|ā|æ|b|c|ć|č|ç|d|e|é|è|ê|ě|ë|ə|f|g|ğ|h|i|í|ì|ǐ|ï|ī|ı|j|k|l|m|n|ñ|o|ó|ò|ộ|ǒ|ö|ø|ō|p|q|r|s|ŝ|š|ş|t|ţ|u|ú|ù|û|ǔ|ü|ū|v|w|x|y|ý|z)+> BARB ;  ! Use/Ortho
./apertium-uig/apertium-uig.uig.lexc:<(а|б|в|г|ғ|д|е|ә|ж|җ|з|и|й|к|қ|л|м|н|ң|о|ө|п|р|с|т|у|ү|ф|х|һ|ц|ч|ш|щ|ы|ь|э|ю|я)+> BARB ; ! Use/Ortho
./apertium-tur/apertium-tur.tur.lexc:<(a|á|à|ă|â|ǎ|ä|ā|æ|b|c|ć|č|ç|d|e|é|è|ê|ě|ë|ə|f|g|ğ|h|i|í|ì|ǐ|ï|ī|ı|j|k|l|m|n|ñ|o|ó|ò|ộ|ǒ|ö|ø|ō|p|q|r|s|ŝ|š|ş|t|ţ|u|ú|ù|û|ǔ|ü|ū|v|w|x|y|ý|z)+> BARB ; ! Dir/RL
./apertium-tur/dev/25072012-apertium-tur.tur.lexc:<(%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+>  DIGITLEX ; ! Use/Circ
./apertium-tuk/apertium-tuk.tuk.lexc:<(%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+>  DIGITLEX ; ! Use/Circ
./apertium-kan/apertium-kan.kan.lexc:!<(a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z)+>  __eng;
./apertium-sah/apertium-sah.sah.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | ö | ə | é | í | ï | ü )+> NP-UNK ;
./apertium-sah/dev/guesser/guesser.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | ö | ə | é | í | ï | ü )+> N1 ;
./apertium-sah/dev/guesser/guesser.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | ö | ə | é | í | ï | ü )+> V-TV ;
./apertium-sah/dev/guesser/guesser.lexc:<(a | b | c | d | e | f | g | h | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | ö | ə | é | í | ï | ü )+> V-IV ;

mr-martian commented 3 years ago

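Since parentheses mark optionality in these regexes, ( ... )+ is equivalent to [ ... ]* and can match the empty string, which is what triggers the "left side of an entry is empty" error. It seems to me that this is almost certainly a mistake. I can go through and fix all of these if no one disagrees.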
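
For illustration only, here is a minimal Python sketch of one way such entries could be rewritten mechanically. It assumes the intended fix is to swap the optional group ( ... ) for a plain grouping [ ... ], so that the + really means "one or more"; the thread does not say exactly what replacement was applied. The function name fix_optional_plus is hypothetical, and the sketch only handles the simple case where the whole regex is <( ... )+> with no nested or escaped parentheses.

```python
import re

# Matches a lexc regex entry of the form <( ... )+> whose body contains no
# nested parentheses or angle brackets. In xfst-style regexes, ( ... ) marks
# optionality, so ( ... )+ can match the empty string.
OPTIONAL_PLUS = re.compile(r"<\(([^()<>]*)\)\+>")


def fix_optional_plus(line: str) -> str:
    """Rewrite <( ... )+> as <[ ... ]+>, leaving everything else untouched.

    Hypothetical helper: assumes square brackets are the intended grouping.
    """
    # A function replacement avoids any backslash handling in the body text.
    return OPTIONAL_PLUS.sub(lambda m: "<[" + m.group(1) + "]+>", line)
```

Entries such as the apertium-bua one, where the parenthesised groups are only part of a longer regex, are deliberately left unchanged by this pattern and would need a manual look.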

TinoDidriksen commented 3 years ago

Doesn't appear anyone disagreed.

mr-martian commented 3 years ago

Every top-directory .lexc file that has a regex containing )+ has now been corrected.
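
A check along these lines could be used to confirm that no such entries remain. This is a rough sketch, not the procedure actually used in the thread: the name find_suspect_lines is hypothetical, and the heuristic simply flags any non-commented line whose <...> regex contains )+.

```python
import re

# Flags a <...> regex that contains ")+" anywhere inside it.
SUSPECT = re.compile(r"<[^<>]*\)\+[^<>]*>")


def find_suspect_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) pairs for suspicious lexc entries.

    Hypothetical helper: lines that start with "!" are lexc comments and are
    skipped; trailing comments on a line are not treated specially.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("!"):
            continue  # whole-line comment, e.g. a commented-out entry
        if SUSPECT.search(stripped):
            hits.append((number, stripped))
    return hits
```

To cover the top-directory files, the function could be applied to the contents of each path returned by pathlib.Path(".").glob("apertium-*/*.lexc"), which would skip the dev/ subdirectories.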

Yes, CI is failing for all of them, but in all cases except -krc it was failing on the previous commit as well. (-krc manages to fail on cloning the repo, so my changes are probably not the issue.)

TinoDidriksen commented 3 years ago

It's only CircleCI that fails - Travis works. So it's transient.

TinoDidriksen commented 3 years ago

And nightly builds were all happy as well, so it's just CircleCI being silly.