segmentation fault when invoked with a missing [[tokenizer]] section in the configuration

proycon commented 4 years ago

(gdb) run /home/proycon/work/glem/glem/glem.py -f greek.txt
Starting program: /data2/dev/bin/python3 /home/proycon/work/glem/glem/glem.py -f greek.txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
INITIALISE FROG
20191101:111809:556:Missing [[parser]] section in config file.
20191101:111809:556:Disabled the parser.
[New Thread 0x7ffff182d700 (LWP 4100640)]
[New Thread 0x7ffff102c700 (LWP 4100641)]
[New Thread 0x7ffff082b700 (LWP 4100642)]
[New Thread 0x7ffff002a700 (LWP 4100643)]
[New Thread 0x7fffef829700 (LWP 4100644)]
[New Thread 0x7fffef028700 (LWP 4100645)]
[New Thread 0x7fffee827700 (LWP 4100646)]
20191101:111809:559:mblem-:Initiating lemmatizer...
ucto: textcat configured from: /data2/dev/share/ucto/textcat.cfg
20191101:111809:806:tagger-tagger-:reading subsets from /home/proycon/work/glem/glem/pretrained_models/herodotus//subsets
20191101:111809:806:tagger-tagger-:reading constraints from /home/proycon/work/glem/glem/pretrained_models/herodotus//constraints
20191101:111810:149:Fri Nov  1 11:18:10 2019 Initialization done.
READING /home/proycon/work/glem/glem/list_proiel_word_lemma_POS_freq
READING /home/proycon/work/glem/glem/list_proiel_perseus_merged_word_lemma_POS_nofreq
READING /home/proycon/work/glem/glem/extra-wlt.txt

LEMMATISING greek.txt

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
Tokenizer::TokenizerClass::reset (this=0x7fffe8000ef0, lang="default") at tokenize.cxx:232
232         settings[lang]->quotes.clearStack();
(gdb) bt
#0  Tokenizer::TokenizerClass::reset (this=0x7fffe8000ef0, lang="default") at tokenize.cxx:232
#1  0x00007ffff6582af6 in UctoTokenizer::tokenize_stream (this=0x7fffe8000b60, is=...) at /usr/include/c++/9.2.0/bits/char_traits.h:300
#2  0x00007ffff64de605 in FrogAPI::run_text_engine (this=0x555555847d80, infilename=..., os=...) at FrogAPI.cxx:1740
#3  0x00007ffff64e4705 in FrogAPI::FrogFile (this=0x555555847d80, infilename="/tmp/frogRq3cJZ", os=...) at FrogAPI.cxx:1775
#4  0x00007ffff64e5030 in FrogAPI::Frogtostringfromfile (this=0x555555847d80, infilename="/tmp/frogRq3cJZ") at FrogAPI.cxx:1019
#5  0x00007ffff64e5423 in FrogAPI::Frogtostring (this=0x555555847d80, 
    s="Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους,"...) at FrogAPI.cxx:1004
#6  0x00007ffff6ab4e50 in __pyx_pf_4frog_4Frog_2process_raw (__pyx_v_text=<optimized out>, __pyx_v_self=0x555555962100) at frog_wrapper.cpp:3770
#7  __pyx_pw_4frog_4Frog_3process_raw (__pyx_v_self=0x555555962100, __pyx_v_text=<optimized out>) at frog_wrapper.cpp:3726
#8  0x00007ffff6abef60 in __Pyx_PyObject_CallMethO (arg=0x555555963950, func=0x7fffbdc1f6e0) at frog_wrapper.cpp:7074
#9  __Pyx_PyObject_CallOneArg (arg=<optimized out>, func=0x7fffbdc1f6e0) at frog_wrapper.cpp:7150
#10 __pyx_pf_4frog_4Frog_6process (__pyx_v_text=0x555555963950, __pyx_v_self=<optimized out>) at frog_wrapper.cpp:4628
#11 __pyx_pw_4frog_4Frog_7process (__pyx_v_self=<optimized out>, __pyx_v_text=<optimized out>) at frog_wrapper.cpp:4271
#12 0x00007ffff7b37463 in _PyMethodDef_RawFastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#13 0x00007ffff7b6976f in _PyMethodDescr_FastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#14 0x00007ffff7b698f9 in ?? () from /usr/lib/libpython3.7m.so.1.0
#15 0x00007ffff7ba0d95 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#16 0x00007ffff7b588b3 in _PyFunction_FastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#17 0x00007ffff7b69820 in ?? () from /usr/lib/libpython3.7m.so.1.0
#18 0x00007ffff7ba0cfd in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#19 0x00007ffff7b57888 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#20 0x00007ffff7b58a53 in _PyFunction_FastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#21 0x00007ffff7b69820 in ?? () from /usr/lib/libpython3.7m.so.1.0
#22 0x00007ffff7ba0cfd in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#23 0x00007ffff7b57888 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#24 0x00007ffff7b587aa in PyEval_EvalCodeEx () from /usr/lib/libpython3.7m.so.1.0
#25 0x00007ffff7bed0fc in PyEval_EvalCode () from /usr/lib/libpython3.7m.so.1.0
#26 0x00007ffff7c36a51 in ?? () from /usr/lib/libpython3.7m.so.1.0
#27 0x00007ffff7c370db in PyRun_FileExFlags () from /usr/lib/libpython3.7m.so.1.0
#28 0x00007ffff7c3e5c7 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.7m.so.1.0
#29 0x00007ffff7c40e30 in ?? () from /usr/lib/libpython3.7m.so.1.0
#30 0x00007ffff7c411fc in _Py_UnixMain () from /usr/lib/libpython3.7m.so.1.0
#31 0x00007ffff7da9153 in __libc_start_main () from /usr/lib/libc.so.6
#32 0x000055555555505e in _start ()

Input:

Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον.

proycon commented 4 years ago

It seems they don't explicitly set a tokeniser (neither by language nor by config) in https://github.com/GreekPerspective/glem/blob/master/glem/pretrained_models/herodotus/frog.cfg.template so I assume it defaults to Dutch? I see the sets are left as is too, that's not good.

proycon commented 4 years ago

Setting an explicit tokconfig-generic in glem's frog.cfg seems to solve this.

kosloot commented 4 years ago

In general, it would be helpful to have a clear ucto only proof of the problem. Now it might also be a glem or a frog issue.

But: not having a tokenizer config in frog AT ALL should be signaled at frog startup, just as it is for the parser ("20191101:111809:556:Missing [[parser]] section in config file." followed by "20191101:111809:556:Disabled the parser."). I am surprised not to see that in the log above...

The missing [[tokenizer]] section should put the tokenizer in passthru mode, NOT dutch.

Could you test by explicitly setting --skip=t on the frog command line?

kosloot commented 4 years ago

Ok, so this seems to be a Frog problem. The passthru mode seems not to be set correctly when the [[tokenizer]] section is missing. Running with --skip=t (which does set passthru) DOES work. Quite spooky.
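
For illustration only: the crash site in the backtrace, `settings[lang]->quotes.clearStack()` with lang="default", would fit a lookup in a container of per-language settings that has no entry for that key. The sketch below assumes such a container is a map from language name to an owning pointer; that is a guess from the one line gdb shows, not a statement about ucto's real data structures. `Quoting`, `Setting` and `ToyTokenizer` are hypothetical stand-ins. It shows a guarded `reset` that looks the key up with `find` instead of `operator[]`, which would silently insert a null pointer for a missing key.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; these are NOT ucto's classes.
struct Quoting {
    int depth = 0;
    void clearStack() { depth = 0; }
};

struct Setting {
    Quoting quotes;
};

class ToyTokenizer {
public:
    // Register settings for a language (assumed to happen while reading a config).
    void addSetting(const std::string& lang) {
        settings[lang] = std::make_unique<Setting>();
    }

    // Guarded reset: returns false instead of dereferencing a missing entry.
    // find() does not insert, so an unconfigured language leaves the map unchanged.
    bool reset(const std::string& lang) {
        auto it = settings.find(lang);
        if (it == settings.end() || !it->second) {
            return false;  // nothing configured for this language
        }
        it->second->quotes.clearStack();
        return true;
    }

    std::size_t size() const { return settings.size(); }

private:
    std::map<std::string, std::unique_ptr<Setting>> settings;
};
```

With no settings registered, `reset("default")` on a `ToyTokenizer` returns false and leaves the map empty; after `addSetting("default")` it returns true. Whether the actual fix took this shape is not stated in this thread.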

kosloot commented 4 years ago

Solved in Ucto.

LanguageMachines / frog

segmentation fault when invoked with a missing [[tokenizer]] section in the configuration #83