Open eagad opened 3 years ago
It's because you have an unescaped / in your input string.
How would you escape it?
apertium.analyze('en', r'Hi\/Hello')
throws the same exception
'Hi\\/Hello'
the escape has to get to the underlying pipe
This still didn't work
apertium.analyze('en', 'Hi\\/Hello')
terminate called after throwing an instance of 'Exception' what(): Error: Malformed input stream. Aborted (core dumped)
Also, is there a specific list of characters that need to be escaped?
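A note on why the two attempts above behave identically: the raw-string and double-backslash spellings are the same Python string. The check below is plain Python and needs no Apertium installation; the variable names are only illustrative.

```python
# Both literals from the thread denote the same 9-character string:
# H i \ / H e l l o
raw_form = r'Hi\/Hello'
escaped_form = 'Hi\\/Hello'

print(raw_form == escaped_form)  # True
print(len(raw_form))             # 9
print(raw_form)                  # Hi\/Hello
```

So in both cases exactly one backslash reaches apertium.analyze. What happens to it between that call and the underlying pipe is not shown in this thread.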
Try adding another backslash? :)
It seems that backslashes are only interpreted as literal backslashes here. Any ideas other than removing all the forward slashes from the text I am trying to process?
This probably indicates that there should be a way to have analyze() invoke the deformatters, if there isn't one already.
Also, I think this issue actually belongs on https://github.com/apertium/apertium-python, but for some reason I am not able to transfer it there.
Dear colleagues, thank you for your work.
How do I fix this? Is there a workaround, maybe?
Minimal example:
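import re
from typing import List

import apertium
from streamparser import LexicalUnit  # assumed import path (apertium-streamparser package)

# Characters that are reserved in the Apertium stream format.
ESC_PATTERN = re.compile("([/^$<>*{}\\\\@#+~])", re.UNICODE)
analyzer = apertium.Analyzer("kir")
text = "Кыргызстанда ВИЧ/СПИД менен күрөшүүгө акча жетишпейт."
# Prefix every reserved character with backslashes before analysis.
text = re.sub(ESC_PATTERN, r"\\\\\1", text.strip())
print(text)
analysis: List[LexicalUnit] = analyzer.analyze(text)
print([lexical_unit.wordform for lexical_unit in analysis])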
Output
Кыргызстанда ВИЧ\\/СПИД менен күрөшүүгө акча жетишпейт.
Error: malformed input stream: Found unexpected character / unescaped in stream
: iostream error
['Кыргызстанда', 'ВИЧ', '\\\\/\\\\<sent>']
Thanks in advance.
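A side note on the printed output above: the replacement template r"\\\\\1" inserts two backslashes before each matched character, not one. The sketch below uses only Python's re module to show the difference; the single-backslash variant is a hypothetical alternative, not something confirmed to work with apertium-python.

```python
import re

# Same character class as in the minimal example above.
ESC_PATTERN = re.compile("([/^$<>*{}\\\\@#+~])", re.UNICODE)

sample = "ВИЧ/СПИД"

# Replacement template from the minimal example: two literal backslashes.
double = re.sub(ESC_PATTERN, r"\\\\\1", sample)
# Hypothetical variant: one literal backslash.
single = re.sub(ESC_PATTERN, r"\\\1", sample)

print(double)  # ВИЧ\\/СПИД
print(single)  # ВИЧ\/СПИД
```

With two backslashes, a parser that treats the backslash as its escape character would presumably read the pair as one escaped backslash and then meet a bare /, which would be consistent with the "unexpected character / unescaped" message. Whether the single-backslash form reaches the analyzer unchanged is not established here; the earlier 'Hi\\/Hello' attempt in this thread suggests it may not.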
My own workaround is the following:
from typing import List

import apertium
from streamparser import LexicalUnit  # assumed import path (apertium-streamparser package)

# Reserved characters and a plain-word placeholder for each of them.
SPECIAL_CHARACTERS = list("/^$<>*{}\\@#+~")
REPLACEMENTS = ["slashchar", "capchar", "dollarchar", "lesschar", "morechar", "astchar",
                "curlyleftchar", "curlyrightchar", "backslashchar", "atchar", "hashchar",
                "pluschar", "tildechar"]
assert len(SPECIAL_CHARACTERS) == len(REPLACEMENTS)
spchar2code = {ch: co for ch, co in zip(SPECIAL_CHARACTERS, REPLACEMENTS)}
code2spchar = {co: ch for ch, co in zip(SPECIAL_CHARACTERS, REPLACEMENTS)}
analyzer = apertium.Analyzer("kir")
text = "Кыргызстанда ВИЧ/СПИД менен күрөшүүгө акча жетишпейт."
# Swap each reserved character for its placeholder, padded so it becomes a separate token.
for spc in spchar2code:
    text = text.replace(spc, f" {spchar2code[spc]} ")
print(text)
analysis: List[LexicalUnit] = analyzer.analyze(text)
# Map placeholder tokens back to the characters they stand for.
tokens = [lu.wordform if lu.wordform not in code2spchar else code2spchar[lu.wordform] for lu in analysis]
print(tokens)
but clearly that's not how the cool kids should do it.
I would maybe just send it through apertium-destxt, though I don't know whether apertium-python has some built-in way to do that or you have to subprocess.communicate yourself.
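For the subprocess route, a minimal sketch could look like the following. It assumes apertium-destxt is on PATH and reads text from stdin, as Apertium's command-line filters usually do; pipe_through is a hypothetical helper name, and whether the deformatted output can then be handed to apertium-python's analyzer as-is is not confirmed in this thread.

```python
import shutil
import subprocess
import sys
from typing import Sequence


def pipe_through(cmd: Sequence[str], text: str) -> str:
    """Send text to cmd's stdin and return its stdout; raise if cmd fails."""
    result = subprocess.run(
        list(cmd),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raises CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    sample = "Кыргызстанда ВИЧ/СПИД менен күрөшүүгө акча жетишпейт."
    if shutil.which("apertium-destxt"):
        # The deformatter is expected to escape reserved characters itself.
        print(pipe_through(["apertium-destxt"], sample))
    else:
        print("apertium-destxt not found on PATH", file=sys.stderr)
```

The helper takes the command as a parameter so the same function can wrap other stream filters, or be exercised without Apertium installed.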
Thank you, will give it a try!
I am running the Apertium analyzer from a Python script. I get this exception, which terminates the script immediately. I am not able to catch it inside Python; it seems to be happening in C++ and doesn't get handled. How can I handle it?
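On catching the crash: an abort raised in native code ends the whole interpreter, so a try/except in the same process never sees it. One general workaround, sketched here with the standard library only, is to run the risky call in a child Python process and inspect its exit status. run_isolated is a hypothetical helper name, and os.abort() merely stands in for the native crash.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, stdin_text: str = "", timeout: float = 60.0) -> Tuple[int, str, str]:
    """Run a Python snippet in a child interpreter; return (exit status, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_text,
        capture_output=True,
        encoding="utf-8",
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # The child aborts, but the parent keeps running and can react.
    status, out, err = run_isolated("import os; os.abort()")
    if status != 0:
        print(f"child failed with exit status {status}; parent is still running")
```

In practice the snippet passed in would import apertium, read the text from stdin, and print the analyses; that structure is an assumption, not something taken from apertium-python's documentation.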
To replicate the issue: