Closed pirolen closed 3 years ago
As far as I can see this is fixed now in the master branch
I updated my LaMachine dev, and the fix works, thanks!
@proycon: After updating, calling python-ucto from a .py script gives an error:
ucto: textcat configured from: /home/ubuntu/piro/projects/lamadev/lmdev/share/ucto/textcat.cfg
free(): invalid next size (normal)
Aborted
In the script I have:
configurationfile = "tokconfig-deu"
tokenizer = ucto.Tokenizer(configurationfile)
In the python interpreter this works fine.
Does it also work fine when you close the interpreter? I encountered such free()
issues before.
Usually a forced recompilation of python-ucto helps, you may want to explicitly delete $LM_PREFIX/src/python-ucto
and $LM_PREFIX/lib/python3.8/site-packages/*ucto*so
(the python version number may differ) and run lamachine-update
again. Running it with the force=1
option should do it automatically but takes longer.
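Before deleting anything, it may help to see what is actually there. The following is a minimal sketch, not part of LaMachine: the helper name `find_stale_ucto_artifacts` is hypothetical, and it assumes `LM_PREFIX` is set in the activated environment, as the paths above suggest. It only lists the python-ucto source checkout and any compiled `*ucto*so` modules; it does not remove them.

```python
import os
from pathlib import Path


def find_stale_ucto_artifacts(prefix):
    """List the python-ucto source checkout and compiled modules under prefix.

    Hypothetical helper: it only reports paths and never deletes anything.
    """
    root = Path(prefix)
    found = []
    src = root / "src" / "python-ucto"
    if src.exists():
        found.append(src)
    # The Python version directory differs per install, so glob over python3.*
    found.extend(sorted(root.glob("lib/python3.*/site-packages/*ucto*so")))
    return found


if __name__ == "__main__":
    # Assumption: LM_PREFIX is exported by the activated LaMachine environment.
    prefix = os.environ.get("LM_PREFIX")
    if prefix:
        for path in find_stale_ucto_artifacts(prefix):
            print(path)
```

Anything it prints is a candidate for manual deletion before running lamachine-update again.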
Does it also work fine when you close the interpreter?
If I simply run the script from the command line (it contains nothing but the call to ucto), I get the error.
you may want to explicitly delete
$LM_PREFIX/src/python-ucto
and $LM_PREFIX/lib/python3.8/site-packages/*ucto*so
(the python version number may differ) and run lamachine-update
again.
OK, am updating after having deleted those two.
Before, I only did a partial lamachine-update (since the full update fails), and I will try a partial update again. I am not sure whether the issue is caused by that, i.e. by running lamachine-update --only languagemachines-python languagemachines-basic
If it does not help, will try the full update with force=1
option.
lamachine-update --only languagemachines-python languagemachines-basic
OK, probably not a surprise to you, this did not work out (got a No module named 'ucto'
afterwards).
Trying the full update.
Sorry, you need languagemachines-python-bindings
for this one.
Thanks! Updating that fails unfortunately, likely for the same reason as for the full update (python-frog; https://github.com/proycon/LaMachine/issues/196). Please find the log attached.
Ok, python-frog fails on:
no matching function for call to ‘FrogAPI::FrogAPI(FrogOptions&, TiCC::Configuration&, TiCC::LogStream*, TiCC::LogStream*)’
This has indeed changed in the more recent Frog.
I didn't realize python-frog uses this constructor. But it does, so some adjustments are needed.
@proycon sorry about this.
@pirolen I hope @proycon will fix this soon
This affects the development Frog only, right? We'll have to update the development python-frog then, but there is no need to rush a release?
@kosloot In the new situation there seems to be no way anymore to pass TiCC::Configuration& ? We need that for python-frog.
(I'm only using the dev LaMachine.)
I extended the FrogAPI with a constructor using TiCC::Configuration. This should help you fix this.
Ok, that was a bad idea. Reverted it.
@pirolen Issue proycon/python-frog#20 should be fixed now, the development lamachine should compile again.
You rock! Update and ucto work fine now. Will test the entire pipeline further.
Would there be a way with foliapy to access the text in a t-style element without the t-hspace? In the .text() method I did not find a corresponding parameter.
Why would you want to do this? The <t-hspace>
is inserted because a space is present in the original Abbyy file (at least that is intended) and that implies we must assume it was present in the original text too.
So why ditch it then?
Side note: <t-hspace>
is only used for trailing or leading spaces. So I assume that a simple String.trim() would do what you want?
I came across a case where the superscripted item ends in such a trailing space. The superscripts denote footnote reference indices, which I link to the corresponding note item in the footnote area by string matching. But the trailing space breaks the matching. I was just wondering whether I should find a foliapy-ic solution or just strip the space.
Sure, the space is present in the original (pls see screenshots).
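For the "just strip the space" route, a minimal sketch in plain Python could look like this. It assumes the superscript strings and the (index, note text) pairs have already been pulled out of the FoLiA document, for instance via `.text()`; the function name and the sample data are hypothetical, and no foliapy API is used.

```python
def link_footnote_refs(superscripts, notes):
    """Map each raw superscript string to the note whose index matches.

    Both sides are compared after stripping leading/trailing whitespace,
    so a trailing space from <t-hspace> no longer breaks the match.
    """
    by_index = {index.strip(): text for index, text in notes}
    links = {}
    for raw in superscripts:
        key = raw.strip()
        if key in by_index:
            links[raw] = by_index[key]
    return links


# Hypothetical data: "1 " carries the trailing space from the original scan.
refs = ["1 ", "2"]
notes = [("1", "First note"), ("2", "Second note")]
print(link_footnote_refs(refs, notes))
```

The raw strings are kept as dictionary keys, so the FoLiA text itself stays untouched and only the comparison ignores the whitespace.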
So the FoLiA reflects the original document. That is what we want. It's up to the user for more or alternative interpretations.
I assume we can close this discussion now.
I have just come across this: when t-style has a font_typeface Feature (e.g. with superscript as class), it seems that in certain contexts the styled item is not split from its left neighbor, despite the t-style having tag="token".
Two examples for correct split:
`
`
—>
and
`
`
—>
Two examples for no split:
`
`
—>
and
`
`
—>