LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

Discrepancy between foliapy and libfolia in stripping control characters in normalize_spaces() #55

Open proycon opened 5 months ago

proycon commented 5 months ago

normalize_spaces() is used in text validation. Currently foliapy (v2.5.11) and libfolia behave differently here regarding control characters:

This issue arose from @martinreynaert's data, where we see for example:

Expected: Vierstellen-Prädikate bildende Operator „ “ mit dem Zweistellen-Prädikat
Found: Vierstellen-Prädikate bildende Operator „“ mit dem Zweistellen-Prädikat     
******* DEVIATION POINT: Operator „<*HERE*>“ mit dem       

Character in question is a 0x7f (DELETE).

It also happens in an instance of Hebrew text (I transliterate the Hebrew because browsers are too smart in RTL rendering and mess up the point): <0x202d>Tun-<0x202d>Idash, which libfolia turns into Tun- Idash (it inserts an unwanted space). 0x202d is the LEFT-TO-RIGHT OVERRIDE control character.
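To make the discrepancy concrete, here is a minimal Python sketch of the two behaviours. It is not the actual foliapy or libfolia code: the helper names and the rule "control (Cc) and format (Cf) characters that are not whitespace" are assumptions chosen only to reproduce the two examples above.

```python
import unicodedata


def is_invisible_control(ch: str) -> bool:
    # Hypothetical rule: control (Cc) and format (Cf) characters, except
    # whitespace controls such as tab/newline, which should become a space.
    return unicodedata.category(ch) in ("Cc", "Cf") and not ch.isspace()


def normalize_strip(text: str) -> str:
    # Variant A (assumed): drop control characters, then collapse whitespace.
    cleaned = "".join(ch for ch in text if not is_invisible_control(ch))
    return " ".join(cleaned.split())


def normalize_as_space(text: str) -> str:
    # Variant B (assumed): turn control characters into a space first,
    # then collapse whitespace. This is what leaves the stray space.
    cleaned = "".join(" " if is_invisible_control(ch) else ch for ch in text)
    return " ".join(cleaned.split())


sample = "\u202dTun-\u202dIdash"
print(repr(normalize_strip(sample)))
print(repr(normalize_as_space(sample)))
```

Text validation compares the normalized strings, so any input containing such a character fails whenever the two sides disagree on variant A versus variant B.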

kosloot commented 5 months ago

Yes, you are right. Strange oversight, but until now it never caused problems. I am working on a fix (not that difficult), but this change has some ramifications. In particular, every file that bothered @martinreynaert should be rerun with ucto based on the fixed libfolia. Correcting them manually is also possible, of course. But to be sure, it would be helpful if @martinreynaert provided me with the original input files of the Wittgenstein and Kierkegaard examples from BEFORE ucto was run. I would like to be able to check whether ucto does the right thing now. Thanx.

kosloot commented 5 months ago

@proycon This introduces another interesting issue: should we preserve (some?) BiDi information? I think that this is in fact the right thing to do, but it is tricky. I experimented with keeping the LeftToRightOverride character in libfolia, which seems to work fine BUT also implies amending ucto. So it is something to do in a major release cycle, IF we want it. Any thoughts on this?
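As an illustration of what "keeping (some) BiDi information" could look like, here is a rough Python sketch. It is not what libfolia or ucto do. The function name and the choice to keep exactly the explicit embedding, override and isolate controls (selected through their Unicode bidirectional class) are assumptions.

```python
import unicodedata

# Bidirectional classes of the explicit directional formatting characters
# (U+202A..U+202E and U+2066..U+2069). Keeping exactly these is an assumption.
KEPT_BIDI_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def normalize_keep_bidi(text: str) -> str:
    kept = []
    for ch in text:
        if unicodedata.bidirectional(ch) in KEPT_BIDI_CLASSES:
            kept.append(ch)  # preserve directional overrides and the like
        elif unicodedata.category(ch) in ("Cc", "Cf") and not ch.isspace():
            continue  # drop other control/format characters such as 0x7f
        else:
            kept.append(ch)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join("".join(kept).split())


print(repr(normalize_keep_bidi("\u202dTun-\u202dIdash\x7f")))
```

With a rule like this the override characters survive normalization without introducing a space, while other control characters such as DELETE are still removed.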

proycon commented 5 months ago

I was having the same thoughts, yeah. Preserving the bidi information would indeed be best, so I'm not entirely happy with our solution now.

One can also argue that FoLiA itself could have explicit constructs for bidi information (in markup annotation), rather than leaving it to Unicode (like HTML does).

But unless there are real use cases for mixed bidirectional text I don't really want to make an issue out of this.

kosloot commented 5 months ago

Aren't the files from @martinreynaert examples of a use case? After removing the Right to Left stuff, displaying will be wrong, I assume. Still, I can live with that. But a fix seems possible, and not THAT complicated. And I would be very surprised if it broke a lot of FoLiA in the wild.