LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

Implement better on-the-fly validation for Text #9

Closed kosloot closed 6 years ago

kosloot commented 7 years ago

See https://github.com/proycon/folia/issues/24. libfolia should be more picky about invalid constructions.

kosloot commented 7 years ago

Basic checking is implemented: (optional)

Using low-level routines it is still possible to construct invalid FoLiA. This still needs some work. Maybe check on write too?

kosloot commented 7 years ago

To summarize: checking text has 2 important directions:

So the first problem is solved. The second is very hard though: you could argue that this is a quality problem of the tokenizer. But creating invalid FoLiA and not detecting that (directly) is undesirable. The library should not allow this. And to pinpoint the problem, it should detect it as soon as possible. But that is impossible to achieve: when adding words recursively to an already existing sentence, the 'correct' reading is reached only after the last word is added. What to do then?

kosloot commented 7 years ago

OK, for a start I implemented the scenario that checks whether an added text is a substring of the parent text. For Words it must be an exact substring. For other elements this is relaxed a bit by first removing all whitespace. This needs to be done because text can have embedded newlines, and the ucto tokenizer, for instance, is unaware of these.
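The two checks described above can be sketched roughly as follows. This is a minimal illustration, not libfolia's actual code: the function names are hypothetical, and it uses plain `std::string` with ASCII whitespace only, whereas a real implementation would need to be Unicode-aware.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper (not libfolia API): drop every ASCII whitespace character.
std::string strip_whitespace(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    std::copy_if(s.begin(), s.end(), std::back_inserter(out),
                 [](unsigned char c) { return !std::isspace(c); });
    return out;
}

// Strict check, as described for Words: the child text must occur verbatim
// somewhere in the parent text.
bool is_exact_substring(const std::string& parent, const std::string& child) {
    return parent.find(child) != std::string::npos;
}

// Relaxed check, as described for other elements: compare only after all
// whitespace has been removed from both strings.
bool is_relaxed_substring(const std::string& parent, const std::string& child) {
    return strip_whitespace(parent).find(strip_whitespace(child)) != std::string::npos;
}
```

With a parent text of "This is a\ntest.", the strict check accepts "test" but rejects "a test" because of the embedded newline, while the relaxed check accepts "a test". Note that removing whitespace entirely also makes the relaxed check accept strings that straddle word boundaries, such as "isa".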

kosloot commented 7 years ago

OK, that was a bad Bad BAD idea. ALL substrings must be (part of) the parent strings. If the parent has embedded newlines, then that MUST be reflected in nodes. That isn't easily fixed in libfolia, but it can be done in the programs built upon libfolia, as ucto does now after the latest patch. The relaxed check is removed again.

kosloot commented 7 years ago

In the end we stick to the 'normalized' string checking, as @proycon suggested in https://github.com/proycon/folia/issues/24. This is implemented now for FoLiA documents with version 1.5 or higher (which don't exist in the wild yet).
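As a rough sketch of what such a check could look like: the comment above does not spell out the normalization, so this example assumes it means collapsing every run of whitespace to a single space and trimming both ends. All function names are hypothetical, and the code handles ASCII whitespace only; it is not libfolia's implementation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: collapse each whitespace run to one space and trim
// leading and trailing whitespace (assumed meaning of 'normalized').
std::string collapse_whitespace(const std::string& s) {
    std::string out;
    bool pending_space = false;
    for (unsigned char c : s) {
        if (std::isspace(c)) {
            // Remember a separator only if some text has already been emitted.
            pending_space = !out.empty();
        } else {
            if (pending_space) out += ' ';
            pending_space = false;
            out += static_cast<char>(c);
        }
    }
    return out;
}

// The child text passes if its normalized form occurs in the normalized parent.
bool is_normalized_substring(const std::string& parent, const std::string& child) {
    return collapse_whitespace(parent).find(collapse_whitespace(child)) != std::string::npos;
}

// The check applies only to documents declaring version 1.5 or higher.
bool version_requires_check(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 5);
}
```

Under these assumptions, a parent text of "This is a\ntest." accepts "a test", since the newline normalizes to a single space, but rejects "isa", because word boundaries are preserved.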

kosloot commented 6 years ago

Closing, as this is implemented now.