cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Multi-segment graphemes #34

Open xrotwang opened 6 years ago

xrotwang commented 6 years ago

Handling of multi-segment graphemes is too opaque. Whenever I look at the segments code, I find it difficult to wrap my head around the layers of string splitting/concatenating. To me, it would seem natural that internally, the data the tokenizer creates is a list of lists:

[
    ['f', 'i', 'r', 's', 't'],
    ['w', 'o', 'r', 'd']
]

And whenever I try to implement it this way, I hit a wall, because the fact that even internally, the data looks like

"f i r s t # w o r d"

is actually exploited (and relied upon) in the orthography profiles: To specify a grapheme that is to be split into two segments, you could use this profile:

Grapheme Out
sch      s ch

But that's cheating, or at least hacky: when the parser encounters sch, it should append two segments to the output, but instead it appends one "segment", s ch, which just happens to look exactly like two segments in the serialized output.
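
To illustrate what I mean, here is a toy version of that string-based approach (just a sketch for this issue, not the actual segments code):

profile = {'sch': 's ch'}                     # Grapheme -> Out, both plain strings

def tokenize_as_string(word, profile):
    # greedy longest match over the input, collecting the Out *strings*
    out, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in profile:
                out.append(profile[chunk])    # 's ch' goes in as ONE item
                i += length
                break
        else:
            out.append(word[i])               # unlisted character, kept as-is
            i += 1
    # only the final join makes 's ch' look like two segments
    return ' '.join(out)

tokenize_as_string('schuh', profile)          # 's ch u h', but out was ['s ch', 'u', 'h']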

Even worse, there is no way to specify special cases using multi-segment graphemes in profiles. E.g. to differentiate the segmentation of sch in German "bischen" from "naschen" one has to use something like

Grapheme Out
bischen  b i s ch e n
sch      sch

Wouldn't it be cool if the same could be had with a profile like

Grapheme
b-i-s-ch-e-n
sch
ch
b
i
s
e
n

I think, with CSVW and the nice separator property, multi-segment graphemes could be handled fully transparently:

The Grapheme column of the profile above could be described as

{
    "name": "Grapheme",
    "propertyUrl": "http://cldf.clld.org/grapheme",
    "separator": "-"
}

Then the parser would read the first line as

grapheme = ['b', 'i', 's', 'ch', 'e', 'n']

Processing would happen as follows:
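
Roughly like this (all names here are made up for illustration, and the fallback for unlisted characters is just a placeholder):

# each row's Grapheme cell, read with the "-" separator, is a list of segments;
# for matching, the key is simply the concatenation of that list
rows = [
    ['b', 'i', 's', 'ch', 'e', 'n'],
    ['sch'], ['ch'],
    ['b'], ['i'], ['s'], ['e'], ['n'],
]
profile = {''.join(row): row for row in rows}

def tokenize(text, profile):
    # longest match over the raw string; every match contributes
    # len(row) segments to the output list - no string re-splitting involved
    result, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            chunk = text[i:i + length]
            if chunk in profile:
                result.extend(profile[chunk])
                i += length
                break
        else:
            result.append(text[i])   # unlisted character, kept as-is
            i += 1
    return result

tokenize('bischen', profile)   # ['b', 'i', 's', 'ch', 'e', 'n']
tokenize('naschen', profile)   # ['n', 'a', 'sch', 'e', 'n']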

With this scheme, we could even have cross-word graphemes, e.g.

Grapheme Out
u k      u-k

would tokenize zu klein as ['z', 'u k', 'l', 'e', 'i', 'n'] and transliterate it as ['z', 'u', 'k', 'l', 'e', 'i', 'n']. While somewhat artificial, this could be used to deal with degenerate cases in lexibank, where we sometimes get multi-word expressions when we only expect a single lexeme.
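
With the same sketch as above (again just an illustration; whether graphemes may contain spaces at all would need to be decided):

rows = [['z'], ['u k'], ['u'], ['k'], ['l'], ['e'], ['i'], ['n']]
profile = {''.join(row): row for row in rows}

tokenize('zu klein', profile)   # ['z', 'u k', 'l', 'e', 'i', 'n']
# reading the Out cell u-k as ['u', 'k'] for that row would then yield the
# transliteration ['z', 'u', 'k', 'l', 'e', 'i', 'n']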

xrotwang commented 6 years ago

@bambooforest @tresoldi @LinguList thoughts?

LinguList commented 6 years ago

Yes, this is exactly a problem that caused me some headaches in the past, but which I ignored: as we insert spaces into non-spaced parts as a way of segmenting things, we end up with code that no longer maps n segments in to n segments out. This means we essentially lose the advantage of transparency. I'd have to digest your proposal further, and we should also think about all the consequences, i.e. WHERE this would yield changes (lingpy would be impacted as well, also the tutorial, etc.). So let's not rush this; let's make a list of the dependencies that NEED to be changed (and assign them to each other) before implementing this behaviour...

tresoldi commented 6 years ago

The current state is hacky, @xrotwang is right, and I would favor a system more in line with what he proposes (even knowing we'll always need some way to handle unexplainable exceptions, as this is natural language writing after all). Such orthographic profiles would not only be useful when using segments, but would be a more general attribute of a language/writing system, even allowing research on the topic.

That said, I understand @LinguList's position on not breaking things, but we might be facing a Python 2/3 situation here. Recent changes to the whole infrastructure (as in the current reworking of lexibank, which is bringing this and other issues to light) probably imply that some things will break, but keeping the material already produced is essential. What is the roadmap for a lingpy3? I don't really like the idea, but maybe we should have a stable (version 2) branch and an in-development (version 3) branch?

xrotwang commented 6 years ago

AFAICT my proposal wouldn't necessarily break anything - we can simply consider the current behaviour the default in the absence of a description file. The latest PR #35 already handles the data internally as lists, serializing to a string only upon output in Tokenizer.__call__. If " " is used as the default segment separator, nothing would break.
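
In other words, something like this would keep working unchanged (a sketch; the list-returning variant in the comment is purely hypothetical):

from segments import Tokenizer

t = Tokenizer()     # no profile / no description file
t('first word')     # 'f i r s t # w o r d' - exactly as today
# only with an explicit, list-aware column description would the same call
# be able to return [['f', 'i', 'r', 's', 't'], ['w', 'o', 'r', 'd']]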

fmatter commented 3 years ago

Just chiming in to let you all know that I've encountered precisely this issue in a use case.

LinguList commented 3 years ago

I think the implementation would not require too much effort, as it would only have to add a specific separator to the rules (which needs to be agreed upon, since - is often used so far). I would also be in favor of decreasing the "hackiness" here, and one reason is that scholars who see profiles for the first time often have a hard time understanding that we segment by putting things together in the left column and then separating them on the right by a space. Educationally, this also costs a lot of energy, judging from my experience over the past years.

bambooforest commented 2 years ago

Not sure if the updates discussed here would handle this case, @xrotwang, but for example:

s4 = 'k͡pãs'

t(s4)  # not good wrt IPA
'k͡ p ã s'

t(s4, ipa=True)  # looks good
'k͡p ã s'

print(segments.Profile.from_text(s4))  # expected default behavior
Grapheme  frequency  mapping
k͡         1          k͡
p          1          p
ã          1          ã
s          1          s

print(segments.Profile.from_text(t(s4, ipa=True)))  # not expected behavior because NFD + \X on 'k͡p' instead of list of graphemes to OP
Grapheme  frequency  mapping
           2
k͡         1          k͡
p          1          p
ã          1          ã
s          1          s