Open xrotwang opened 6 years ago
@bambooforest @tresoldi @LinguList thoughts?
Yes, this is exactly the problem that caused me some headaches in the past, but which I ignored: by inserting spaces into non-spaced parts as a way of segmenting things, we create code that no longer maps n segments in to n segments out. This means we essentially lose the advantage of transparency. I'd have to digest your proposal further, and we should also think through all the consequences of where this would yield changes (lingpy would be impacted, as would the tutorial, etc.). So let's not rush this: let's make a list of dependencies that need to be changed (and assign them to each other) before implementing this behaviour...
The current state is hacky, @xrotwang is right, and I would favor a system more in line with what he proposes (even knowing we'll always need some way to handle otherwise unexplainable exceptions, as this is natural-language writing after all). Such orthographic profiles would not only be useful when using segments, but would be a more general attribute of a language/writing system, even allowing research on the topic.
That said, I understand @LinguList's position on not breaking things, but we might be facing a Python 2/3 situation here. Recent changes to the whole infrastructure (as in the current reworking of lexibank, which is bringing this and other issues to light) probably imply that some things will break, but keeping the material already produced is essential. What is the roadmap for a lingpy 3? I don't really like the idea, but maybe we should have a stable (version 2) branch and an in-development (version 3) branch?
AFAICT my proposal wouldn't necessarily break anything - we can simply treat the current behaviour as the default, in the absence of a description file. The latest PR #35 already handles the data internally as lists, serializing to string only upon output in `Tokenizer.__call__`. If `" "` is used as the default segment separator, nothing would break.
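To make the backward-compatibility argument concrete, here is a minimal sketch (the `tokenize` helper below is hypothetical, not the actual `segments` API): if the tokenizer works on lists internally and only joins with the separator on output, a space default reproduces the current string behaviour exactly.

```python
# Hypothetical sketch: work on lists internally, serialize only on output.
def tokenize(word, separator=' '):
    # Internally the data is a list of segments (here just characters,
    # for illustration; the real tokenizer applies an orthography profile).
    segments_list = list(word)
    # Serialization to a string happens only at the very end:
    return separator.join(segments_list)

# With ' ' as the default separator, the existing string-based
# behaviour is unchanged:
print(tokenize('abc'))  # -> 'a b c'
```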
Just chiming in to let you all know that I've encountered precisely this issue in a use case.
I think the implementation would not require too much effort, as it would only have to add a specific separator to the rules (which needs to be agreed upon, since `-` is often used so far). I would also be in favor of decreasing the "hackiness" here, one reason being that scholars who see profiles for the first time often have a hard time understanding that we segment by putting things together in the left column and then separating them on the right by a space. Educationally, this also costs a lot of energy, judging from my experience over the past years.
Not sure if the updates discussed here would handle this case, @xrotwang, but for example:
```python
>>> s4 = 'k͡pãs'
>>> t(s4)  # not good wrt IPA
'k͡ p ã s'
>>> t(s4, ipa=True)  # looks good
'k͡p ã s'
>>> print(segments.Profile.from_text(s4))  # expected default behavior
Grapheme  frequency  mapping
k͡         1          k͡
p         1          p
ã         1          ã
s         1          s
>>> # not expected behavior because NFD + \X on 'k͡p' instead of list of graphemes to OP
>>> print(segments.Profile.from_text(t(s4, ipa=True)))
Grapheme  frequency  mapping
          2
k͡         1          k͡
p         1          p
ã         1          ã
s         1          s
```
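For context on why `'k͡p'` ends up split in the example above: the tie bar is a combining character, and grapheme-cluster segmentation attaches combining marks to the *preceding* base letter. A stdlib-only illustration (this only demonstrates the Unicode mechanics, not the `segments` internals):

```python
import unicodedata

s = 'k\u0361p'  # 'k͡p': k + COMBINING DOUBLE INVERTED BREVE + p
print([unicodedata.name(c) for c in s])
# ['LATIN SMALL LETTER K', 'COMBINING DOUBLE INVERTED BREVE',
#  'LATIN SMALL LETTER P']

# A grapheme-cluster segmentation (like the \X pattern of the regex
# module) groups each combining mark with the preceding base character,
# so 'k\u0361p' clusters as ['k\u0361', 'p'] -- the ligature is torn
# apart unless a profile declares 'k\u0361p' as a single grapheme.
```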
Handling of multi-segment graphemes is too intransparent. Whenever I look at the `segments` code, I find it difficult to wrap my head around the layers of string splitting/concatenating. To me it would seem natural that, internally, the data the tokenizer creates are lists of lists.

And whenever I try to implement it this way, I hit a wall, because the fact that even internally the data looks like space-joined strings is actually exploited (and relied upon) in the orthography profiles: to specify a grapheme that is to be split into two segments, you can map it to a space-separated string in the profile.

But that's cheating, or hacky, because when the parser encounters `sch`, it should be appending two segments to the output, but instead it appends one "segment" `s ch`, which just happens to look exactly like two segments in the output.

Even worse, there is no way to specify special cases using multi-segment graphemes in profiles. E.g., to differentiate the segmentation of `sch` in German "bisschen" from "naschen", one has to resort to mappings that again smuggle in spaces. Wouldn't it be cool if the same could be had with a profile that lists the segments of a mapping explicitly?
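Since the original profile snippets did not survive extraction, here is a hedged sketch of the current "hacky" approach: the mapping column smuggles a space into a single "segment", and greedy longest-match tokenization makes the output merely *look* segmented. Profile contents and the `tokenize` helper are illustrative, not the actual `segments` implementation:

```python
# Illustrative profile (hypothetical data): mapping strings may contain
# spaces, so one "segment" pretends to be two.
PROFILE = {
    'ssch': 's ch',  # 'bisschen': s + ch, faked via a space in ONE mapping
    'sch': 'sch',    # 'naschen': a single segment
}

def tokenize(word, profile=PROFILE):
    """Greedy longest-match tokenizer (simplified sketch)."""
    out, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in profile:
                out.append(profile[chunk])  # ONE entry, even with a space inside
                i += length
                break
        else:  # no profile rule matched: pass the character through
            out.append(word[i])
            i += 1
    return out

print(tokenize('bisschen'))  # -> ['b', 'i', 's ch', 'e', 'n']
print(tokenize('naschen'))   # -> ['n', 'a', 'sch', 'e', 'n']
```

Joined with spaces, both words come out looking correctly segmented, but internally `'bisschen'` has only five "segments", one of which is `'s ch'`.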
I think, with CSVW and the nice separator property, multi-segment graphemes could be handled fully transparently:
The profile above could be described by a CSVW table description that declares a `separator` for the mapping column. Then the parser would read the `mapping` cell of each row as a list of segments rather than as a single string.
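A plausible sketch of such a CSVW metadata description (the column names and overall structure are assumptions for illustration, not the issue's original snippet; `separator` is a standard CSVW column property):

```json
{
  "tableSchema": {
    "columns": [
      {"name": "Grapheme", "datatype": "string"},
      {"name": "mapping", "datatype": "string", "separator": " "}
    ]
  }
}
```

With `"separator": " "`, a conforming CSVW reader delivers each `mapping` cell as a list of strings, so no ad-hoc string splitting is needed downstream.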
Processing would happen as follows: the parser matches `''.join(grapheme)` against the input and, for a matching `grapheme`, appends the corresponding mapping (which is already a list of segments) to the output.

With this scheme, we could even have cross-word graphemes, e.g.
an entry whose grapheme contains a space would tokenize `zu klein` as `['z', 'u k', 'l', 'e', 'i', 'n']` and transliterate it as `['z', 'u', 'k', 'l', 'e', 'i', 'n']`. While somewhat artificial, this could be used to deal with degenerate cases in lexibank, where we sometimes get multi-word expressions when we only expect a single lexeme.
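A hedged end-to-end sketch of the proposed scheme, including the cross-word case (the profile and the `tokenize` helper are illustrative, not the actual `segments` API): because mappings arrive as lists (as CSVW's `separator` would deliver them), the parser appends `len(mapping)` segments per matched grapheme, with no string-splitting tricks.

```python
# Illustrative profile (hypothetical data): each grapheme maps to a pair
# of segment LISTS -- one for tokenization, one for transliteration.
PROFILE = {
    'u k': (['u k'], ['u', 'k']),  # hypothetical cross-word grapheme
}

def tokenize(word, profile=PROFILE, column=0):
    """Greedy longest-match tokenizer over list-valued mappings."""
    out, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in profile:
                # Append as many segments as the mapping contains:
                out.extend(profile[chunk][column])
                i += length
                break
        else:  # no rule matched: pass the character through
            out.append(word[i])
            i += 1
    return out

print(tokenize('zu klein'))            # -> ['z', 'u k', 'l', 'e', 'i', 'n']
print(tokenize('zu klein', column=1))  # -> ['z', 'u', 'k', 'l', 'e', 'i', 'n']
```

This reproduces the tokenization and transliteration claimed above while keeping every intermediate value a genuine list of segments.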