xrotwang closed this issue 6 years ago.
I'll take a look.
Can you give the value of `segments` which is causing this?
[u'r\u0292', u'_\u0329', u'd']
somewhere in the cals dataset.
While CLTS should fail in a proper way, the logic is correct: there is no base sound for a Marker -- that is why it is a marker. The logic is fine in the previous line (transcriptionsystem.py:252), where we add the marker representation to the grapheme, but there is nothing to add here.
The most obvious way to solve the bug is to only add `base_sound.s` if we have, in fact, a Sound, either passing silently in the case of Markers or throwing a proper ValueError exception. I am not entirely sure what the intended behavior is here; pinging @LinguList on the matter.
@xrotwang in any case, it seems that some data is not handled correctly in the cals dataset. While one could have a syllabic sibilant (English "Shh!" for "shut up!"), this looks more like a syllabic /r/ that somewhat got lost (and, in any case, /rʒ/ is... rather strange).
@tresoldi maybe the segmentation is already a bit off - it's `lingpy.sequence.sound_classes.clean_string`. Maybe I should try switching to a bipa orthography profile first.
@tresoldi the original form is "rʒ ̩d", and the space is interpreted as a word boundary in `clean_string`, I guess. This introduces the `_`, to which the diacritic then gets attached?
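That effect can be re-enacted with a small self-contained sketch (a hypothetical illustration, not the actual lingpy code): if the space becomes the word-boundary marker `_` and combining marks are glued onto the preceding token, the syllabicity diacritic lands on the marker.

```python
import unicodedata

# The reported form is "rʒ ̩d": the syllabicity diacritic U+0329 follows
# a space, so it has no base character to attach to.
form = "r\u0292 \u0329d"

# Hypothetical re-enactment: replace the space with the word-boundary
# marker "_", then attach combining marks to the preceding token.
tokens = []
for ch in form.replace(" ", "_"):
    if unicodedata.combining(ch) and tokens:
        tokens[-1] += ch  # the diacritic ends up on the marker
    else:
        tokens.append(ch)

print(tokens)  # ['r', 'ʒ', '_̩', 'd']
```

This matches the reported `segments` value, with `u'_\u0329'` as the marker-plus-diacritic token.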
I'd be very curious to see how well bipa-ortho-profiles work anyway. One could also feed the profiles all the data that is already segmented (and, in the wild, with errors) within the BDPA database (in fact: BDPA is already linked to CLTS). So we're talking about some huge 8000-segment profile. I'm really curious to see how well it behaves.
@xrotwang it makes sense phonologically, even though the lone /d/ is a bit strange.
`clean_string()` is apparently at fault here (even though I wouldn't blame it -- as a human I'd probably throw an exception here ;) ), but we still need to decide what to do when we receive such a marker: fail with more information, or pass silently? I'd vote for the second as the general behavior (better to have some partially wrong data coming from unusual input than to fail), but the first when cleaning/preparing datasets (I do want to find all problems).
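That two-mode policy could be expressed with a small helper (hypothetical, not part of the CLTS API): warn and pass through in general use, raise in strict mode when preparing a dataset.

```python
import logging

log = logging.getLogger("segmentation")

def handle_unexpected_marker(grapheme, strict=False):
    """Hypothetical policy hook: fail loudly in strict mode
    (dataset preparation), warn and pass through otherwise."""
    if strict:
        raise ValueError(f"unexpected marker in input: {grapheme!r}")
    log.warning("unexpected marker %r, passing it through", grapheme)
    return grapheme

# General use: keep going with partially wrong data.
assert handle_unexpected_marker("_\u0329") == "_\u0329"
```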
As to the bigger issue of clts, I would say the following: the behavior is clearly a bug in clts. Markers are a limited set of symbols, I'd say, for which we should not do any parsing; but apparently the algorithm tries to parse them, so this should be caught from within the parsing code.
This case is even more complex, because the grapheme for Marker actually has a diacritic of its own (maybe this is supposed to actually be a null phoneme?).
If you are ok with it, I can add an `if isinstance(sound, Marker): raise ValueError` test, so we at least fail more properly.
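A minimal sketch of that guard, with stand-in classes (the real Sound/Marker types live in pyclts; this only illustrates the proposed `isinstance` check):

```python
# Stand-in classes for illustration only.
class Sound:
    def __init__(self, grapheme):
        self.grapheme = grapheme

class Marker(Sound):
    """A closed set of symbols that should never be parsed further."""

def base_sound_of(sound):
    # Fail properly: a marker has no base sound by definition.
    if isinstance(sound, Marker):
        raise ValueError(f"no base sound for marker {sound.grapheme!r}")
    return sound
```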
Since we assume that Markers are all defined and never generated, adding an if-statement after line 228 of transcriptionsystem.py, namely

```python
pre, mid, post = nstring.partition(nstring[match[0].start():match[0].end()])
base_sound = self.sounds[mid]
if base_sound.type == "Marker":
    return UnknownSound(grapheme=nstring, source=string, ts=self)
```

would just render the Marker plus diacritic as an unknown sound, which would be fine with me.
sorry: `if base_sound.type == 'marker'`
As markers are usually only one symbol in length, they should not be parsed further. Or am I missing something here?
Using `sounds.tsv` as an orthography profile without further knowledge of whether this is applicable doesn't make too much sense, I guess - although it does something. Here's what the list of words with invalid graphemes starts with:
ID | LANGUAGE | CONCEPT | FORM | SEGMENTS |
---|---|---|---|---|
105 | unovmetan3 | two | ikkʲe | i kk |
107 | unovmetan3 | when | qaʧɔn | q a |
113 | unovmetan3 | liver | ʤigar | |
119 | unovmetan3 | hair | sɔʧ | s ɔ |
137 | unovmetan3 | mountain | tɔɣ˳ | t ɔ ɣ |
158 | unovmetan3 | dust | ʧaŋ | |
159 | unovmetan3 | knee | buʤilak | b u |
170 | unovmetan3 | left | ʧap | |
178 | kiorday4 | snow | qɒr̯ | q ɒ r |
That is what I was thinking: are they usually one symbol, or always one by definition? Second question: shouldn't we at least warn that this is unexpected?
Is that the 8000-line sounds.tsv? This IS interesting, as it shows the problem of the obvious ambiguity of orthoprofiles: we have `kk` similar to `k:` and `k` plus superscript `j`, but no `kkj`, etc. Although it SHOULD find ʤigar and the like.
Maybe you should use `graphemes.tsv` instead (reducing identical ones).
Ah yes, that's better (although there are tons of duplicate graphemes in there):
ID | LANGUAGE | CONCEPT | FORM | SEGMENTS |
---|---|---|---|---|
1027 | usoymahalla2 | mountain | tɔɣ˳ | t ɔ ɣ |
105 | unovmetan3 | two | ikkʲe | i kk |
1086 | usoymahalla3 | good | jaxSə | j a x |
1175 | usoymahalla3 | two | ikkʲe | i kk |
1208 | usoymahalla3 | mountain | tɔɣ˳ | t ɔ ɣ |
1214 | usoymahalla3 | animal | hajwɔn | h a |
1266 | usoymahalla1 | good | jaxS1 | j a x |
1310 | usoymahalla1 | leaf | barg˳ | b a r g |
1357 | usoymahalla1 | two | ikkʲe | i kk |
137 | unovmetan3 | mountain | tɔɣ˳ | t ɔ ɣ |
1376 | usoymahalla1 | seed | uruɣ˳ | u r u ɣ |
1419 | usoymahalla1 | green | zangʲɔrɪ | z a ng |
1446 | kikulanak1 | good | dZaqS1 | d |
1484 | kikulanak1 | year | ʤɪł | ʤ ɪ |
1525 | kikulanak1 | one | bir̯ | b i r |
1526 | kikulanak1 | toeat | ʤʲɛ | ʤ |
In fact, I think that all lexibank datasets for which we MAKE an orthography profile that actually works should be used to feed the bad-ortho-profile thing. The only problem is cases of explicit segmentation, which are often mixed in. Thus, a line where I segment, for example,
scheibe : sch ei b ə
may not be the one we want at a later stage. The problem is that orthoprofiles mix two aspects: explicit segmentation by replacement and segmentation by listing valid graphemes (grapheme = symbol sequence representing a sound). If they only listed graphemes, we could grow our little database of graphemes out of the data we annotate for lexibank.
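A profile that only lists valid graphemes boils down to longest-match segmentation. A toy sketch of that idea (not the actual `segments` package; the form and inventory are made up so the graphemes cover the whole word):

```python
def segment(form, graphemes):
    """Greedy longest-match segmentation against a grapheme inventory."""
    maxlen = max(len(g) for g in graphemes)
    out, i = [], 0
    while i < len(form):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(maxlen, len(form) - i), 0, -1):
            if form[i:i + size] in graphemes:
                out.append(form[i:i + size])
                i += size
                break
        else:
            out.append(form[i] + "\ufffd")  # flag an unknown grapheme
            i += 1
    return out

# Using "scheibə" so the listed graphemes cover the whole form.
print(segment("scheib\u0259", {"sch", "ei", "b", "\u0259"}))
# ['sch', 'ei', 'b', 'ə']
```

Growing such a grapheme inventory from annotated lexibank data would then be a matter of collecting the segment column, without any replacement rules.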
Well, with my proposal at https://github.com/cldf/segments/issues/34 this would be the case, right?
@LinguList your example would then appear in the profile as
sch-ei-b-ə
yes!
This is another example for the usefulness of the new proposal.
I've just hit this problem in grollemundbantu..
well... grollemundbantu is among the worst cases of segmentation, right? They have tons of different orthographies there.
What I'd say for now is: eyeballing the results, the one-ortho-profile-to-segment-them-all approach is less feasible than clean_string, as it assumes that all segments are known, while clean_string leaves consonants undefined (a large number of segments). The "+" marker instance, if not occurring alone, should be converted to an unknown sound (as indicated in my proposal for re-coding above). Whether the base-profile-for-segmentation can be used in the future depends on the growth of our lexibank datasets and the expert-judged ortho-profiles. If some 50000 segments are NOT enough to yield segmentations of similar quality to clean_string, we may hold on to that approach for much longer than thought (or always keep it as an alternative solution).
I like how optimistic you are that this is the worst case scenario! I'm sure we'll find worse...