cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Marker object has no attribute s #110

Closed xrotwang closed 6 years ago

xrotwang commented 6 years ago
  File "lexibank/pylexibank/src/pylexibank/lingpy_util.py", line 66, in test_sequence
    bipa_analysis = [BIPA[s] for s in segments]
  File "lexibank/local/lib/python2.7/site-packages/pyclts/models.py", line 47, in __getitem__
    return self.resolve_sound(sound)
  File "lexibank/local/lib/python2.7/site-packages/pyclts/transcriptionsystem.py", line 282, in resolve_sound
    return self._parse(string)
  File "lexibank/local/lib/python2.7/site-packages/pyclts/transcriptionsystem.py", line 254, in _parse
    sound += base_sound.s
AttributeError: 'Marker' object has no attribute 's'
tresoldi commented 6 years ago

I'll take a look.

tresoldi commented 6 years ago

Can you give the value of segments which is causing this?

xrotwang commented 6 years ago
[u'r\u0292', u'_\u0329', u'd']

somewhere in the cals dataset.

tresoldi commented 6 years ago

While CLTS should fail in a proper way, the logic is correct: there is no base sound for a Marker -- that is why it is a marker. The logic is fine on the previous line (transcriptionsystem.py:252), where we add the marker representation to the grapheme, but here there is nothing to add.

The most obvious way to fix the bug is to only add base_sound.s if we do, in fact, have a Sound, either passing silently in the case of Markers or throwing a nice ValueError exception. I am not entirely sure what the intended behavior is here; pinging @LinguList on the matter.
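Roughly, the check could look like this (a fragment sketch only; base_sound, sound, string and nstring are the local names visible in the traceback and in the snippet quoted later in this thread, and the strict switch is hypothetical, not an existing pyclts parameter):

    # Sketch only: base_sound is the result of the sound lookup inside _parse
    # (pyclts/transcriptionsystem.py); Marker is the pyclts model class from
    # the traceback. The 'strict' flag is hypothetical.
    if isinstance(base_sound, Marker):
        if strict:
            # option 1: throw a nice ValueError
            raise ValueError('cannot attach diacritics to marker in %r' % string)
        # option 2: pass silently, i.e. skip adding a base representation
    else:
        sound += base_sound.s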

tresoldi commented 6 years ago

@xrotwang in any case, it seems that some data is not handled correctly in the cals dataset. While one could have a syllabic sibilant (English "Shh!" for "shut up!"), this looks more like a syllabic /r/ that somehow got lost (and, in any case, /rʒ/ is... rather strange).

xrotwang commented 6 years ago

@tresoldi maybe the segmentation is already a bit off - it's lingpy.sequence.sound_classes.clean_string. Maybe I should try switching to a bipa orthography profile first.

xrotwang commented 6 years ago

@tresoldi the original form is "rʒ ̩d", and the space is interpreted as a word boundary in clean_string, I guess. This introduces the _, to which the diacritic then gets attached?
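For illustration, a hypothetical reconstruction of what the reported segments suggest (codepoints taken from the values posted above, the interpretation of clean_string is a guess):

    form = u"r\u0292 \u0329d"                   # the original form "rʒ ̩d"
    segments = [u"r\u0292", u"_\u0329", u"d"]   # what clean_string returned
    # The space seems to be replaced by the word-boundary marker "_", and the
    # combining syllabic diacritic (U+0329) that followed the space stays
    # attached to that marker rather than to the /r/.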

LinguList commented 6 years ago

I'd be very curious to see how well bipa ortho-profiles work anyway. One could also feed the profiles all the data that is already segmented (in the wild, with errors) in the BDPA database (in fact, BDPA is already linked to CLTS). So we're talking about some huge 8000-segment profile. I'm really curious to see how well it behaves.

tresoldi commented 6 years ago

@xrotwang it makes sense phonologically, even though the lone /d/ is a bit strange.

clean_string() is apparently at fault here (even though I wouldn't blame it -- as a human I'd probably throw an exception here too ;) ), but we still need to decide what to do when we receive such a marker: fail with more information, or pass silently? I'd vote for the second as general behavior (better to have some partially wrong data coming from an unusual input than to fail), but the first when cleaning/preparing datasets (there I do want to find all problems).

LinguList commented 6 years ago

As to the bigger issue in clts, I would say the following: the behavior is clearly a bug in clts. Markers are a limited set of symbols, I'd say, which we should not try to parse at all, but apparently the algorithm does try to parse them, so this should be caught from within the parsing code.

tresoldi commented 6 years ago

This case is even more complex, because the grapheme for Marker actually has a diacritic of its own (maybe this is supposed to actually be a null phoneme?).

If you are ok with it, I can add an if isinstance(sound, Marker): raise ValueError test, so we at least fail more properly.

LinguList commented 6 years ago

Since we assume that Markers are all defined and never generated, adding an if-statement after line 228 of transcriptionsystem.py, namely

        pre, mid, post = nstring.partition(nstring[match[0].start():match[0].end()])
        base_sound = self.sounds[mid]
        if base_sound.type == "Marker": 
            return UnknownSound(grapheme=nstring, source=string, ts=self)

this would just render the Marker plus diacritic as an unknown sound, which would be fine with me.

LinguList commented 6 years ago

sorry: if base_sound.type == 'marker'
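For clarity, the amended lines in full (just the snippet quoted above with the lowercase type string; still a sketch of the proposed change, not committed code):

    pre, mid, post = nstring.partition(nstring[match[0].start():match[0].end()])
    base_sound = self.sounds[mid]
    # markers are assumed to be fully defined and never generated, so stop
    # parsing here and return the whole grapheme as an unknown sound
    if base_sound.type == 'marker':
        return UnknownSound(grapheme=nstring, source=string, ts=self)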

LinguList commented 6 years ago

As markers are usually only one symbol in length, they should not be parsed further. Or am I missing something here?

xrotwang commented 6 years ago

Using sounds.tsv as an orthography profile, without further knowledge of whether this is applicable, doesn't make too much sense, I guess - although it does something. Here's what the list of words with invalid graphemes starts with:

ID LANGUAGE CONCEPT FORM SEGMENTS
105 unovmetan3 two ikkʲe i kk e
107 unovmetan3 when qaʧɔn q a ɔ n
113 unovmetan3 liver ʤigar i g a r
119 unovmetan3 hair sɔʧ s ɔ
137 unovmetan3 mountain tɔɣ˳ t ɔ ɣ
158 unovmetan3 dust ʧaŋ a ŋ
159 unovmetan3 knee buʤilak b u i l a k
170 unovmetan3 left ʧap a p
178 kiorday4 snow qɒr̯ q ɒ r
tresoldi commented 6 years ago

That is what I was wondering: are they usually one symbol, or always one symbol by definition? Second question: shouldn't we at least warn that this is unexpected?

LinguList commented 6 years ago

Is that the 8000-line sounds.tsv? This IS interesting, as it shows the problem of obvious ambiguity in orthography profiles: we have kk similar to k: and k plus superscript j, but no kkj, etc. Although it SHOULD find ʤigar and the like.

Maybe you should use graphemes.tsv instead (reducing identical ones).
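If it helps, a rough way to collapse those duplicates before using the file as a profile (a sketch only; the GRAPHEME column name is an assumption about graphemes.tsv, not checked against the actual file):

    import csv

    seen, rows = set(), []
    with open('graphemes.tsv', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            # keep only the first occurrence of each grapheme
            if row['GRAPHEME'] not in seen:
                seen.add(row['GRAPHEME'])
                rows.append(row)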

xrotwang commented 6 years ago

Ah yes, that's better (although there are tons of duplicate graphemes in there):

ID LANGUAGE CONCEPT FORM SEGMENTS
1027 usoymahalla2 mountain tɔɣ˳ t ɔ ɣ
105 unovmetan3 two ikkʲe i kk e
1086 usoymahalla3 good jaxSə j a x S ə
1175 usoymahalla3 two ikkʲe i kk e
1208 usoymahalla3 mountain tɔɣ˳ t ɔ ɣ
1214 usoymahalla3 animal hajwɔn h a jw ɔ n
1266 usoymahalla1 good jaxS1 j a x S 1
1310 usoymahalla1 leaf barg˳ b a r g
1357 usoymahalla1 two ikkʲe i kk e
137 unovmetan3 mountain tɔɣ˳ t ɔ ɣ
1376 usoymahalla1 seed uruɣ˳ u r u ɣ
1419 usoymahalla1 green zangʲɔrɪ z a ng ɔ r ɪ
1446 kikulanak1 good dZaqS1 d Z a q S 1
1484 kikulanak1 year ʤɪł ʤ ɪ ł
1525 kikulanak1 one bir̯ b i r
1526 kikulanak1 toeat ʤʲɛ ʤ ɛ
LinguList commented 6 years ago

In fact, I think that all lexibank datasets for which we MAKE an orthography profile that actually works should be used to feed into the bad ortho-profile thing. The only problem is cases of explicit segmentation, which are often mixed in. Thus, a line where I segment, for example,

scheibe : sch ei b ə

may not be the one we want at a later stage. The problem is that orthoprofiles mix two aspects: explicit segmentation by replacement and segmentation by listing valid graphemes (grapheme = symbol sequence representing a sound). If they only listed graphemes, we could grow our little database of graphemes out of the data we annotate for lexibank.
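To illustrate what "just listing graphemes" buys us, here is a toy longest-match segmenter over a set of valid graphemes (a sketch only, not the actual orthography-profile machinery):

    def segment(word, graphemes):
        """Toy greedy longest-match segmentation against a set of valid graphemes."""
        result, i = [], 0
        while i < len(word):
            # try the longest candidate grapheme starting at position i
            for size in range(len(word) - i, 0, -1):
                if word[i:i + size] in graphemes:
                    result.append(word[i:i + size])
                    i += size
                    break
            else:
                # unknown symbol: keep it flagged instead of failing outright
                result.append(word[i] + '?')
                i += 1
        return result

    print(segment('scheibe', {'sch', 'ei', 'b', 'e'}))  # ['sch', 'ei', 'b', 'e']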

xrotwang commented 6 years ago

Well, with my proposal at https://github.com/cldf/segments/issues/34 this would be the case, right?

xrotwang commented 6 years ago

@LinguList your example would then appear in the profile as

sch-ei-b-ə
LinguList commented 6 years ago

yes!

LinguList commented 6 years ago

This is another example for the usefulness of the new proposal.

SimonGreenhill commented 6 years ago

I've just hit this problem in grollemundbantu..

LinguList commented 6 years ago

well... grollemundbantu is among the worst cases of segmentation, right? They have tons of different orthographies there.

What I'd say for now is: eyeballing it, the one-ortho-profile-to-segment-them-all approach is less feasible than clean_string, as it assumes that all segments are known, while clean_string leaves consonants undefined (a large number of segments).

The "+" marker instance, if not occurring alone, should convert to an unknown sound (as indicated in my proposal for re-coding above).

Whether the base-profile-for-segmentation can be used in the future depends on the growth of our lexibank datasets and the expert-judged ortho-profiles. If some 50000 segments are NOT enough to yield segmentations of similar quality to those provided by clean_string, we may hold on to the clean_string approach for much longer than thought (or keep it around indefinitely as an alternative solution).

SimonGreenhill commented 6 years ago

I like how optimistic you are that this is the worst case scenario! I'm sure we'll find worse...