xrotwang closed this issue 6 years ago.
I'll take a look.
Can you give the value of `segments` which is causing this?
[u'r\u0292', u'_\u0329', u'd']
somewhere in the cals dataset.
While CLTS should fail in a proper way, the logic is correct: there is no base sound for a Marker -- that is why it is a marker. The logic is fine in the previous line (transcriptionsystem.py:252), where we add the marker representation to the grapheme, but there is nothing to add here.
The most obvious way to solve the bug is to only add `base_sound.s` if we have, in fact, a Sound, either passing silently in the case of Markers or throwing a proper ValueError exception. I am not entirely sure what the intended behavior is here; pinging @LinguList on the matter.
@xrotwang in any case, it seems that some data is not handled correctly in the cals dataset. While one could have a syllabic sibilant (English "Shh!" for "shut up!"), this looks more like a syllabic /r/ that somewhat got lost (and, in any case, /rʒ/ is... rather strange).
@tresoldi maybe the segmentation is already a bit off - it's `lingpy.sequence.sound_classes.clean_string`. Maybe I should try switching to a bipa orthography profile first.
@tresoldi the original form is "rʒ ̩d", and the space is interpreted as a word boundary in `clean_string`, I guess. This introduces the `_`, to which the diacritic then gets attached?
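That effect can be re-enacted with a small self-contained sketch (a hypothetical illustration, not the actual lingpy code): if the space becomes the word-boundary marker `_` and combining marks are glued onto the preceding token, the syllabicity diacritic lands on the marker.

```python
import unicodedata

# The reported form is "rʒ ̩d": the syllabicity diacritic U+0329 follows
# a space, so it has no base character to attach to.
form = "r\u0292 \u0329d"

# Hypothetical re-enactment: replace the space with the word-boundary
# marker "_", then attach combining marks to the preceding token.
tokens = []
for ch in form.replace(" ", "_"):
    if unicodedata.combining(ch) and tokens:
        tokens[-1] += ch  # the diacritic ends up on the marker
    else:
        tokens.append(ch)

print(tokens)  # ['r', 'ʒ', '_̩', 'd']
```

This matches the reported `segments` value, with `u'_\u0329'` as the marker-plus-diacritic token.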
I'd be very curious to see how well bipa-ortho-profiles work anyway. One could also feed the profiles all the data that is already segmented (and, in the wild, with errors) within the BDPA database (in fact: BDPA is already linked to CLTS). So we're talking about some huge 8000-segment profile. I'm really curious to see how well it behaves.
@xrotwang it makes sense phonologically, even though the lone /d/ is a bit strange.
`clean_string()` is apparently at fault here (even though I wouldn't blame it -- as a human I'd probably throw an exception here ;) ), but we still need to decide what to do when we receive such a marker: fail with more information, or pass silently? I'd vote for the second as the general behavior (better to have some partially wrong data coming from unusual input than to fail), but the first when cleaning/preparing datasets (I do want to find all problems).
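That two-mode policy could be expressed with a small helper (hypothetical, not part of the CLTS API): warn and pass through in general use, raise in strict mode when preparing a dataset.

```python
import logging

log = logging.getLogger("segmentation")

def handle_unexpected_marker(grapheme, strict=False):
    """Hypothetical policy hook: fail loudly in strict mode
    (dataset preparation), warn and pass through otherwise."""
    if strict:
        raise ValueError(f"unexpected marker in input: {grapheme!r}")
    log.warning("unexpected marker %r, passing it through", grapheme)
    return grapheme

# General use: keep going with partially wrong data.
assert handle_unexpected_marker("_\u0329") == "_\u0329"
```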
As to the bigger issue of clts, I would say the following: the behavior is clearly a bug in clts. Markers are a limited set of symbols, I'd say, for which we should not do any parsing; but apparently the algorithm tries to parse them, so this should be caught from within the parsing code.
This case is even more complex, because the grapheme for Marker actually has a diacritic of its own (maybe this is supposed to actually be a null phoneme?).
If you are ok with it, I can add an `if isinstance(sound, Marker): raise ValueError` test, so we at least fail more properly.
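A minimal sketch of that guard, with stand-in classes (the real Sound/Marker types live in pyclts; this only illustrates the proposed `isinstance` check):

```python
# Stand-in classes for illustration only.
class Sound:
    def __init__(self, grapheme):
        self.grapheme = grapheme

class Marker(Sound):
    """A closed set of symbols that should never be parsed further."""

def base_sound_of(sound):
    # Fail properly: a marker has no base sound by definition.
    if isinstance(sound, Marker):
        raise ValueError(f"no base sound for marker {sound.grapheme!r}")
    return sound
```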
Since we assume that Markers are all defined and never generated, adding an if-statement after line 228 of transcriptionsystem.py, namely

```python
pre, mid, post = nstring.partition(nstring[match[0].start():match[0].end()])
base_sound = self.sounds[mid]
if base_sound.type == "Marker":
    return UnknownSound(grapheme=nstring, source=string, ts=self)
```

would just render the Marker plus diacritic as an unknown sound, which would be fine with me.
sorry: `if base_sound.type == 'marker'`
As markers are usually only one symbol in length, they should not be parsed further. Or am I missing something here?
Using `sounds.tsv` as an orthography profile without further knowledge of whether this is applicable doesn't make too much sense, I guess - although it does something. Here's what the list of words with invalid graphemes starts with:
ID | LANGUAGE | CONCEPT | FORM | SEGMENTS |
---|---|---|---|---|
105 | unovmetan3 | two | ikkʲe | i kk |
107 | unovmetan3 | when | qaʧɔn | q a |
113 | unovmetan3 | liver | ʤigar | |
119 | unovmetan3 | hair | sɔʧ | s ɔ |
137 | unovmetan3 | mountain | tɔɣ˳ | t ɔ ɣ |
158 | unovmetan3 | dust | ʧaŋ | |
159 | unovmetan3 | knee | buʤilak | b u |
170 | unovmetan3 | left | ʧap | |
178 | kiorday4 | snow | qɒr̯ | q ɒ r |
That is what I was thinking: are they usually one symbol, or always one by definition? Second question: shouldn't we at least warn that this is unexpected?
Is that the 8000-line sounds.tsv? This IS interesting, as it shows the problem of the obvious ambiguity of orthoprofiles: we have `kk` similar to `k:` and `k` plus superscript `j`, but no `kkj`, etc. Although it SHOULD find ʤigar and the like.
Maybe you should use `graphemes.tsv` instead (reducing identical ones).
Ah yes, that's better (although there are tons of duplicate graphemes in there):
ID | LANGUAGE | CONCEPT | FORM | SEGMENTS |
---|---|---|---|---|
1027 | usoymahalla2 | mountain | tɔɣ˳ | t ɔ ɣ |
105 | unovmetan3 | two | ikkʲe | i kk |
1086 | usoymahalla3 | good | jaxSə | j a x |
1175 | usoymahalla3 | two | ikkʲe | i kk |
1208 | usoymahalla3 | mountain | tɔɣ˳ | t ɔ ɣ |
1214 | usoymahalla3 | animal | hajwɔn | h a |
1266 | usoymahalla1 | good | jaxS1 | j a x |
1310 | usoymahalla1 | leaf | barg˳ | b a r g |
1357 | usoymahalla1 | two | ikkʲe | i kk |
137 | unovmetan3 | mountain | tɔɣ˳ | t ɔ ɣ |
1376 | usoymahalla1 | seed | uruɣ˳ | u r u ɣ |
1419 | usoymahalla1 | green | zangʲɔrɪ | z a ng |
1446 | kikulanak1 | good | dZaqS1 | d |
1484 | kikulanak1 | year | ʤɪł | ʤ ɪ |
1525 | kikulanak1 | one | bir̯ | b i r |
1526 | kikulanak1 | toeat | ʤʲɛ | ʤ |
In fact, I think that all lexibank datasets for which we MAKE an orthography profile that actually works should be used to feed the bad-ortho-profile thing. The only problem is cases of explicit segmentation, which are often mixed in. Thus, a line where I segment, for example,
scheibe : sch ei b ə
may not be the one we want at a later stage. The problem is that orthoprofiles mix two aspects: explicit segmentation by replacement and segmentation by listing valid graphemes (grapheme = symbol sequence representing a sound). If they only listed graphemes, we could grow our little database of graphemes out of the data we annotate for lexibank.
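A profile that only lists valid graphemes boils down to longest-match segmentation. A toy sketch of that idea (not the actual `segments` package; the form and inventory are made up so the graphemes cover the whole word):

```python
def segment(form, graphemes):
    """Greedy longest-match segmentation against a grapheme inventory."""
    maxlen = max(len(g) for g in graphemes)
    out, i = [], 0
    while i < len(form):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(maxlen, len(form) - i), 0, -1):
            if form[i:i + size] in graphemes:
                out.append(form[i:i + size])
                i += size
                break
        else:
            out.append(form[i] + "\ufffd")  # flag an unknown grapheme
            i += 1
    return out

# Using "scheibə" so the listed graphemes cover the whole form.
print(segment("scheib\u0259", {"sch", "ei", "b", "\u0259"}))
# ['sch', 'ei', 'b', 'ə']
```

Growing such a grapheme inventory from annotated lexibank data would then be a matter of collecting the segment column, without any replacement rules.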
Well, with my proposal at https://github.com/cldf/segments/issues/34 this would be the case, right?
@LinguList your example would then appear in the profile as
sch-ei-b-ə
yes!
This is another example for the usefulness of the new proposal.
I've just hit this problem in grollemundbantu..
well... grollemundbantu is among the worst cases of segmentation, right? They have tons of different orthographies there.
What I'd say for now is: eyeballing the results, the one-ortho-profile-to-segment-them-all approach is less feasible than clean_string, as it assumes that all segments are known, while clean_string leaves consonants undefined (a large number of segments). The "+" marker instance, if not occurring alone, should be converted to an unknown sound (as indicated in my proposal for re-coding above). Whether the base-profile-for-segmentation can be used in the future depends on the growth of our lexibank datasets and the expert-judged ortho-profiles. If some 50000 segments are NOT enough to yield segmentations of similar quality to clean_string, we may hold on to that approach for much longer than thought (or always keep it as an alternative solution).
I like how optimistic you are that this is the worst case scenario! I'm sure we'll find worse...