cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org
14 stars 3 forks source link

Invalid feature appears in sound's featuredict #104

Closed xrotwang closed 3 years ago

xrotwang commented 3 years ago

I haven't pinpointed exactly how and where this happens, but there a couple of sounds with type=='consonant' and a feature tongue_root: retracted-tongue-root in their featuredict. According to features.json this is illegal.

xrotwang commented 3 years ago

Here's an example:

>>> clts.transcriptiondata('phoible')
<pyclts.transcriptiondata.TranscriptionData object at 0x7fc9c5823940>
>>> p = clts.transcriptiondata('phoible')
>>> p['voiced retracted-tongue-root velar stop consonant']
'ɡ̙'
>>> s = p.resolve_grapheme(p['voiced retracted-tongue-root velar stop consonant'])
>>> s.featuredict
{'raising': None, 'relative_articulation': None, 'friction': None, 'articulation': None, 'preceding': None, 'syllabicity': None, 'nasalization': None, 'palatalization': None, 'labialization': None, 'velarization': None, 'pharyngealization': None, 'glottalization': None, 'aspiration': None, 'duration': None, 'release': None, 'voicing': None, 'creakiness': None, 'breathiness': None, 'phonation': 'voiced', 'laminality': None, 
'tongue_root': 'retracted-tongue-root', 'place': 'velar', 'ejection': None, 'airstream': None, 'manner': 'stop'}
LinguList commented 3 years ago

tongue_root: retracted-tongue-root is not illegal, we defined it as

  "consonant": {
    "relative_articulation": [
        "centralized",
        "mid-centralized",
        "advanced",
        "retracted",
    "retracted-tongue-root"
    ],

in the features.json

or what is illegal about it?

xrotwang commented 3 years ago

But that definition says "relative_articulation: retracted-tongue-root" is ok - not "tongue_root: retracted-tongue-root". The latter is only ok for vowels.

LinguList commented 3 years ago

Ah, mierda. I think what may happen is that the lookup fails! Feature lookup starts from feature value names that are mapped to the feature types directly. So the "resolve_grapheme" probably looks up the wrong feature. But this would also mean that we will have similar problems with long, which are also defined for vowels and consonants? I remember, in any case, that our feature collection may NOT distinguish types for features.

Ah, definitely: length is a common feature of vowel and consonant. Retracted is assigned to TWO different feature types for vowels vs. consonants.

So we should add another feature type to consonants, or we should change the name of retracted_tongue_root.

@cormacanderson should probably help here.

Adding a new feature for consonants would force us to modfy pyclts code.

xrotwang commented 3 years ago

I don't think lookup is the problem. featuredict is determined by the input data, as far as I can tell. Will investigate a bit more.

LinguList commented 3 years ago

TranscriptionSystem._feature_values (a dict), in transcriptionsystems.py, I think it will map to the first type that is served, assuming that one feature value resolves to only one feature type, which would also be logical.

xrotwang commented 3 years ago

I think the problem is here:

$ grep "tongue_root:" pkg/transcriptionsystems/bipa/consonants.tsv 
r̠̙ voiced  alveolar    trill       tongue_root:retracted-tongue-root,relative_articulation:retracted   
k̙  voiceless   velar   stop        tongue_root:retracted-tongue-root   
g̙  voiced  velar   stop        tongue_root:retracted-tongue-root   
d̙  voiced  alveolar    stop        tongue_root:retracted-tongue-root   
d̪̙ voiced  dental  stop        tongue_root:retracted-tongue-root   
t̪̙ voiceless   dental  stop        tongue_root:retracted-tongue-root   
ʕ̙  voiced  pharyngeal  fricative       tongue_root:retracted-tongue-root   
s̙  voiceless   alveolar    fricative       tongue_root:retracted-tongue-root,airstream:sibilant    
ð̙  voiceless   dental  fricative       tongue_root:retracted-tongue-root   
s̙ˤ voiceless   alveolar    fricative       tongue_root:retracted-tongue-root,airstream:sibilant,glottalization:glottalized 
ð̙ˤ voiceless   dental  fricative       tongue_root:retracted-tongue-root,glottalization:glottalized
xrotwang commented 3 years ago

tongue_root is a feature not defined for consonants - but it is used here.

LinguList commented 3 years ago

Autsch, that is even worse.

LinguList commented 3 years ago

Because it means we have to add the feature to the consonants class in models.py, and we have to check the write order, when tongue_root should be applied, and this is always a bit tedious.

xrotwang commented 3 years ago

But couldn't we just replace tongue_root with relative_articulation here? Assuming the value is correct - just not the feature?

xrotwang commented 3 years ago

It's actually nice how I determined this problem. I added primary keys to features.csv and turned featuredict into a list of foreign keys into features.csv. Boom, 6 cases where the foreign key constraint failed :)

LinguList commented 3 years ago

:-)

The question is what this will do with TS._feature_dict.

In [1]: from pyclts import CLTS

In [2]: bipa = CLTS().bipa

In [3]: bipa._feature_values['retracted-tongue-root']
Out[3]: 'tongue_root'
LinguList commented 3 years ago

since TS._feature_values can only give one type for a value, we'd have a type clash here, since we have different types for vowels and consonants in the feature value retracted-tongue-root. Even if this does not create clashes, it could any time, couldn't it? The feature_values are important for parsing from feature names (like voiced bilabial XXX consonant).

xrotwang commented 3 years ago

Hm, I don't quite understand. Using (type, feature, value) as primary key for features, we don't have any clashes yet. But I also don't understand why retracted-tongue-root was used as value for relative_articulation for consonants. It seems to be bolted on, somehow - even has a tab as whitespace in the file - not spaces like all other values.

LinguList commented 3 years ago

the reason was that we were in a rush to add new values, and @cormacanderson suggested to just put it there, so I did not have to add yet another stupic consonant feature there.

LinguList commented 3 years ago

But it does obviously NOT make sense in the relative_articulation group, so it should be handled by adding a new feature to the consonant class in models.py anyway, just for consistency. Tests should show if my intuition that the _feature_values lookup could suffer and yield errors, holds, but even if this is not the case, it would not make sense to go on with this as is, right?

xrotwang commented 3 years ago

Ok. So the proper fix would be moving it into a tongue_root feature for consonants, right? It only affects 97 graphemes overall, so i guess this would still work as v2.1.0 :)

LinguList commented 3 years ago

Yes, that is a relief ;)

xrotwang commented 3 years ago

Ok. Will try to fix this tomorrow.

LinguList commented 3 years ago

I feel bad if you have to fix this, but I'd only have time to do so from Tuesday, I am afraid.

cormacanderson commented 3 years ago

A new tongue root feature for consonants with two values seems to me to be principled. I remember not liking it being in relative articulation at the time, but we were under time pressure. Let me know what if anything I can do @xrotwang

xrotwang commented 3 years ago

Yes, that would be analogous to vowels, too. Will see whether I can add this in a backwards compatible way.

Cormac Anderson @.***> schrieb am Sa., 17. Apr. 2021, 23:10:

A new tongue root feature for consonants with two values seems to me to be principled. I remember not liking it being in relative articulation at the time, but we were under time pressure. Let me know what if anything I can do @xrotwang https://github.com/xrotwang

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cldf-clts/clts/issues/104#issuecomment-821888103, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKEAHJJZSK3VIVMQ44TTJH2LNANCNFSM43DIELIQ .

xrotwang commented 3 years ago

It seems that this solution has been intended all along: Just fixing features.json made it work. pyclts already had tongue_root as feature for consonants. So we might even want to release 2.0.1 with just that fix, and roll the other changes into 2.1.0 ?

LinguList commented 3 years ago

Yes! So the problem then was: we were rushing, and the person copy-pasting that feature into features-json (I bet it was me), did not paste it into the right place :(

xrotwang commented 3 years ago

So yes to v2.0.1 - or is v2.1.0 ok right away? I guess, v.2.0.0 has only been used in two lexibank datasets so far - tuled and one more, I think. And if we'd update these to v2.0.1, we might as well update to v.2.1.0 right away?

LinguList commented 3 years ago

Agree!

xrotwang commented 3 years ago

I'll release pyclts today, and then put together a final PR for clts v2.1.0.