cldf-clts / soundvectors

MIT License

Specify all features in the vector in JSON #2

Closed LinguList closed 7 months ago

LinguList commented 9 months ago

The feature system you use, @arubehn, is not transparent from the code so far. My suggestion: make a JSON file, essentially a dictionary, that lists each feature plus some information about it. In general, with your data, it may be useful to use JSON, as it is easy to read from Python, and what you have is basically a look-up structure. It is a bit more verbose and requires a bit more hand-writing at first, but it is more transparent.

You can then read the JSON with the json library.
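A minimal sketch of what such a JSON look-up could look like. The layout of the JSON, the keys `name` and `description`, the descriptions themselves, and the helper `load_features` are assumptions made for illustration, not the project's actual format:

```python
import json

# Hypothetical content of a features.json file: abbreviation -> information.
FEATURES_JSON = """
{
    "syl": {"name": "syllabic", "description": "forms a syllable nucleus"},
    "cons": {"name": "consonantal", "description": "constriction in the vocal tract"}
}
"""


def load_features(text):
    """Parse JSON text into a plain dictionary (hypothetical helper)."""
    return json.loads(text)


features = load_features(FEATURES_JSON)
print(features["syl"]["name"])
```

For a real file, `json.load(open_file_object)` would replace `json.loads(text)`.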

LinguList commented 9 months ago

Other idea: just make a module with the features:

features.py

binary_features = {
    "syl": "syllabic",
    "cons": "consonantal",
    ...
}

binary_feature_list = ["cons", "syl", "son", "cont", "delrel", ...]

clts_features = {
    "type": {
        "consonant": {"syl": 0, "cons": 1, ...},
        "vowel": {"syl": 1, "cons": 0, ...},
        ....
    },
    "place": {
        "alveolar": {"ant": 1, "cor": 1, "distr": 0},
    }
}
LinguList commented 9 months ago

To import features then, you just type:

from clts2vec.features import clts_features
arubehn commented 9 months ago

I had actually already thought about it, and wanted to store the feature definitions in a JSON file, once we have defined them all. But I might already set up a stub later today or next Monday.

LinguList commented 9 months ago

Yes, as we are discussing abstract names here, no unicode symbols, I suggest to use pure Python now.

LinguList commented 9 months ago

And please do not use the implicit plus + / - for features, use boolean 1 / 0 or True / False, as shown above.

arubehn commented 9 months ago

Okay, but - would correspond to -1, since we also want to keep the option to render a feature non-applicable, which would then receive the value 0.

LinguList commented 9 months ago

Yes, just please do not use implicit things like + / - that you then need to check for by string matching, if you can check mathematical values directly.
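As a small illustration of the numeric convention discussed here (1 for [+feature], -1 for [-feature], 0 for non-applicable), the following sketch shows how values can be tested directly. The feature names and the two helper functions are assumptions chosen for the example:

```python
# Assumed convention: 1 = [+feature], -1 = [-feature], 0 = not applicable.
vec = {"cons": 1, "voi": -1, "nas": 0}


def positive_features(vec):
    """Names of all features with value [+]."""
    return sorted(k for k, v in vec.items() if v == 1)


def applicable_features(vec):
    """Names of all features that are specified at all ([+] or [-])."""
    return sorted(k for k, v in vec.items() if v != 0)


print(positive_features(vec))
print(applicable_features(vec))
```

No string comparison against "+" or "-" is needed, and the values can be placed in a numeric vector as they are.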

arubehn commented 9 months ago

I just created a module that contains literal Python dictionaries, as you had suggested. Is this roughly what you had in mind?

Also, there are some conditional feature mappings, and I am struggling with how to represent them without relying on implicit notations. For example, the glottal stop is [+cg], but neither glottal sounds nor stops are [+cg] by themselves. There are also some features that only apply to certain natural classes; "strident", for example, only applies to fricatives and affricates (and remains underspecified otherwise). Do you have an idea how to transparently represent conditional mappings? I presume that they will be much more important when it comes to encoding vowels, so it would be good to find a sensible representation now.

LinguList commented 9 months ago

Let me check :)

LinguList commented 9 months ago

How do you intend to parse and create the features from a feature string? I think this question is crucial now. I suppose, you want to start from some base features and from there you then want to get a first vector, which you then expand, based on additional features.

What you may of course also consider to do is to start from the feature set, or the "name" that we provide for clts.

So if you start with a simple example like:

>>> from pyclts import CLTS
>>> bipa = CLTS().bipa
>>> sound = bipa['ʔʰ']
>>> sound.name
'aspirated voiceless glottal stop consonant'
LinguList commented 9 months ago

For consonants (the type is something you should check explicitly, as this is so easy to do), you could start in the following way:

def lookup_features(sound_name, lookup):
    # first try the complete sound name
    if sound_name in lookup:
        return lookup[sound_name]
    parts = sound_name.split()
    # join the last three words, since a list cannot be used as a dictionary key
    base_sound = " ".join(parts[-3:])  # "glottal stop consonant" as a fictive base
    if base_sound in lookup:
        vec = lookup[base_sound].copy()  # copy, so the lookup table is not modified
        rest = parts[:-3]
    else:
        vec = lookup[parts[-1]].copy()  # "consonant"
        rest = parts[:-1]
    for itm in rest:
        new_vec = lookup[itm]
        update_vec(vec, new_vec)  # your way to combine here
    return vec
arubehn commented 9 months ago

Yes, the workflow would be the same as before. I start by setting up a feature vector with only zeros, and subsequently modify it based on the given CLTS features.

I start from the featureset of the CLTS sound, which I then order according to the defined hierarchy, to make sure that features are applied in the correct order.

I haven't included a demonstration of that yet, but will write one later this afternoon and commit it.

LinguList commented 9 months ago

What you can also do is (but this may become a bit difficult): you make a network representation of features (remember, feature names for feature values, such as "aspirated" etc. are all unique in CLTS!), and you parse down the tree. Either, you parse a whole sound, or you cannot parse again, so you must take the vector and then modify it.

LinguList commented 9 months ago

Just figured that this may be too complicated, since we cannot assume that the name of a clts sound is ordered. So I suggest this structure: look up the complete sound by name (internally use a frozenset when loading the dictionary!); if this fails, look up the base sound; if that fails too, proceed according to your order.
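The frozenset idea can be sketched as follows. The table content, the toy vector, and the helper name `find_sound` are assumptions for illustration only:

```python
# Hypothetical look-up table keyed by sound names; the vector is a toy value.
raw_lookup = {
    "voiceless glottal stop consonant": {"cons": 1, "cg": 1, "voi": -1},
}

# Convert the keys to frozensets once, when loading, so word order is irrelevant.
lookup = {frozenset(name.split()): vec for name, vec in raw_lookup.items()}


def find_sound(name, lookup):
    """Return a copy of the vector for a complete sound name, or None."""
    vec = lookup.get(frozenset(name.split()))
    return dict(vec) if vec is not None else None


print(find_sound("glottal voiceless stop consonant", lookup))
print(find_sound("voiced bilabial stop consonant", lookup))
```

A `None` result would be the signal to fall back to the base sound and then to the feature-by-feature procedure.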

LinguList commented 9 months ago

For cases like "strident", you should ask yourself how you want to proceed in the modification. I think it is no problem to withhold a set of vector items and make a final check to modify them?
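One possible reading of such a final check, sketched under the assumption that "strident" should only keep its value for fricatives and affricates and is reset to 0 (non-applicable) otherwise. The abbreviation `strid` and the function name are hypothetical:

```python
def finalize_strident(vec, featureset):
    """Final check: keep 'strid' only if the sound is a fricative or affricate."""
    result = dict(vec)  # work on a copy, leave the input untouched
    if not {"fricative", "affricate"} & set(featureset):
        result["strid"] = 0  # not applicable for all other sounds
    return result


print(finalize_strident({"strid": 1}, {"voiceless", "alveolar", "fricative"}))
print(finalize_strident({"strid": 1}, {"voiceless", "alveolar", "stop"}))
```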

LinguList commented 9 months ago

Here's an example of how one can derive initial vectors.

from clts2vec.features import clts_features, binary_features

def parse(sound):
    # use the order in clts
    base_vec = clts_features["type"][sound.type]
    for attr in sound._name_order:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][getattr(sound, attr)]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
    return [base_vec[f] for f in binary_features]
LinguList commented 9 months ago

I added this as parse.py in the clts2vec package for testing. Running it is easy:

from pyclts import CLTS
from clts2vec.parse import parse

bipa = CLTS().bipa

for sound in ["p", "t", "k", "pʰ", "f", "fʰ", "ʔ"]:
    vec = parse(bipa[sound])
    print(vec)

Output:

[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
LinguList commented 9 months ago

Is that what you wanted?

arubehn commented 9 months ago

Yes, that looks good, although for consonants, we need to proceed in the reverse order, so that coarticulatory features are not overwritten. Consider the case of a devoiced nasal consonant: Nasal consonants are by default [+voi]; this should later be overwritten by "devoiced" to [-voi]. Secondary articulation must therefore always be applied after primary articulation.
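The effect of the processing order can be shown with a toy example. The feature values below are illustrative assumptions, not the project's definitions:

```python
# Toy definitions for the devoiced nasal case.
features = {
    "nasal": {"nas": 1, "son": 1, "voi": 1},  # primary: nasals default to [+voi]
    "devoiced": {"voi": -1},                  # secondary articulation
}


def apply_in_order(order, features):
    """Apply feature definitions one after another; later ones overwrite earlier ones."""
    vec = {}
    for name in order:
        for k, v in features[name].items():
            if v in (1, -1):
                vec[k] = v
    return vec


primary_first = apply_in_order(["nasal", "devoiced"], features)
secondary_first = apply_in_order(["devoiced", "nasal"], features)
print(primary_first["voi"], secondary_first["voi"])
```

Only the first order yields [-voi] for the devoiced nasal; in the second, the default of "nasal" overwrites the modifier.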

But the more I think about it, the more I like the idea of using the order given by CLTS, but I think, it must be processed from the back. I would need to think about how/if this would work for vowels and diphthongs, though.

I will add and modify your suggested methods, thanks for testing that! :)

LinguList commented 9 months ago

Sorry!

LinguList commented 9 months ago
def parse(sound):
    # use the order in clts
    base_vec = clts_features["type"][sound.type].copy()
    for attr in sound._name_order:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][val]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
    return base_vec, [base_vec[f] for f in binary_features]

Mind that you must copy the dictionary, if not, you write over it ;-)
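A short sketch of why the copy matters; this would also be consistent with the output further up, where later vectors appear to carry over values set for earlier sounds. The table below is a toy assumption:

```python
# Toy feature table; keys and values are illustrative assumptions.
clts_features = {"type": {"consonant": {"cons": 1, "voi": 0}}}


def parse_without_copy(table):
    vec = table["type"]["consonant"]         # same dict object as the table entry
    vec["voi"] = 1                           # this also changes the table itself
    return vec


def parse_with_copy(table):
    vec = table["type"]["consonant"].copy()  # independent shallow copy
    vec["voi"] = 1                           # the table stays untouched
    return vec


parse_with_copy(clts_features)
print(clts_features["type"]["consonant"]["voi"])
parse_without_copy(clts_features)
print(clts_features["type"]["consonant"]["voi"])
```

A shallow copy is enough here because the vector values are plain integers.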

LinguList commented 9 months ago

I think you should use your OWN logical order here and just proceed in this fashion. This is probably the easiest way to proceed, right?

arubehn commented 9 months ago

Yes, I already noticed that :D

LinguList commented 9 months ago

Okay, had you told me, I would not have spent the last 30 minutes looking for the problem.

arubehn commented 9 months ago

It's probably more robust for me to define my own logical order, even if it coincides with large parts of the CLTS ordering. That way, it is easily adjustable for all different cases.

LinguList commented 9 months ago

For the order: please define your own order, since the order in CLTS is not really based on usefulness for vector generation.

arubehn commented 9 months ago

I am sorry, I just came back to work on it five minutes ago.

LinguList commented 9 months ago

CLTS order is based on what we consider useful in writing a character in IPA or in calling it by its name.

LinguList commented 9 months ago

But that makes the whole procedure VERY straightforward now. You can even put it all in one file, and it's done, and we can directly include it in CLTS later.

arubehn commented 9 months ago

It is already there: The clts_feature_hierarchy dictionary in features.py.

LinguList commented 9 months ago

So you got your package then ;-)

LinguList commented 9 months ago

You'd only have to add the specific procedure to check for base names to handle individual sounds (like the glottal stop) and some individual features post-hoc, right?

arubehn commented 9 months ago

In my mind, I imagined an option where I could define that some feature values apply under a certain condition, as in: "glottal" is [+cg] IF "stop" is also present. Because this is a paradigm that will be very useful when dealing with vowels and diphthongs, where binary features frequently are assigned based on combinations of descriptive features. I am just not sure what the best representation for that would be.

LinguList commented 9 months ago

Hm, I don't know if this is not getting maybe too complex later on.

LinguList commented 9 months ago

However, think of the following representation:

"glottal": {
  "cg": {"if": {"stop": 1}, "else": 0}
}

(just an idea)

Then the parser only needs a minimal modification, as follows:

def parse(sound):
    # use your own feature hierarchy instead of the order in clts
    base_vec = clts_features["type"][sound.type].copy()
    name_parts = sound.name.split()
    for attr in feature_hierarchy:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][val]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
                elif isinstance(v, dict):
                    # conditional value: start from the default ...
                    base_vec[k] = v["else"]
                    # ... and override it if a condition word occurs in the name
                    for itm, value in v["if"].items():
                        # whole-word match, so "voiced" does not match "devoiced"
                        if itm in name_parts:
                            base_vec[k] = value
                            break

    return base_vec, [base_vec[f] for f in binary_features]
LinguList commented 9 months ago

Not sure, if this is very elegant and safe, but it would account for what you want, right?

arubehn commented 9 months ago

We might need to encode more complex conditions, though. I had already written a parsing method condition_applies() in io.py that is able to handle basic predicate logic (supporting the operators AND, OR, NOT). The downside of this, of course, is that it requires us to parse a condition string; but I don't think representing conditions in terms of predicate logic is that much of a stretch. Maybe I am taking a sledgehammer to crack a nut here, but my experience shows that especially in diphthongs, multiple descriptive features tend to interact in complex ways.

LinguList commented 9 months ago

Yes, that is something I do want to avoid. It seems attractive at first as you think you can spare explicit writing of some characters that have some features that you cannot derive now, but I can tell you that these conditions when taken in combination are very difficult to process mentally, so I'd recommend to not do this.

LinguList commented 9 months ago

But if you go for conditions, why not extend it in this form:

"glottal": {
  "cg": {
    "conditions": [
      [{"glottal": True, "voiced": True}, 1],
      [{}, 0]
    ]
  }
}
arubehn commented 9 months ago

An alternative, more explicit approach would be to allow for feature pairs (or n-tuples) as keys of the feature dictionary. Then we could have something like

clts_feature_values = {
    'glottal': {
        'features': {...},
        'domain': 'place'
    },
    ('glottal', 'stop'): {
        'features': {'cg': 1},
        'domain': 'combined'
    }
}
LinguList commented 9 months ago

This looks nice, but you'd have a problem for checking now.

First: order of names should not be checked against the clts names, since they can change in the future, so it would have to resolve to the set characteristics.

Second: you'd have to find the largest subset of feature tuples that occurs in the clts name (or the frozenset). So it would mean you have to iterate over all tuples and then check what you match, or decide that you take the largest match, etc.

LinguList commented 9 months ago

So, suppose you have a collection of feature tuples as keys with their vectors:

features = {
  "glottal stop": {vector},
  "voiced glottal stop": {vector},
  "glottal": {vector},
  "voiced": {vector}
}

So now you'd have to do something like:

You can of course write a nice while loop for this, but do you think that works?
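One way such a while loop could look is sketched below. This is only a guess at the intended procedure: it greedily takes the largest key that is still contained in the remaining feature names. The table is keyed by frozensets rather than strings so that word order does not matter, and all vectors are toy values:

```python
# Hypothetical feature table keyed by frozensets of feature names; toy vectors.
features = {
    frozenset({"glottal", "stop"}): {"cg": 1},
    frozenset({"voiced", "glottal", "stop"}): {"cg": 1, "voi": 1},
    frozenset({"glottal"}): {"back": -1},
    frozenset({"voiced"}): {"voi": 1},
}


def match_largest(featureset, features):
    """Greedily consume the largest matching key until nothing matches anymore."""
    remaining = set(featureset)
    vec = {}
    while remaining:
        candidates = [key for key in features if key <= remaining]
        if not candidates:
            break
        best = max(candidates, key=len)
        vec.update(features[best])
        remaining -= best
    return vec, remaining


print(match_largest({"voiced", "glottal", "stop", "consonant"}, features))
```

The sketch does not decide what should happen when two candidates of equal size match; that would need an explicit rule.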

arubehn commented 9 months ago

The way I had imagined it is that the combined definitions do not contradict the base ones. So, in this case, ("glottal", "stop") would only account for [+cg], while the rest of the features are still derived from the base definitions of "glottal" and "stop". That would save us the ordering, checking for largest subset, checking if subsets overlap, etc. Long story short: If we design it so that the tuples and the singleton features are safe to apply (given the correct order), lookup should not be a problem.

arubehn commented 9 months ago

We could just extract the keys that are of type tuple from the feature dictionary, and then check which ones apply to a given featureset.
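A minimal sketch of this extraction step, assuming a dictionary shaped like the `clts_feature_values` example above. The concrete feature values and the helper name are illustrative assumptions:

```python
# Toy dictionary mixing single features and tuple keys for joint definitions.
clts_feature_values = {
    "glottal": {"features": {"cons": 1}, "domain": "place"},
    "stop": {"features": {"cont": -1}, "domain": "manner"},
    ("glottal", "stop"): {"features": {"cg": 1}, "domain": "combined"},
}

# Extract all keys that are tuples, i.e. the joint definitions.
joint_keys = [key for key in clts_feature_values if isinstance(key, tuple)]


def applicable_joint_keys(featureset):
    """Joint keys whose members are all present in the given featureset."""
    present = set(featureset)
    return [key for key in joint_keys if set(key) <= present]


print(applicable_joint_keys({"voiceless", "glottal", "stop", "consonant"}))
print(applicable_joint_keys({"voiceless", "glottal", "fricative", "consonant"}))
```

Because the comparison uses sets, the order of names inside the tuple does not need to match the order in the CLTS name.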

LinguList commented 9 months ago

But how do you match glottal and stop then? You still must iterate over the features, right? So my suggestion would be to declare these as "post-processing" then, and make another feature dictionary or a list of features. So first, you proceed in the way we outlined above, then you refine the feature vector for those cases where the features match. You could here even use absence, I would say.

arubehn commented 9 months ago

Yes, we would still need to iterate over the features, but in this setting we are talking about O(n) complexity, because we don't need to go back and forth, match and modify, etc.

Those conditional/joint feature definitions, as far as I can tell, always apply to the base sound. If we process them post-hoc - so after processing the whole sound - I am worried that we might accidentally overwrite some modifications made by secondary articulatory modifiers. I think, the clean way therefore would be to process them between processing the base sound and the modifiers. Of course, nothing speaks against declaring those joint definitions in a separate dictionary.
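The proposed order (base sound, then joint definitions, then modifiers) could be sketched like this. The three tables, their values, and the function name are assumptions for illustration; the joint definitions live in a separate dictionary, as suggested:

```python
# Toy tables; all names and values are illustrative assumptions.
base_features = {
    "nasal": {"nas": 1, "son": 1},
    "consonant": {"cons": 1},
}
joint_features = {
    frozenset({"nasal", "consonant"}): {"voi": 1},
}
modifier_features = {
    "devoiced": {"voi": -1},
}


def build_vector(base, modifiers):
    """Apply base features, then joint definitions, then modifiers."""
    vec = {}
    # Pass 1: features of the base sound.
    for name in base:
        vec.update(base_features.get(name, {}))
    # Pass 2: joint definitions, checked against the base sound only.
    for key, values in joint_features.items():
        if key <= set(base):
            vec.update(values)
    # Pass 3: modifiers come last, so nothing overwrites them afterwards.
    for name in modifiers:
        vec.update(modifier_features.get(name, {}))
    return vec


print(build_vector(["nasal", "consonant"], ["devoiced"]))
```

If pass 2 ran after pass 3 instead, the joint definition would reset the devoiced nasal to [+voi], which is the overwrite described above.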

LinguList commented 9 months ago

3 iterations are also fine!

arubehn commented 9 months ago

Of course it is computationally feasible, but I don't think it is necessary in the first place.

LinguList commented 9 months ago

"I think, the clean way therefore would be to process them between processing the base sound and the modifiers. Of course, nothing speaks against declaring those joint definitions in a separate dictionary."

These are 3 iterations, right?

LinguList commented 9 months ago

Explicit is better than implicit.