Other idea: just make a module with the features:
`features.py`:

```python
binary_features = {
    "syl": "syllabic",
    "cons": "consonantal",
    ...
}

binary_feature_list = ["cons", "syl", "son", "cont", "delrel", ...]

clts_features = {
    "type": {
        "consonant": {"syl": 0, "cons": 1, ...},
        "vowel": {"syl": 1, "cons": 0, ...},
        ...
    },
    "place": {
        "alveolar": {"ant": 1, "cor": 1, "distr": 0},
    }
}
```
To import the features, you then just type:

```python
from clts2vec.features import clts_features
```
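For illustration, combining entries from such a module could then look like this; the concrete feature names and values below are placeholders, not the final definitions:

```python
# Toy stand-in for the proposed clts_features dictionary (illustrative only)
clts_features = {
    "type": {
        "consonant": {"syl": 0, "cons": 1},
        "vowel": {"syl": 1, "cons": 0},
    },
    "place": {
        "alveolar": {"ant": 1, "cor": 1, "distr": 0},
    },
}

# Combine the base vector for a consonant with a place specification
vec = dict(clts_features["type"]["consonant"])
vec.update(clts_features["place"]["alveolar"])
print(vec)  # {'syl': 0, 'cons': 1, 'ant': 1, 'cor': 1, 'distr': 0}
```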
I had actually already thought about this, and wanted to store the feature definitions in a JSON file once we have defined them all. But I can already set up a stub later today or next Monday.
Yes, as we are discussing abstract names here, not Unicode symbols, I suggest using pure Python for now.
And please do not use the implicit plus/minus (`+` / `-`) for features; use boolean 1 / 0 or True / False, as shown above.
Okay, but `-` would correspond to -1, since we also want to keep the option of rendering a feature non-applicable, which would then receive the value 0.
Yes, just please do not use implicit notations like `+` and `-` that you then need to check for by string matching, when you can check against numerical values directly.
I just created a module that contains literal Python dictionaries, as you had suggested. Is this roughly what you had in mind?
Also, there are some conditional feature mappings, and I am struggling with how to represent them without relying on implicit notations. For example, the glottal stop is [+cg], but neither glottal sounds nor stops are [+cg] by themselves. There are also some features that only apply to certain natural classes; "strident", for example, only applies to fricatives and affricates (and remains underspecified otherwise). Do you have an idea how to transparently represent conditional mappings? I presume they will be much more important when it comes to encoding vowels, so it would be good to find a sensible representation now.
Let me check :)
How do you intend to parse and create the features from a feature string? I think this question is crucial now. I suppose you want to start from some base features, get a first vector from there, and then expand it based on additional features.
What you may of course also consider is to start from the feature set, or the "name", that we provide in CLTS.
So if you start with a simple example like:
```python
>>> from pyclts import CLTS
>>> bipa = CLTS().bipa
>>> sound = bipa['ʔʰ']
>>> sound.name
'aspirated voiceless glottal stop consonant'
```
For consonants (the type is something you should check explicitly, as this is so easy to do), you could start in the following way:
```python
def lookup_features(sound_name, lookup):
    if sound_name in lookup:
        return lookup[sound_name]
    parts = sound_name.split()
    base_sound = " ".join(parts[-3:])  # "glottal stop consonant" as a fictive base
    if base_sound in lookup:
        vec = lookup[base_sound]
        rest = parts[:-3]
    else:
        vec = lookup[parts[-1]]  # "consonant"
        rest = parts[:-1]
    for itm in rest:
        new_vec = lookup[itm]
        update_vec(vec, new_vec)  # your way to combine here
    return vec
```
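A self-contained toy version of this lookup, with a made-up lookup table and a naive `update_vec` that simply merges feature dictionaries (later values win), behaves like this:

```python
def update_vec(vec, new_vec):
    # naive combination: later feature values overwrite earlier ones
    vec.update(new_vec)

def lookup_features(sound_name, lookup):
    if sound_name in lookup:
        return dict(lookup[sound_name])
    parts = sound_name.split()
    base = " ".join(parts[-3:])  # e.g. "glottal stop consonant" as fictive base
    if base in lookup:
        vec, rest = dict(lookup[base]), parts[:-3]
    else:
        vec, rest = dict(lookup[parts[-1]]), parts[:-1]
    for itm in rest:
        update_vec(vec, lookup[itm])
    return vec

# Hypothetical lookup table, not the real feature definitions
lookup = {
    "glottal stop consonant": {"cons": 1, "cg": 1, "voi": -1},
    "aspirated": {"sg": 1},
    "voiceless": {"voi": -1},
}
vec = lookup_features("aspirated voiceless glottal stop consonant", lookup)
print(vec)  # {'cons': 1, 'cg': 1, 'voi': -1, 'sg': 1}
```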
Yes, the workflow would be the same as before. I start by setting up a feature vector with only zeros, and subsequently modify it based on the given CLTS features.
I start from the featureset of the CLTS sound, which I then order according to the defined hierarchy, to make sure that features are applied in the correct order.
I haven't included a demonstration of that yet, but will write one later this afternoon and commit it.
What you can also do (but this may become a bit difficult) is make a network representation of the features (remember, feature names for feature values, such as "aspirated" etc., are all unique in CLTS!) and parse down the tree: either you parse a whole sound, or you cannot parse any further, so you must take the vector and then modify it.
Just figured this may be too complicated, since we cannot assume that the name of a CLTS sound is ordered. So the structure would be: look up the complete sound by name (internally, use a frozenset when loading the dictionary!); if this fails, look up the base sound; if that fails too, proceed according to your order.
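The frozenset part could be sketched like this: key the full-sound entries by frozensets of their name parts when loading, so the order of the parts in a name no longer matters (the names and vectors here are made up):

```python
# Hypothetical full-sound entries (illustrative names and vectors)
full_sounds = {
    "voiceless glottal stop consonant": {"cons": 1, "cg": 1, "voi": -1},
}

# Re-key by frozensets of name parts when loading the dictionary
by_set = {frozenset(name.split()): vec for name, vec in full_sounds.items()}

query = "glottal voiceless stop consonant"  # same parts, different order
vec = by_set.get(frozenset(query.split()))
print(vec)  # {'cons': 1, 'cg': 1, 'voi': -1}
```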
For cases like "strident", you should ask yourself how you want to proceed with the modification. I think it is no problem to withhold a set of vector items and make a final check to modify them?
Here's an example of how one can derive initial vectors.
```python
from clts2vec.features import clts_features, binary_features

def parse(sound):
    # use the order in clts
    base_vec = clts_features["type"][sound.type]
    for attr in sound._name_order:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][getattr(sound, attr)]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
    return [base_vec[f] for f in binary_features]
```
I added this as `parse.py` in the clts2vec package for testing. Running it is easy:
```python
from pyclts import CLTS
from clts2vec.parse import parse

bipa = CLTS().bipa
for sound in ["p", "t", "k", "pʰ", "f", "fʰ", "ʔ"]:
    vec = parse(bipa[sound])
    print(vec)
```
Output:
```
[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[1, 0, 0, 1, 0, 0, 0, -1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```
Is that what you wanted?
Yes, that looks good, although for consonants, we need to proceed in the reverse order, so that coarticulatory features are not overwritten. Consider the case of a devoiced nasal consonant: Nasal consonants are by default [+voi]; this should later be overwritten by "devoiced" to [-voi]. Secondary articulation must therefore always be applied after primary articulation.
But the more I think about it, the more I like the idea of using the order given by CLTS; I just think it must be processed from the back. I would need to consider how/if this would work for vowels and diphthongs, though.
I will add and modify your suggested methods, thanks for testing that! :)
Sorry!
```python
def parse(sound):
    # use the order in clts
    base_vec = clts_features["type"][sound.type].copy()
    for attr in sound._name_order:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][val]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
    return base_vec, [base_vec[f] for f in binary_features]
```
Mind that you must copy the dictionary; otherwise, you write over it ;-)
I think you should use your OWN logical order here and just proceed in this fashion. This is probably the easiest way to proceed, right?
Yes, I already noticed that :D
Okay, had you told me, I would not have spent the last 30 minutes looking for the problem.
It's probably more robust for me to define my own logical order, even if it coincides with large parts of the CLTS ordering. That way, it is easily adjustable for all different cases.
For the order: please define your own order, since the order in CLTS is not really based on usefulness for vector generation.
I am sorry, I just came back to work on it five minutes ago.
CLTS order is based on what we consider useful in writing a character in IPA or in calling it by its name.
But that makes the whole procedure VERY straightforward now. You can even put it all in one file, and it's done, and we can directly include it in CLTS later.
It is already there: the `clts_feature_hierarchy` dictionary in `features.py`.
So you got your package then ;-)
You'd only have to add the specific procedure that checks for base names to handle individual sounds (like the glottal stop) and some individual features post-hoc, right?
In my mind, I imagined an option where I could define that some feature values apply under a certain condition, as in: "glottal" is [+cg] IF "stop" is also present. Because this is a paradigm that will be very useful when dealing with vowels and diphthongs, where binary features frequently are assigned based on combinations of descriptive features. I am just not sure what the best representation for that would be.
Hm, I don't know whether this is maybe getting too complex later on.
However, think of the following representation:
"glottal": {
"cg": {"if": {"glottal": 1}, "else": 0}
}
(just an idea)
Then the parser only needs to be minimally modified, as follows:
```python
def parse(sound):
    # use the order in clts
    base_vec = clts_features["type"][sound.type].copy()
    for attr in feature_hierarchy:
        val = getattr(sound, attr)
        if attr in clts_features and val is not None:
            new_vec = clts_features[attr][val]
            for k, v in new_vec.items():
                if v in [1, -1]:
                    base_vec[k] = v
                elif isinstance(v, dict):
                    base_vec[k] = v["else"]
                    for itm, value in v["if"].items():
                        if itm in sound.name:
                            base_vec[k] = v["if"][itm]
                            break
    return base_vec, [base_vec[f] for f in binary_features]
```
Not sure if this is very elegant and safe, but it would account for what you want, right?
We might need to encode more complex conditions, though. I had already written a parsing method `condition_applies()` in `io.py` that is able to handle basic predicate logic (supporting the operators AND, OR, NOT). The downside of this, of course, is that it requires us to parse a condition string; but I don't think representing conditions in terms of predicate logic is that much of a stretch.
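To illustrate, a minimal evaluator in that spirit could look like this; the actual `condition_applies()` implementation is not reproduced here, so the whitespace-separated syntax and the operator precedence (NOT binds tightest, then AND, then OR) are assumptions:

```python
def condition_applies(condition, features):
    """Evaluate e.g. 'glottal AND stop AND NOT voiced' against a feature set."""
    tokens = condition.split()

    def parse_or(pos):
        val, pos = parse_and(pos)
        while pos < len(tokens) and tokens[pos] == "OR":
            rhs, pos = parse_and(pos + 1)
            val = val or rhs
        return val, pos

    def parse_and(pos):
        val, pos = parse_not(pos)
        while pos < len(tokens) and tokens[pos] == "AND":
            rhs, pos = parse_not(pos + 1)
            val = val and rhs
        return val, pos

    def parse_not(pos):
        if tokens[pos] == "NOT":
            val, pos = parse_not(pos + 1)
            return not val, pos
        # a bare token is true iff that feature is present
        return tokens[pos] in features, pos + 1

    result, _ = parse_or(0)
    return result

print(condition_applies("glottal AND stop", {"glottal", "stop", "voiceless"}))         # True
print(condition_applies("glottal AND NOT voiceless", {"glottal", "stop", "voiceless"}))  # False
```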
Maybe I am taking a sledgehammer to crack a nut here, but my experience shows that especially in diphthongs, multiple descriptive features tend to interact in complex ways.
Yes, that is something I do want to avoid. It seems attractive at first, since you think you can spare yourself explicitly writing out some characters whose features you cannot derive yet, but I can tell you that these conditions, taken in combination, are very difficult to process mentally, so I'd recommend not doing this.
But if you go for `conditions`, why not extend it in this form:

```python
"glottal": {
    "cg": {
        "conditions": [
            [{"glottal": True, "voiced": True}, 1],
            [{}, 0]
        ]
    }
}
```
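A first-match evaluator over such a conditions list, assuming each entry is a pair of required feature values plus the resulting binary value (with the empty dict as the catch-all "else" clause), could look like this sketch:

```python
def resolve(conditions, sound_features):
    # return the value of the first clause whose requirements all hold
    for required, value in conditions:
        if all(sound_features.get(f) == v for f, v in required.items()):
            return value
    return None  # no clause matched; leave the feature unspecified

conditions = [
    [{"glottal": True, "voiced": True}, 1],
    [{}, 0],  # catch-all "else" clause
]
print(resolve(conditions, {"glottal": True, "voiced": True}))  # 1
print(resolve(conditions, {"glottal": True}))                  # 0
```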
An alternative, more explicit approach would be to allow feature pairs (or n-tuples) as keys of the feature dictionary. Then we could have something like:
```python
clts_feature_values = {
    'glottal': {
        'features': {...},
        'domain': 'place'
    },
    ('glottal', 'stop'): {
        'features': {'cg': 1},
        'domain': 'combined'
    }
}
```
This looks nice, but you'd now have a problem with checking.
First: the order of names should not be checked against the CLTS names, since they can change in the future, so it would have to resolve to the set characteristics.
Second: you'd have to check for the largest subset of feature tuples found in the CLTS name (or the frozenset). So it would mean you have to iterate over all tuples and then check what you match, or decide that you take the largest match, etc.
So, suppose you have a collection of feature tuples as keys with their vectors:
```python
features = {
    "glottal stop": {vector},
    "voiced glottal stop": {vector},
    "glottal": {vector},
    "voiced": {vector}
}
```
So now you'd have to do something like:
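For instance, purely as a sketch (the keys and vectors here are made up): collect all tuple keys whose parts are a subset of the sound's name parts, then take the largest match first.

```python
# Hypothetical feature tuples, keyed by frozensets of their name parts
features = {
    frozenset(["glottal", "stop"]): {"cg": 1},
    frozenset(["voiced", "glottal", "stop"]): {"cg": 1, "voi": 1},
    frozenset(["glottal"]): {"cg": 0},
    frozenset(["voiced"]): {"voi": 1},
}

name_parts = frozenset("voiced glottal stop".split())
matches = [key for key in features if key <= name_parts]  # subset test
best = max(matches, key=len)  # take the largest matching subset
print(sorted(best))  # ['glottal', 'stop', 'voiced']
```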
You can of course write a nice while loop for this, but do you think that works?
The way I had imagined it is that the combined definitions do not contradict the base ones. So, in this case, ("glottal", "stop") would only account for [+cg], while the rest of the features are still derived from the base definitions of "glottal" and "stop". That would save us the ordering, checking for largest subset, checking if subsets overlap, etc. Long story short: If we design it so that the tuples and the singleton features are safe to apply (given the correct order), lookup should not be a problem.
We could just extract the keys that are of type tuple from the feature dictionary, and then check which ones apply to a given featureset.
But how do you match "glottal" and "stop" then? You still must iterate over the features, right? So my suggestion would be to declare these as "post-processing" then, and make another feature dictionary or a list of features. So first you proceed in the way we outlined above, then you refine the feature vector for those cases where the features match. You could even use absence here, I would say.
Yes, we would still need to iterate over the features, but in this setting we are talking about O(n) complexity, because we don't need to go back and forth, match and modify, etc.
Those conditional/joint feature definitions, as far as I can tell, always apply to the base sound. If we process them post-hoc - so after processing the whole sound - I am worried that we might accidentally overwrite some modifications made by secondary articulatory modifiers. I think, the clean way therefore would be to process them between processing the base sound and the modifiers. Of course, nothing speaks against declaring those joint definitions in a separate dictionary.
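Sketched out, the order I have in mind would be: base sound first, then the joint definitions, then the modifiers. All names and vectors below are illustrative, not the real definitions:

```python
def build_vector(base_vec, featureset, joint_definitions, modifier_vecs):
    vec = dict(base_vec)                          # 1. base sound
    for required, updates in joint_definitions:   # 2. joint definitions
        if required <= featureset:                #    (subset of descriptive features)
            vec.update(updates)
    for mod in modifier_vecs:                     # 3. modifiers are applied last
        vec.update(mod)
    return vec

base = {"cg": 0, "voi": -1, "cont": 0}                   # generic stop defaults
fset = frozenset(["voiceless", "glottal", "stop"])       # descriptive features
joint = [(frozenset(["glottal", "stop"]), {"cg": 1})]    # glottal stop is [+cg]
mods = [{"sg": 1}]                                       # e.g. "aspirated"
print(build_vector(base, fset, joint, mods))  # {'cg': 1, 'voi': -1, 'cont': 0, 'sg': 1}
```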
3 iterations are also fine!
Of course it is computationally feasible, but I don't think it is necessary in the first place.
> I think, the clean way therefore would be to process them between processing the base sound and the modifiers. Of course, nothing speaks against declaring those joint definitions in a separate dictionary.
These are 3 iterations, right?
Explicit is better than implicit.
The feature system you use, @arubehn, is not transparent from the code as of now. My suggestion: make a JSON file, some dictionary, and give the feature plus some information. In general, with your data, it may be useful to use JSON, as it is easy to read from Python, and you basically get some kind of look-up structure. It is a bit more verbose and requires a bit more hand-writing at first, but it is more transparent.
You can then read the JSON with the `json` library.
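A minimal illustration of that round trip with the standard `json` library; the feature content is a placeholder:

```python
import json

# Placeholder feature definitions, not the real system
features = {"glottal": {"cg": 0}, "voiced": {"voi": 1}}

# In practice this string would be written to and read from features.json;
# here we round-trip through a string for demonstration.
text = json.dumps(features, indent=2)
loaded = json.loads(text)
print(loaded == features)  # True
```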