cldf-clts / pyclts


Can the features package help to study features in CLTS? #21

Open LinguList opened 3 years ago

LinguList commented 3 years ago

The features package seems very convenient for exploring feature systems, so I am asking myself whether it could be integrated, e.g., either by exporting our feature sets in clts-data to the formats required there, or by looking at an integration from within pyclts.

LinguList commented 3 years ago

It might even be possible to say that we store the feature sets for our data in the INI format required by the package, for inter-operability in CLDF. I am also thinking here of other feature sets, like morphological features, since so far we have them only implicitly reflected in our data, but not coded in a computer-readable form (e.g., deriving features in CLTS is a bit tedious, as it requires iterating over sets, etc.).

tresoldi commented 3 years ago

I discussed this package some weeks ago with @XachaB (I think he used it in his thesis?), when I was preparing the matrix that eventually led to my hacky distfeat. The whole system is very nice and mature, but precisely because of this maturity I was not sure how easy or desirable it would be to integrate it with pyclts. My impression was that it would make sense to integrate the whole system for formal concept analysis (concepts) into the CLDF ecosystem, but that it seemed more appropriate for lexitools or equivalents.

This does not exclude the idea of adopting (preferably) a CLDF or at least a plain-tabular storage for features. It would help with code dealing with phonology, and the more we demonstrate that CLDF is not only for wordlists, the better.

LinguList commented 3 years ago

Okay, we already have about 5 if not more feature systems in the CLTS datasets, including our own CLTS feature system. Do you have the time, @tresoldi, to code our CLTS feature system in Sebastian's features format? You could even write a blog post. Once we have this one case, we could then do it for the other feature systems in CLTS as well, and from there we could see what it may bring us, okay?

tresoldi commented 3 years ago

His default format is a textual grid parsed in code (https://github.com/xflr6/features/blob/master/features/config.ini), but it accepts a string with the full contents of a CSV file as well.
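
As a minimal sketch (the toy sounds and feature values below are made up, and I am going from memory of the package docs that features.make_features builds an ad-hoc feature system from such a grid string), loading one of these grids could look like:

import features

# toy context in the textual-grid format, with hypothetical sounds and features
fs = features.make_features('''
     |+voiced|-voiced|+nasal|-nasal|
    b|   X   |       |      |   X  |
    p|       |   X   |      |   X  |
    m|   X   |       |   X  |      |
''')

print(fs('+voiced -nasal'))  # feature systems are callable with a feature-bundle string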

Exporting our features into these formats should be easy, but maybe we should have our own CLDF-to-features module? Or is it best to start small, with CSV files, and only later do a CLDF version (as we'd also need to discuss the structure)?

LinguList commented 3 years ago

I do not want to overcomplicate it. I want one file for our CLTS feature system and some tests of how the algebra works. This can also be done in the form of a blog post. One could also compare the system with the one we find in phoible. The feature systems can be submitted to CLTS in CSV format, in the form of a pull request, maybe together with an example of how they can be read into features. Any more complex discussion is not needed now, I think, as we are just testing this.

tresoldi commented 3 years ago

Great, I'll do it; I have been meaning to play with these libraries. :+1:

XachaB commented 3 years ago

I did not use it for my dissertation; instead I directly used his other package, concepts (https://pypi.org/project/concepts/), which is just pure FCA.

The input format is simple enough. The steps are:

1. compute dummy values (pandas has a method for it on dataframes, or you can re-code it by hand, it's quite simple), to get a binary matrix which serves as a Context for FCA,
2. replace the 1s by "X" and the 0s by the empty string, and
3. pass it as a CSV string to the library:

import concepts
import pandas as pd

# feature_table: a pandas DataFrame of sounds (rows) by feature values (columns)
context_table = pd.get_dummies(feature_table, prefix_sep="=")              # one-hot encode the feature values
context_table = context_table.applymap(lambda x: "X" if x == 1 else "")    # 1 -> "X", 0 -> ""
context_str = context_table.to_csv()
context = concepts.Context.fromstring(context_str, frmat='csv')
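
As a follow-up sketch (the sound labels 'p' and 'b' and the column name 'voice=voiced' are just what an illustrative input and get_dummies with prefix_sep="=" might produce), the resulting context can then be queried directly:

# properties shared by a set of sounds, and sounds matching a given property
print(context.intension(['p', 'b']))        # feature values common to these (hypothetical) sounds
print(context.extension(['voice=voiced']))  # sounds carrying this (hypothetical) feature value
lattice = context.lattice                   # the concept lattice, computed on first access
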
LinguList commented 3 years ago

Yep, I was aware that concepts is a dependency. For the purpose of understandability, etc., however, I think it is useful to start from features in CLTS, also as a test of its suitability. So can I expect some kind of blog post on that matter from you by October, then, @tresoldi?

tresoldi commented 3 years ago

Yes, code and post.

xrotwang commented 3 years ago

Thoughts on this @xflr6 ?

xflr6 commented 3 years ago

Thanks for adding me. +1 that FCA would be interesting here. :)

features is a relatively thin wrapper around concepts. It provides convenience for locating/loading contexts in INI files and is a bit more strict, notably adding the assumption/requirement that feature systems need to allow referring to each individual object (I think that is a useful requirement in practice, but I might be biased by doing morphology). So this might influence the decision.

I think many FCA tools read .cxt, and .csv also sounds good. Maybe it would make sense to add features.FeatureSystem.from_context() as an alternate constructor, so that features can benefit directly from all the input formats supported by concepts (which recently added support for serializing/loading also the lattice structure, instead of generating it on the fly each time).
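
For illustration, a minimal sketch of reading an existing .cxt context with concepts and building its lattice (the file name clts_features.cxt is hypothetical, and the load_cxt/lattice usage is based on the concepts documentation):

import concepts

# read a formal context from a .cxt file (hypothetical path)
c = concepts.load_cxt('clts_features.cxt')

# the concept lattice is computed on first access
for concept in c.lattice:
    print(concept.extent, concept.intent)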

LinguList commented 3 years ago

Thanks, @xflr6. I see the situation now like this: We have started to collect data, and we have about 5 if not 10 collections where people propose their feature systems for phonology (Chomsky-Halle, Phoible, etc.). So far, these features are just there, and nobody uses them. They could be accessed in code, but not in a straightforward manner (see Chomsky and Halle in CLTS).

So I was thinking: a dedicated standardized way of representing feature systems, plus a library that allows one to do some typical things with features (and here, I do not know much about the possibilities), would be good to help us further normalize/standardize what we have. And if successful, we might make it part of CLDF and go beyond distinctive features in phonology.

Depending on how well all of this works, one might also discuss some wrapper to read features from CLDF or to dump them to standard formats from a CSVW package (where one would have to discuss how features are handled in the metadata).

XachaB commented 3 years ago

This sounds great!

Note that you say "nobody uses them", which you might be happy to learn is incorrect! I have been using them in a current work in progress, where I wanted to compare the impact of various feature systems on my specific task. It did require a bit of parsing for each of them, but I use the features from panphon, phoible, nidaba and Chomsky and Halle, as well as @tresoldi's proposal for a distinctive feature model which spans the entire BIPA range (this is also really useful for my work). Another table I know of is from Hayes's software Pheature Pad. I also compare to the bipa features, even though they may not be strictly comparable.

I see a lot of value in a dedicated library and would gladly contribute.