clics / pyclics

python package implementing the CLICS processing workflow
Apache License 2.0

Update for CLICS4 #29

Open · LinguList opened 2 years ago

LinguList commented 2 years ago

For CLICS4, we will have some 52 datasets, all segmented and therefore analyzable with LingPy's cognate detection methods. This means we can offer enhanced networks, which requires integrating code that has already been written, but not yet for pyclics (rough sketches of both pieces follow below the list):

  1. code for the identification of cognates among colexifications in the same family (https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L173-L217)
  2. code for the computation of weights using random walks (this will increase the paths among concepts through neighbors and could be useful for semantic metrics in the future, but it is not clear how feasible it is to run it on all data: https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/clicsbpcommands/colexifications.py#L127-L167)
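
For the first point, here is a minimal sketch of cognate identification among colexifications within one family. It uses a normalized edit distance over segmented forms as a cheap stand-in for the full LingPy-based detection in clicsbp; the input format, function name, and threshold are illustrative assumptions, not the clicsbp code.

```python
from itertools import combinations

from lingpy import edit_dist


def cognate_colexifications(colexifications, threshold=0.45):
    """Flag colexifications in one family whose forms look cognate.

    `colexifications` is assumed to be a list of dicts with the keys
    "family", "concept_a", "concept_b", and "segments" (a list of
    sound segments) -- a hypothetical input format for illustration.
    """
    by_link = {}
    for row in colexifications:
        key = (row["family"], row["concept_a"], row["concept_b"])
        by_link.setdefault(key, []).append(row["segments"])
    cognate = set()
    for key, forms in by_link.items():
        for f1, f2 in combinations(forms, 2):
            # normalized edit distance over segments as a cheap
            # stand-in for full LexStat-style cognate detection
            if edit_dist(f1, f2, normalized=True) <= threshold:
                cognate.add(key)
                break
    return cognate
```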
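
For the second point, a hedged sketch of random-walk-based weights: short weighted random walks start from every concept, and concepts that co-occur on the same walk are counted together, so paths through shared neighbors strengthen the connection between two concepts. Function names and parameters are assumptions for illustration, not the clicsbp implementation.

```python
import random
from collections import Counter

import networkx as nx


def random_walk_weights(graph, walks_per_node=100, walk_length=5, seed=42):
    """Derive new edge weights from co-occurrence on short random walks.

    `graph` is assumed to be a weighted networkx graph with concept
    labels (strings) as nodes.
    """
    rnd = random.Random(seed)
    counts = Counter()
    for start in graph.nodes:
        for _ in range(walks_per_node):
            node, visited = start, [start]
            for _ in range(walk_length):
                neighbors = list(graph[node])
                if not neighbors:
                    break
                # step proportional to the existing edge weights
                weights = [graph[node][n].get("weight", 1) for n in neighbors]
                node = rnd.choices(neighbors, weights=weights)[0]
                visited.append(node)
            # every pair of concepts on the same walk is counted once
            for a in set(visited):
                for b in set(visited):
                    if a < b:
                        counts[a, b] += 1
    weighted = nx.Graph()
    for (a, b), c in counts.items():
        weighted.add_edge(a, b, weight=c)
    return weighted
```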

Given that we were asked about certain aspects of the CLICS data where the data online differs from the data we report in Concepticon (e.g., weighted degree, etc.), it would this time also be good to compute the Concepticon table (or NoRaRe table) directly when computing CLICS, so that we have a concrete reference and no hidden script that runs on someone's computer and is not officially shared. So, when doing the colexification search, we should additionally:

  1. compute statistics (weighted degree, degree)
  2. run the subgraph method, which is currently run directly in CLLD, also in the Python code, to determine the subgraphs (see the sketch after this list)
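
A minimal sketch of both steps, assuming the colexification network is a weighted networkx graph with concept labels as nodes; the neighborhood-based subgraph routine here is only a rough stand-in for the method used in CLLD, and all names are illustrative.

```python
import networkx as nx


def concept_statistics(graph):
    """Plain and weighted degree per concept, as one table-like dict."""
    return {
        node: {
            "degree": graph.degree(node),
            "weighted_degree": graph.degree(node, weight="weight"),
        }
        for node in graph.nodes
    }


def concept_subgraph(graph, concept, max_nodes=30):
    """Local subgraph around one concept: the concept, its neighbors,
    and the neighbors' neighbors, capped at `max_nodes` -- a rough
    stand-in for the subgraph routine behind the CLICS website."""
    nodes = {concept}
    for n in graph[concept]:
        nodes.add(n)
        nodes.update(list(graph[n])[: max(0, max_nodes - len(nodes))])
        if len(nodes) >= max_nodes:
            break
    return graph.subgraph(nodes).copy()
```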

All in all, this is SOME work to be done.

To explain the subgraph issue: we had some users asking why the data on the website differs from the data in the Concepticon version of CLICS3 (the Rzymski-2020-XXXX list).

LinguList commented 2 years ago

One more thing we should look into is de-merging the OR concepts upon creation of the big dataset. We could also ignore this, but there is some advantage in saying that ARM OR HAND should be rendered as ARM and HAND. This means the Lexibank-style CLDF creation of the big CLICS dataset would have to do this in the cldfbench/lexibank code, as we do in clics/clicsbp.

LinguList commented 2 years ago

We do this with an "unmerge" list that we create manually for a given dataset (see the sketch below): https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/lexibank_clicsbp.py#L111-L120
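
For illustration, a hypothetical sketch of how such an unmerge list could be applied during CLDF creation; the entries and helper names are made up here, and the real list in clics/clicsbp is curated manually per dataset.

```python
# hypothetical unmerge list: OR concepts mapped to their de-merged targets
UNMERGE = {
    "ARM OR HAND": ["ARM", "HAND"],
    # ... further manually curated entries for the given dataset
}


def unmerged_concepts(concept):
    """Return the concepts a form should be linked to: the de-merged
    targets from the unmerge list, or the concept itself."""
    return UNMERGE.get(concept, [concept])


# during form creation, one row per de-merged concept, e.g.:
# for target in unmerged_concepts(row["Concept"]):
#     add_form(language=row["Language"], parameter=target,
#              value=row["Value"])  # add_form is a hypothetical helper
```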

LinguList commented 2 years ago

I would suggest we do the same when creating CLICS4. Even if it is a bit crafty, I think it is better to have a well-curated CLDF dataset for CLICS4. I'd also restrict concepts this time to the top 1500 or so in terms of coverage, and do the same with languages.
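
As a sketch of the coverage restriction, assuming `forms` is an iterable of (concept, language) pairs from the merged dataset; the cutoff of 1500 just follows the suggestion above, and the function name is illustrative.

```python
from collections import defaultdict


def top_concepts(forms, n=1500):
    """Select the n concepts attested in the most languages."""
    coverage = defaultdict(set)
    for concept, language in forms:
        coverage[concept].add(language)  # coverage = number of languages
    ranked = sorted(coverage, key=lambda c: len(coverage[c]), reverse=True)
    return set(ranked[:n])
```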