LinguList opened 2 years ago
One more thing we should look into, then, is de-merging the OR concepts when creating the big dataset. We could also ignore this, but there is some advantage in saying that ARM OR HAND should be rendered as ARM and HAND. This means that the Lexibank-style CLDF creation of the big CLICS dataset would have to do this in the cldfbench/lexibank code, as we do in clics/clicsbp.
We do this with an "unmerge" list (that we create manually for a given dataset): https://github.com/clics/clicsbp/blob/fd571023865366e5be654d6ff05f1f36dcba1272/lexibank_clicsbp.py#L111-L120
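To make the idea concrete, here is a minimal sketch of what such an unmerge step could look like. It is not the code from clics/clicsbp: the `UNMERGE` mapping, the `unmerge_forms` helper and the row keys (`language`, `concept`, `form`) are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical, manually curated unmerge list:
# merged gloss -> the glosses it should be rendered as.
UNMERGE: Dict[str, List[str]] = {
    "ARM OR HAND": ["ARM", "HAND"],
}


def unmerge_forms(
    forms: Iterable[dict], unmerge: Dict[str, List[str]]
) -> Iterator[dict]:
    """Yield one row per target concept for every form row.

    Rows whose concept is not in the unmerge list are passed through
    unchanged, so nothing is split unless it was listed explicitly.
    """
    for form in forms:
        targets = unmerge.get(form["concept"], [form["concept"]])
        for target in targets:
            # Copy the row so the input data is not modified.
            yield {**form, "concept": target}


forms = [
    {"language": "lang1", "concept": "ARM OR HAND", "form": "ruka"},
    {"language": "lang1", "concept": "FOOT OR LEG", "form": "noga"},
]
result = list(unmerge_forms(forms, UNMERGE))
```

In this sketch, the single ARM OR HAND row becomes two rows (ARM and HAND) with the same form, which is what makes the colexification visible, while FOOT OR LEG stays as it is because it is not in the list.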
I would suggest we do the same when creating CLICS4. Even if it is a bit crafty, I think it is better to have a well-curated CLDF dataset for CLICS4. I'd also restrict concepts this time to roughly the top 1500 in terms of coverage, and do the same with languages.
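A restriction by coverage could be sketched roughly as follows. This assumes that "coverage" of a concept means the number of distinct languages attesting it (and, for languages, the number of distinct concepts they attest); the helper `top_by_coverage` and the row keys are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def top_by_coverage(
    forms: Iterable[dict], key: str, counted: str, n: int
) -> List[str]:
    """Return the n values of `key` attested by most distinct `counted` values.

    Ties are broken alphabetically so the selection is reproducible.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for form in forms:
        seen[form[key]].add(form[counted])
    ranked = sorted(seen, key=lambda item: (-len(seen[item]), item))
    return ranked[:n]


forms = [
    {"language": "lang1", "concept": "HAND"},
    {"language": "lang2", "concept": "HAND"},
    {"language": "lang3", "concept": "HAND"},
    {"language": "lang1", "concept": "ARM"},
    {"language": "lang2", "concept": "ARM"},
    {"language": "lang1", "concept": "FOOT"},
]

# Concepts ranked by the number of languages that attest them.
top_concepts = top_by_coverage(forms, "concept", "language", 2)
# Languages ranked by the number of concepts they attest.
top_languages = top_by_coverage(forms, "language", "concept", 1)
```

For the real dataset one would call this with `n=1500` for concepts; whether the language filter should be computed before or after the concept filter is a design decision left open here.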
For CLICS4, we will have some 52 datasets, all segmented and therefore analyzable with LingPy's cognate detection methods. This means we can offer enhanced networks (which requires integrating code that has already been written, though not yet for pyclics):
We have been asked about certain aspects of the CLICS data where the data online differs from the data we report in Concepticon (e.g., weighted degree). It would therefore also be good this time to compute the Concepticon table (or NoRaRe table) directly when computing CLICS, so that we have a concrete reference rather than a hidden script that runs on someone's computer and is not officially shared. So, when doing the colexification search, we should additionally:
All in all, this is SOME work to be done.
To explain the sub-graph issue: we had some users asking why data on the website is different from the data in the concepticon version of CLICS3 (Rzymski-2020-XXXX list).