clics / clics3

See also the Code Ocean capsule (https://codeocean.com/capsule/7201165/tree/v2) accompanying this project.
https://clics.clld.org

How to download a csv file that contains all colexifications? #17

Closed ianjoo closed 3 years ago

ianjoo commented 3 years ago

What is the terminal command that allows me to download all the colexifications, containing:

| ID A | Concept A | ID B | Concept B | Families | Languages | Words |
|-------:|:-------------------------|-------:|:---------------------------|-----------:|------------:|--------:|
| 906 | TREE | 1803 | WOOD | 59 | 348 | 361 |
...
chrzyki commented 3 years ago

While building the colexification network (see `clics colexification -h`), you can, for instance, redirect the output to a file:

clics -t 3 -f families colexification --show 3000 --format tsv > out.tsv
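
Once the output has been redirected, the file can be inspected from Python. The following is only a minimal sketch: it assumes that `out.tsv` has a header row with the same column names as the table in the question (`ID A`, `Concept A`, `ID B`, `Concept B`, `Families`, `Languages`, `Words`), which has not been checked against the actual output. The helper name `read_colexifications` is hypothetical, and the sample row is the TREE/WOOD example from the question.

```python
import csv
import io

# Sample in the ASSUMED layout of out.tsv; the single data row is the
# TREE/WOOD example quoted in the question.
SAMPLE = (
    "ID A\tConcept A\tID B\tConcept B\tFamilies\tLanguages\tWords\n"
    "906\tTREE\t1803\tWOOD\t59\t348\t361\n"
)

# Columns that are expected to hold counts (assumption, see above).
COUNT_COLUMNS = ("Families", "Languages", "Words")


def read_colexifications(handle):
    """Read tab-separated colexification rows into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Strip padding around header names and values.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Convert the count columns from strings to integers.
        for column in COUNT_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


rows = read_colexifications(io.StringIO(SAMPLE))
print(len(rows), "colexifications")
for row in rows:
    print(row["Concept A"], row["Concept B"], row["Families"])
```

For the real file, replace `io.StringIO(SAMPLE)` with `open("out.tsv", encoding="utf-8", newline="")`. The number of rows is then the number of colexifications that were written out.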
ianjoo commented 3 years ago

Thanks. But why 3000? What is the total number?

LinguList commented 3 years ago

You can also load the GML file in Python with networkx or igraph (igraph is also available for R) and browse all links in the network, which also carry metadata, so you should be able to access all colexifications.

Additionally, please check if this post by @tresoldi is useful as it treats working with CLICS data from within Python (with code published on Zenodo): https://calc.hypotheses.org/2552

chrzyki commented 3 years ago

> Thanks. But why 3000? What is the total number?

No particular reason, other than that there are roughly 3000 concepts in CLICS and that, generally speaking, the less frequent colexifications also tend to be less reliable. (Note, of course, that the number of concepts is not the same as the number of colexifications in CLICS.) `network-3-families.gml` has 4228 edges in total (this is before clustering with infomap), so there would be 4228 colexifications in total. The blog post that Mattis mentioned is a very good introduction to programmatically accessing the network data. Here's also a small snippet that shows how to access the data using igraph.

Note that the snippet is also based on @LinguList and @tresoldi's blog postings.
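
A minimal sketch of this kind of access, here with networkx rather than igraph, might look as follows. It is an illustration under assumptions, not the snippet referred to above: the helper names `edge_rows` and `write_tsv` are hypothetical, and no particular attribute names are assumed for the CLICS GML file, so every scalar edge attribute is simply copied into the output. The demo graph and its `weight` attribute are invented so that the code runs without the CLICS data.

```python
import csv
import io

import networkx as nx


def edge_rows(graph):
    """Turn every edge into a dict: endpoints plus scalar edge attributes."""
    rows = []
    for source, target, attributes in graph.edges(data=True):
        row = {"source": source, "target": target}
        # Keep only scalar attributes; nested values are skipped.
        for key, value in attributes.items():
            if isinstance(value, (str, int, float)):
                row[key] = value
        rows.append(row)
    return rows


def write_tsv(rows, handle):
    """Write the rows as tab-separated values with a header line."""
    # Collect column names in order of first appearance.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)


# Tiny stand-in graph (illustrative only): one edge between node IDs
# 906 and 1803 with an invented "weight" attribute.
demo = nx.Graph()
demo.add_edge(906, 1803, weight=59)

buffer = io.StringIO()
write_tsv(edge_rows(demo), buffer)
print(buffer.getvalue())
```

With the real data, the graph would be loaded with `nx.read_gml("network-3-families.gml", label="id")` (the path depends on where the file was written) and passed to `edge_rows`; the number of rows should then equal the number of edges in the file.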

tresoldi commented 3 years ago

There is also some code from the "semantic distance" work that I presented at SLE2019 and discussed in another CALC blog post: https://github.com/tresoldi/semantic_distance

I think what you want is something similar to the full list ( https://github.com/tresoldi/semantic_distance/blob/master/data/colexifications.tsv ), but you should really compute it yourself, and @chrzyki's snippet is clear. The data in the semantic_distance repository is outdated and includes all possible colexifications, including those found only between a single pair of languages, so there is a lot of noise in there.

chrzyki commented 3 years ago

Closing this for now. Feel free to reopen should any other questions arise.