clics / clicsbp

CLDF dataset on Body Part Colexifications
Creative Commons Attribution 4.0 International

Coverage for body part terms #14

Closed AnnikaTjuka closed 1 year ago

AnnikaTjuka commented 2 years ago

We need to check the coverage for body part terms. I need to think about how best to do this. Do you have any ideas for how we can make sure we have sampled enough body part concepts from the list? Mutual coverage, or a minimum number of concepts from a concept list? Maybe we can discuss this in an issue?

Originally posted by @LinguList in https://github.com/clics/clicsbp/pull/13#pullrequestreview-815973046

LinguList commented 2 years ago

@AnnikaTjuka, coverage is discussed in our 2018 CLICS paper, where we introduce average mutual coverage. Alternative ways to measure coverage are of course possible. To get started, a list showing how many languages and how many families each concept is attested in would probably be useful. This can be derived from the colexifications with R, or with Python as part of the CLDFBench routine.
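
A minimal sketch of such a per-concept list, assuming the data has already been reduced to (language, family, concept) triples. The sample triples and the function name `concept_coverage` are hypothetical and not taken from the repository.

```python
from collections import defaultdict

# Hypothetical sample of (language, family, concept) triples.
forms = [
    ("German", "Indo-European", "FACE"),
    ("German", "Indo-European", "HAND"),
    ("Hindi", "Indo-European", "FACE"),
    ("Yoruba", "Atlantic-Congo", "HAND"),
    ("Yoruba", "Atlantic-Congo", "FACE"),
    ("Yoruba", "Atlantic-Congo", "FINGERTIP"),
]


def concept_coverage(forms):
    """Map each concept to (number of languages, number of families)."""
    languages = defaultdict(set)
    families = defaultdict(set)
    for language, family, concept in forms:
        languages[concept].add(language)
        families[concept].add(family)
    return {c: (len(languages[c]), len(families[c])) for c in languages}


coverage = concept_coverage(forms)

# List concepts from best to worst attested.
for concept, (n_lang, n_fam) in sorted(
    coverage.items(), key=lambda item: (-item[1][0], item[0])
):
    print(concept, n_lang, n_fam)
```

Sorting by the number of languages makes sparsely attested concepts easy to spot at the bottom of the list.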

AnnikaTjuka commented 2 years ago

Ok, I'll try it in Python first and if I don't get anywhere, I'll use R. But should we integrate the "OR" concepts first? I assume that some lists only use "hand/arm" glosses.

LinguList commented 2 years ago

Yes, I can try and look at that later.

AnnikaTjuka commented 2 years ago

Great!

LinguList commented 2 years ago

@AnnikaTjuka, I have now added two versions of a coverage overview and also modified the language families. Dogon is a valid Glottolog family and Bangime is an isolate, so I replaced Bangime with Dogon. Some say Dogon belongs to Atlantic-Congo, but that is not clear.

LinguList commented 2 years ago

The coverage is now calculated upon running the first lexibank.makecldf script.

AnnikaTjuka commented 2 years ago

Great, thanks! I just ran the updated version of lexibank.makecldf and looked at the distribution. I think in some cases it is predictable that there are not so many instances (e.g. FINGERTIP). But FACE should occur in more than 14 language families. I'll check these concepts in addition to the emotion concepts in CLICS regarding phonetic transcriptions.

LinguList commented 2 years ago

@AnnikaTjuka, I have now added a script computing coverage.
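
For orientation, here is one possible reading of average mutual coverage: the mean number of concepts shared by each pair of languages. This is a sketch under that assumption and may differ from what the script in the repository computes. The inventories and the function name `average_mutual_coverage` are hypothetical.

```python
from itertools import combinations

# Hypothetical concept inventories, one set of concepts per language.
inventories = {
    "A": {"FACE", "HAND", "ARM", "EYE"},
    "B": {"FACE", "HAND", "EYE"},
    "C": {"HAND", "ARM"},
}


def average_mutual_coverage(inventories):
    """Mean number of shared concepts over all unordered language pairs."""
    pairs = list(combinations(sorted(inventories), 2))
    if not pairs:
        return 0.0
    shared = [len(inventories[a] & inventories[b]) for a, b in pairs]
    return sum(shared) / len(pairs)


amc = average_mutual_coverage(inventories)

# Normalizing by the size of the concept list gives a ratio between 0 and 1.
all_concepts = set().union(*inventories.values())
ratio = amc / len(all_concepts)
print(amc, ratio)
```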

LinguList commented 2 years ago

Languages | Concepts | Coverage | Coverage Ratio | Families | Valid Languages | Average Concepts
1446 1500 136.231 0.0908205 19 1442 316.757
1446 1400 135.963 0.0971164 19 1442 311.558
1446 1300 135.616 0.10432 19 1442 305.64
1446 1200 135.193 0.112661 19 1442 299.111
1446 1100 134.666 0.122423 19 1442 291.828
1446 1000 134.013 0.134013 19 1442 283.725
1446 900 133.189 0.147988 19 1442 274.636
1446 800 132.106 0.165132 19 1442 264.207
1446 700 130.695 0.186707 19 1442 252.324
1446 600 128.708 0.214514 19 1442 238.221
1446 500 125.922 0.251843 19 1442 221.534
1446 400 121.654 0.304135 19 1442 200.905
1446 300 114.782 0.382607 19 1442 174.77
1400 1500 142.432 0.0949549 19 1396 325.009
1400 1400 142.148 0.101534 19 1396 319.649
1400 1300 141.779 0.109061 19 1396 313.55
1400 1200 141.329 0.117774 19 1396 306.816
1400 1100 140.769 0.127972 19 1396 299.313
1400 1000 140.075 0.140075 19 1396 290.957
1400 900 139.199 0.154666 19 1396 281.586
1400 800 138.05 0.172563 19 1396 270.844
1400 700 136.552 0.195075 19 1396 258.604
1400 600 134.443 0.224072 19 1396 244.074
1400 500 131.486 0.262972 19 1396 226.884
1400 400 126.957 0.317392 19 1396 205.633
1400 300 119.675 0.398916 19 1396 178.726
1300 1500 152.71 0.101807 18 1293 341.913
1300 1400 152.38 0.108843 18 1293 336.145
1300 1300 151.953 0.116887 18 1293 329.582
1300 1200 151.432 0.126194 18 1293 322.337
1300 1100 150.784 0.137076 18 1293 314.261
1300 1000 149.981 0.149981 18 1293 305.276
1300 900 148.968 0.16552 18 1293 295.197
1300 800 147.64 0.18455 18 1293 283.648
1300 700 145.911 0.208444 18 1293 270.511
1300 600 143.478 0.23913 18 1293 254.92
1300 500 140.063 0.280126 18 1293 236.45
1300 400 134.857 0.337142 18 1293 213.677
1300 300 126.51 0.421699 18 1293 184.872
1200 1500 162.106 0.108071 17 1194 358.733
1200 1400 161.72 0.115514 17 1194 352.493
1200 1300 161.222 0.124017 17 1194 345.412
1200 1200 160.613 0.133844 17 1194 337.572
1200 1100 159.853 0.145321 17 1194 328.831
1200 1000 158.915 0.158915 17 1194 319.123
1200 900 157.736 0.175263 17 1194 308.249
1200 800 156.196 0.195245 17 1194 295.816
1200 700 154.184 0.220263 17 1194 281.646
1200 600 151.36 0.252267 17 1194 264.855
1200 500 147.381 0.294763 17 1194 244.921
1200 400 141.383 0.353456 17 1194 220.495
1200 300 131.77 0.439235 17 1194 189.591
1100 1500 169.75 0.113167 17 1094 375.909
1100 1400 169.291 0.120922 17 1094 369.108
1100 1300 168.7 0.129769 17 1094 361.388
1100 1200 167.975 0.139979 17 1094 352.842
1100 1100 167.072 0.151884 17 1094 343.312
1100 1000 165.959 0.165959 17 1094 332.733
1100 900 164.563 0.182848 17 1094 320.904
1100 800 162.742 0.203428 17 1094 307.389
1100 700 160.357 0.229082 17 1094 291.963
1100 600 157.01 0.261684 17 1094 273.686
1100 500 152.292 0.304584 17 1094 251.979
1100 400 145.221 0.363052 17 1094 225.48
1100 300 133.913 0.446376 17 1094 191.97
1000 1500 179.139 0.119426 17 994 395.187
1000 1400 178.588 0.127563 17 994 387.733
1000 1300 177.875 0.136827 17 994 379.256
1000 1200 177.001 0.147501 17 994 369.873
1000 1100 175.915 0.159923 17 994 359.424
1000 1000 174.578 0.174578 17 994 347.832
1000 900 172.907 0.192119 17 994 334.895
1000 800 170.725 0.213406 17 994 320.107
1000 700 167.865 0.239807 17 994 303.216
1000 600 163.84 0.273067 17 994 283.18
1000 500 158.188 0.316377 17 994 259.425
1000 400 149.817 0.374543 17 994 230.653
1000 300 136.463 0.454878 17 994 194.261
900 1500 190.197 0.126798 16 892 417.132
900 1400 189.519 0.135371 16 892 408.866
900 1300 188.642 0.145109 16 892 399.466
900 1200 187.567 0.156306 16 892 389.062
900 1100 186.232 0.169301 16 892 377.478
900 1000 184.589 0.184589 16 892 364.633
900 900 182.546 0.202829 16 892 350.336
900 800 179.881 0.224852 16 892 334
900 700 176.372 0.251961 16 892 315.298
900 600 171.438 0.285731 16 892 293.118
900 500 164.517 0.329034 16 892 266.836
900 400 154.357 0.385893 16 892 235.286
900 300 138.198 0.460661 16 892 195.321
800 1500 205.693 0.137129 16 793 442.289
800 1400 204.845 0.146318 16 793 433.044
800 1300 203.75 0.156731 16 793 422.548
800 1200 202.403 0.168669 16 793 410.915
800 1100 200.74 0.182491 16 793 397.998
800 1000 198.72 0.19872 16 793 383.779
800 900 196.255 0.218061 16 793 368.125
800 800 193.053 0.241316 16 793 350.267
800 700 188.797 0.26971 16 793 329.699
800 600 182.778 0.304631 16 793 305.231
800 500 174.464 0.348929 16 793 276.457
800 400 162.435 0.406089 16 793 242.251
800 300 143.643 0.478811 16 793 199.213
700 1500 226.668 0.151112 16 693 471.22
700 1400 225.57 0.161121 16 693 460.709
700 1300 224.153 0.172426 16 693 448.783
700 1200 222.405 0.185337 16 693 435.536
700 1100 220.291 0.200264 16 693 421.049
700 1000 217.744 0.217744 16 693 405.239
700 900 214.702 0.238558 16 693 388.17
700 800 210.836 0.263545 16 693 368.856
700 700 205.662 0.293803 16 693 346.389
700 600 198.266 0.330443 16 693 319.457
700 500 188.365 0.37673 16 693 288.344
700 400 174.257 0.435643 16 693 251.554
700 300 153.249 0.51083 16 693 206.269
600 1500 253.615 0.169076 16 593 506.833
600 1400 252.167 0.180119 16 593 494.802
600 1300 250.3 0.192538 16 593 481.167
600 1200 247.979 0.206649 16 593 465.967
600 1100 245.204 0.222913 16 593 449.488
600 1000 241.886 0.241886 16 593 431.685
600 900 237.932 0.264369 16 593 412.635
600 800 233.071 0.291339 16 593 391.437
600 700 226.441 0.323488 16 593 366.305
600 600 217.01 0.361683 16 593 336.182
600 500 204.639 0.409279 16 593 301.712
600 400 187.348 0.468369 16 593 261.313
600 300 162.924 0.543079 16 593 212.787
500 1500 283.783 0.189189 14 490 548.874
500 1400 281.763 0.201259 14 490 534.724
500 1300 279.378 0.214906 14 490 519.678
500 1200 276.1 0.230084 14 490 501.692
500 1100 272.395 0.247632 14 490 482.866
500 1000 267.906 0.267906 14 490 462.45
500 900 262.469 0.291632 14 490 440.41
500 800 255.999 0.319999 14 490 416.266
500 700 247.3 0.353286 14 490 387.816
500 600 234.759 0.391265 14 490 353.322
500 500 218.527 0.437054 14 490 313.974
500 400 196.696 0.49174 14 490 268.832
500 300 167.514 0.55838 14 490 215.912
400 1500 338.038 0.225358 9 391 609.42
400 1400 335.042 0.239315 9 391 592.533
400 1300 331.558 0.255044 9 391 575.4
400 1200 326.591 0.272159 9 391 553.577
400 1100 321.061 0.291874 9 391 531.168
400 1000 314.33 0.31433 9 391 506.815
400 900 306.06 0.340066 9 391 480.3
400 800 296.792 0.37099 9 391 452.195
400 700 284.168 0.405954 9 391 418.815
400 600 265.744 0.442907 9 391 377.707
400 500 242.803 0.485607 9 391 331.483
400 400 212.027 0.530067 9 391 278.572
400 300 173.896 0.579653 9 391 218.82
300 1500 420.935 0.280623 7 291 692.227
300 1400 416.024 0.29716 7 291 671.353
300 1300 410.424 0.315711 7 291 650.187
300 1200 402.308 0.335257 7 291 623.073
300 1100 393.177 0.357433 7 291 594.62
300 1000 382.422 0.382422 7 291 564.327
300 900 368.802 0.40978 7 291 530.6
300 800 355.102 0.443877 7 291 497.593
300 700 335.35 0.479072 7 291 456.627
300 600 307.18 0.511967 7 291 406.7
300 500 277.038 0.554076 7 291 354.883
300 400 237.02 0.59255 7 291 294.96
300 300 190.282 0.634272 7 291 229.33

LinguList commented 2 years ago

Of these options, we should choose one. Note that languages are now represented by unique glottocodes. This is an advantage, as it allows us to filter out those languages which have duplicated glottocodes for the analysis later on.
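
A minimal sketch of such a filter, assuming each variety carries a Glottocode and a concept count. Keeping the variety with the most concepts per Glottocode is an assumption made for illustration, not the rule used in the repository. The variety identifiers, counts, and the function name `best_per_glottocode` are hypothetical.

```python
# Hypothetical varieties: (variety_id, glottocode, number_of_concepts).
varieties = [
    ("ids-german", "stan1295", 310),
    ("nelex-german", "stan1295", 280),
    ("ids-hindi", "hind1269", 295),
]


def best_per_glottocode(varieties):
    """Keep one variety per Glottocode: the one with the most concepts."""
    best = {}
    for variety_id, glottocode, n_concepts in varieties:
        current = best.get(glottocode)
        if current is None or n_concepts > current[2]:
            best[glottocode] = (variety_id, glottocode, n_concepts)
    return sorted(best.values())


selected = best_per_glottocode(varieties)
for variety in selected:
    print(variety)
```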

LinguList commented 2 years ago

I think something like 1400 languages and 1000 concepts may be useful. We should add one more language family, though.
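
To compare candidate settings, the rows of the table above can be parsed and filtered. The sketch below copies three rows from that table. The field names and the function `parse_rows` are hypothetical. Judging from the numbers, the Coverage Ratio column appears to be Coverage divided by Concepts, and the code checks that reading.

```python
# Three rows copied from the table above (1000 concepts each).
rows_text = """\
1446 1000 134.013 0.134013 19 1442 283.725
1400 1000 140.075 0.140075 19 1396 290.957
1300 1000 149.981 0.149981 18 1293 305.276
"""

# Hypothetical field names mirroring the table header.
FIELDS = [
    "languages", "concepts", "coverage", "coverage_ratio",
    "families", "valid_languages", "average_concepts",
]


def parse_rows(text):
    """Turn whitespace-separated rows into dictionaries of floats."""
    rows = []
    for line in text.splitlines():
        values = line.split()
        if len(values) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, map(float, values))))
    return rows


rows = parse_rows(rows_text)

# Check the apparent relation: coverage_ratio == coverage / concepts.
ratio_holds = all(
    abs(r["coverage"] / r["concepts"] - r["coverage_ratio"]) < 1e-6
    for r in rows
)

# Pick the setting proposed in the comment: 1400 languages, 1000 concepts.
chosen = next(
    r for r in rows if r["languages"] == 1400 and r["concepts"] == 1000
)
print(ratio_holds, chosen["coverage"], int(chosen["families"]))
```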

AnnikaTjuka commented 1 year ago

I think this issue can be closed. I'd propose not making any more changes to the languages and concepts at this stage.