concepticon / concepticon-data

The curation repository for the data behind Concepticon.
https://concepticon.clld.org

Williamson-1973-134 #278

Closed xrotwang closed 4 years ago

xrotwang commented 7 years ago

Williamson's Benue-Congo Comparative Word List seems to be one of the few lists in Comparalex for which the actual wordlists are also available. Thus, we should add it to Concepticon as well as lexibank.

Williamson, Kay and Kiyoshi Shimizu (eds.). 1968. Benue-Congo Comparative Wordlist, Volume I. Ibadan, Nigeria: West African Linguistic Society.

Williamson, Kay (ed.). 1973. Benue-Congo Comparative Wordlist, Volume II. Ibadan, Nigeria: West African Linguistic Society.

Notes (from Comparalex): The "Benue-Congo Comparative Wordlist" has been used in the study of non-Bantu Benue-Congo languages of Nigeria and Cameroon.

xrotwang commented 7 years ago

Here's the list as parsed from Comparalex:

Williamson-1973-134.tsv.txt

xrotwang commented 7 years ago

The list parsed from Comparalex and automatically mapped:

Williamson-1973-134.tsv.txt

Stats are not that good: 59% only. And there are a couple of difficult things like

Williamson-1973-134-107 093 tongue: language    langue, langage     ??? 
Williamson-1973-134-108 093 tongue: language    langue      ??? 

or

Williamson-1973-134-117 102 woman: wife femme       ??? 
Williamson-1973-134-118 102 woman: wife femme, épouse       ??? 

i.e. almost identical concepts appearing twice.

LinguList commented 7 years ago

well, the gloss-function in the concepticon takes as input, as far as I know, a set of "separators", like "/" etc., which could be tweaked (in this case: colon, to split into two).
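The separator idea can be sketched as follows. This is only a minimal illustration in plain Python, not the actual Concepticon gloss function; the name `split_gloss` and its default separators are hypothetical.

```python
import re


def split_gloss(gloss, separators=("/", ":")):
    """Split a raw gloss into candidate sub-glosses at any of the separators.

    Hypothetical helper for illustration; runs of separators (e.g. "::")
    are treated as a single split point.
    """
    pattern = "|".join(re.escape(sep) for sep in separators)
    parts = re.split(f"(?:{pattern})+", gloss)
    # Drop surrounding whitespace and empty fragments.
    return [part.strip() for part in parts if part.strip()]


print(split_gloss("tongue: language"))
print(split_gloss("skin: hide:: bark"))
```

Each resulting sub-gloss could then be looked up separately, which is why adding the colon to the separators may raise the share of automatically mapped entries.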

But for 134 concepts, it is straightforward to refine manually, leaving unlinkable things unlinked (adding many new concepts only for this source is not my priority, as I'd focus on SEA languages, so the question is our concepticon policy for unlinked percentages, and we don't have one so far, right?).

xrotwang commented 7 years ago

No. I don't think so. Intuitively, I'd say >95% linked is a requirement.
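Checking a list against such a threshold could look roughly like this. It is a sketch under assumptions: the column names (`ID`, `ENGLISH`, `FRENCH`, `CONCEPTICON_ID`) and the treatment of an empty cell or `???` as "unlinked" are guesses about the TSV layout, not a documented format.

```python
import csv
import io

# Assumption: an empty ID cell or the placeholder "???" means "not linked".
UNLINKED_MARKERS = {"", "???"}


def linked_ratio(rows, id_column="CONCEPTICON_ID"):
    """Return the fraction of rows that carry a concept set ID."""
    rows = list(rows)
    if not rows:
        return 0.0
    linked = sum(
        1 for row in rows
        if (row.get(id_column) or "").strip() not in UNLINKED_MARKERS
    )
    return linked / len(rows)


# Two rows taken from this thread, in the assumed column layout.
SAMPLE = (
    "ID\tENGLISH\tFRENCH\tCONCEPTICON_ID\n"
    "Williamson-1973-134-107\ttongue: language\tlangue, langage\t\n"
    "Williamson-1973-134-94\tskin: hide:: bark\tpeau (d'homme)\t2127\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
ratio = linked_ratio(rows)
print(f"{ratio:.0%} linked; meets 95% threshold: {ratio >= 0.95}")
```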

xrotwang commented 7 years ago

Choosing --language=fr gets 3% :)

LinguList commented 7 years ago

Sad. We need to suck in all of Comparalex in order to increase the French mappings. I'd opt to go down to 90% for linking, in order to allow for very specific but partially interesting datasets, like the one by Ogden (on simple English), etc.

I think it's easiest here to have a quick look at the missing items and manually correct them. I'll have a look later today, if you think it's worth checking in the context of lexibank.

xrotwang commented 7 years ago

Splitting at : as well gets us 88%. So if you want to tackle the remaining problems, better start with this version:

Williamson-1973-134.tsv.txt

LinguList commented 7 years ago

All right! I'll reply in about 1 hour.

LinguList commented 7 years ago

Just one important thing I realized: the Comparalex people split the concepts in the source, creating an n-n mapping! So they have two rows for arm: hand, one with French gloss "main" and one with "bras", which is of course not good. Question is: when linking ourselves, which one do we keep? Is this important in the lexibank context? And how do we link? Do I link to the French gloss or to the English one?

xrotwang commented 7 years ago

I see this not as an n:n mapping, but rather as cases where the English gloss is underspecified. After all, they do have individual IDs and distinct French glosses. So I'd say we link as specifically as any of the glosses allows.

LinguList commented 7 years ago

There are more problems: they have three times the same glosses, such as:

Williamson-1973-134-94  082 skin: hide:: bark   peau (d'homme)  2127    BARK OR SKIN    4
Williamson-1973-134-95  082 skin: hide:: bark   peau (d'animal) 2127    BARK OR SKIN    4
Williamson-1973-134-96  082 skin: hide:: bark   écorce (d'arbre)    2127    BARK OR SKIN    4

but this is clearly a case of one entry in the source being split into n entries in their internal mapping. I guess we can refine the mapping (I'm almost through), but I'd not add it to Concepticon, as I do not know how much this differs from the original.
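To see how often this happens, one could group the rows by their English gloss and list every gloss that occurs more than once. The sketch below assumes each row is reduced to an (ID, English gloss, French gloss) tuple with the leading item number dropped; `find_split_glosses` is a hypothetical helper name.

```python
from collections import defaultdict


def find_split_glosses(rows):
    """Map each English gloss shared by several rows to its (ID, French) pairs."""
    groups = defaultdict(list)
    for row_id, english, french in rows:
        groups[english].append((row_id, french))
    # Keep only glosses that were split into more than one row.
    return {gloss: items for gloss, items in groups.items() if len(items) > 1}


# Rows quoted in this thread, in the assumed (ID, English, French) layout.
ROWS = [
    ("Williamson-1973-134-94", "skin: hide:: bark", "peau (d'homme)"),
    ("Williamson-1973-134-95", "skin: hide:: bark", "peau (d'animal)"),
    ("Williamson-1973-134-96", "skin: hide:: bark", "écorce (d'arbre)"),
    ("Williamson-1973-134-117", "woman: wife", "femme"),
    ("Williamson-1973-134-118", "woman: wife", "femme, épouse"),
]

for gloss, items in find_split_glosses(ROWS).items():
    print(gloss, "->", [french for _, french in items])
```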

LinguList commented 7 years ago

Ah, and the problem here is: we do not know whether it is the source that is underspecified (most likely) or their French translation. So the important question is where we find the original, so that we can see what, for example, the colons mean in the concept label...

xrotwang commented 7 years ago

Ok. We can also add this list as datasets/benuecongo/concepts.csv for the time being.

LinguList commented 7 years ago

Yes, my preferred way to proceed: just add the concept list, but add a note that it's a bit shaky, especially since we do not know why they re-annotated it so heavily.

xrotwang commented 7 years ago

We may have a look at which concepts actually show up in the data.

LinguList commented 7 years ago

for the time being, I'll make a rough mapping, as the lack of a PDF to check with the real data will make it difficult to account for the normal concepticon standards. If properly annotated, this will be transparent in lexibank in the NOTE.md.

Here's my version (we'd have to add some 10 new concept sets to properly integrate it, and we'd need the original PDF to understand some 5 other entries):

Williamson-1973-134.tsv.txt

LinguList commented 7 years ago

Okay, just received the data in scanned form...

chrzyki commented 4 years ago

Closed in favour of https://github.com/concepticon/concepticon-data/issues/878.