Closed xrotwang closed 4 years ago
Here's the list as parsed from Comparalex:
The list parsed from Comparalex and automatically mapped:
The stats are not that good: only 59% linked. And there are a couple of difficult cases, like
Williamson-1973-134-107 093 tongue: language langue, langage ???
Williamson-1973-134-108 093 tongue: language langue ???
or
Williamson-1973-134-117 102 woman: wife femme ???
Williamson-1973-134-118 102 woman: wife femme, épouse ???
i.e. almost identical concepts appearing twice.
Well, the gloss function in the Concepticon takes as input, as far as I know, a set of "separators", like "/" etc., which could be tweaked (in this case: add the colon, to split each gloss into two).
But for 134 concepts, it is straightforward to refine manually, leaving unlinkable things unlinked. Adding many new concepts only for this source is not my priority, as I'd focus on SEA languages. So the question is what our Concepticon policy on unlinked percentages is, and we don't have one so far, right?
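A minimal sketch of what splitting at configurable separators could look like. `split_gloss` and its `separators` parameter are hypothetical names for illustration, not the actual Concepticon gloss function:

```python
import re

# Hypothetical helper (not the Concepticon API): split an English gloss
# into candidate sub-glosses at any run of the given separator characters.
def split_gloss(gloss, separators="/,:"):
    pattern = "[" + re.escape(separators) + "]+"
    parts = [part.strip() for part in re.split(pattern, gloss)]
    # Drop empty pieces left over from leading or trailing separators.
    return [part for part in parts if part]

print(split_gloss("tongue: language"))   # ['tongue', 'language']
print(split_gloss("skin: hide:: bark"))  # ['skin', 'hide', 'bark']
```

Each sub-gloss could then be looked up on its own, which is presumably why adding the colon as a separator raises the share of mapped items.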
No. I don't think so. Intuitively, I'd say >95% linked is a requirement.
Choosing --language=fr gets 3% :)
Sad. We need to suck in all of Comparalex in order to increase the French mappings. I'd opt to go down to 90% for linking, in order to allow for very specific but partially interesting datasets, like the one by Ogden (on simple English), etc.
I think it's easiest here to have a quick look at the missing items and manually correct them. I'll have a look later today, if you think it's worth checking in the context of lexibank.
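To make the threshold discussion concrete, here is a small sketch of how a linking requirement could be checked. The function names, the `CONCEPTICON_ID` key, and the toy rows are assumptions for illustration, not an existing policy check:

```python
# Hypothetical helpers: a row counts as linked if it has a non-empty
# CONCEPTICON_ID value (the key name is an assumption).
def linked_share(rows):
    if not rows:
        return 0.0
    linked = sum(1 for row in rows if row.get("CONCEPTICON_ID"))
    return linked / len(rows)

def meets_threshold(rows, threshold=0.90):
    return linked_share(rows) >= threshold

# Toy data, made up for illustration: 10 rows, 9 of them linked.
rows = [{"CONCEPTICON_ID": "2127"}] * 9 + [{"CONCEPTICON_ID": ""}]
print(linked_share(rows))                      # 0.9
print(meets_threshold(rows, threshold=0.90))   # True
print(meets_threshold(rows, threshold=0.95))   # False
```

With such a check, a list like this toy one would pass a 90% requirement but fail a 95% one.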
Splitting at : as well gets us 88%. So if you want to tackle the remaining problems, rather start with that.
All right! I'll reply in about 1 hour.
Just one important thing I realized: the Comparalex people split the concepts in the source, creating an n:n mapping! So they have two rows for arm: hand, one with the French gloss "main" and one with "bras", which is of course not good. The question is: when linking ourselves, which one do we keep? Is this important in the lexibank context? And how do we link? Do I link to the French gloss or to the English one?
I see this not as an n:n mapping, but rather as cases where the English gloss is underspecified. After all, they do have individual IDs and distinct French glosses. So I'd say we link as specifically as any of the glosses allows.
There are more problems: they have three times the same glosses, such as:
Williamson-1973-134-94 082 skin: hide:: bark peau (d'homme) 2127 BARK OR SKIN 4
Williamson-1973-134-95 082 skin: hide:: bark peau (d'animal) 2127 BARK OR SKIN 4
Williamson-1973-134-96 082 skin: hide:: bark écorce (d'arbre) 2127 BARK OR SKIN 4
but this is clearly a case of 1 concept in the source being interpreted as n concepts in their internal mapping. I guess we can refine the mapping (I'm almost through), but I'd not add it to Concepticon, as I do not know how much this differs from the original.
Ah, and the problem here is: we do not know whether it is the source that is underspecified (most likely) or their French translation. So the important question is where we find the original, so that we can see what, for example, the colons mean in the concept label...
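The 1-to-n cases mentioned in this thread could be listed mechanically. This is a rough sketch using only rows quoted above; `find_split_concepts` is a hypothetical helper, and the tuple layout is an assumption about how the parsed list might be represented:

```python
from collections import defaultdict

# Rows quoted in this thread: (ID, source number, English gloss, French gloss).
rows = [
    ("Williamson-1973-134-94", "082", "skin: hide:: bark", "peau (d'homme)"),
    ("Williamson-1973-134-95", "082", "skin: hide:: bark", "peau (d'animal)"),
    ("Williamson-1973-134-96", "082", "skin: hide:: bark", "écorce (d'arbre)"),
    ("Williamson-1973-134-117", "102", "woman: wife", "femme"),
    ("Williamson-1973-134-118", "102", "woman: wife", "femme, épouse"),
]

# Hypothetical helper: group rows by source number and English gloss and
# keep only the groups where one source concept appears in several rows.
def find_split_concepts(rows):
    groups = defaultdict(list)
    for row_id, number, english, french in rows:
        groups[(number, english)].append((row_id, french))
    return {key: members for key, members in groups.items() if len(members) > 1}

for (number, english), members in find_split_concepts(rows).items():
    print(number, english, [french for _, french in members])
```

This only flags candidates for manual review; it cannot tell whether the source or the French translation is the underspecified side.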
Ok. We can also add this list as datasets/benuecongo/concepts.csv for the time being.
Yes, my preferred way to proceed: just add the concept list, but add a note that it's a bit shaky, especially since we do not know why they re-annotated it so heavily.
We may have a look at which concepts actually show up in the data.
For the time being, I'll make a rough mapping, since without a PDF to check against the real data it will be difficult to meet the normal Concepticon standards. If properly annotated, this will be transparent in lexibank in the NOTE.md.
Here's my version (we'd have to add some 10 new concept sets to integrate it properly, and we'd need the original PDF to understand some 5 other entries):
Okay, just received the data in scanned form...
Closed in favour of https://github.com/concepticon/concepticon-data/issues/878.
Williamson's Benue-Congo Comparative Word List seems to be one of the few lists in Comparalex, where the actual wordlists are also available. Thus, we should add it to Concepticon as well as lexibank.
'Notes (from Comparalex): The "Benue-Congo Comparative Wordlist" has been used in the study of non-Bantu Benue-Congo languages of Nigeria and Cameroon.