concepticon / concepticon-data

The curation repository for the data behind Concepticon.
https://concepticon.clld.org

refine concept mapping algorithm and handling #382

Open LinguList opened 6 years ago

LinguList commented 6 years ago

The following is unexpected:

$ concepticon lookup aubergine
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
aubergine   1146    AUBERGINE   4

$ concepticon lookup "aubergine (noun)"
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
aubergine (noun)    1146    AUBERGINE   3

Our rule says: if there is no POS (part-of-speech) information, penalize the match; the top score is obtained only upon identity. But compare:

$ concepticon lookup "THE AUBERGINE"
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
THE AUBERGINE   1146    AUBERGINE   1

The scores need a better logic, and we should have a convincing scoring scheme.
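
To make the discussion concrete, here is a minimal sketch of one possible scoring scheme. It is not the algorithm behind the lookup command. It assumes a rank where 1 is best, reads "identity" as identical gloss plus matching POS, and ranks a gloss without POS below that. The names `ARTICLES`, `parse`, and `score` are hypothetical.

```python
import re

# Hypothetical stop words that are ignored in the loosest kind of match.
ARTICLES = {"the", "a", "an"}


def parse(gloss):
    """Split 'aubergine (noun)' into ('aubergine', 'noun'); POS may be None."""
    match = re.fullmatch(r"\s*(.*?)\s*(?:\((\w+)\))?\s*", gloss)
    text, pos = match.group(1), match.group(2)
    return text.lower(), (pos.lower() if pos else None)


def score(query, target_gloss, target_pos):
    """Return a rank from 1 (best) to 4, or None if there is no match.

    1: same gloss and same POS (identity)
    2: same gloss, but the query carries no POS
    3: same gloss only after dropping articles, POS not conflicting
    4: gloss matches, but the POS conflicts
    """
    text, pos = parse(query)
    target = target_gloss.lower()
    if text == target:
        if pos == target_pos:
            return 1
        return 2 if pos is None else 4
    words = [w for w in text.split() if w not in ARTICLES]
    if words == target.split():
        return 3 if pos in (None, target_pos) else 4
    return None


for query in ("aubergine (noun)", "aubergine", "THE AUBERGINE"):
    print(query, score(query, "AUBERGINE", "noun"))
```

Under these assumptions the three glosses from above would be ranked 1, 2, and 3, so only the identical gloss with POS gets the top score.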

LinguList commented 6 years ago

I just figured out that calculating the self.frequencies of Concepticon takes an extremely long time, since it reads every list, and this is hampering our automatic lookup. I would suggest either storing the frequencies explicitly in a text file, maybe in pyconcepticon/data/, and recomputing them once in a while, or dropping them completely (although frequencies are useful).
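
The first option could look roughly like the sketch below: compute the frequencies once, write them to a tab-separated text file, and read that file on later calls. This is only an illustration. `load_frequencies`, its parameters, and the file name are hypothetical, and `compute` stands for whatever slow routine reads every list and returns a mapping from key to count.

```python
import csv
from collections import Counter
from pathlib import Path


def load_frequencies(cache_path, compute, refresh=False):
    """Return frequencies from a TSV cache, recomputing only when needed.

    cache_path: location of the text file (hypothetical, e.g. somewhere
                below pyconcepticon/data/).
    compute:    callable returning a mapping of key -> count; stands in
                for the slow pass over every list.
    refresh:    set to True to force a recomputation "once in a while".
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not refresh:
        # Fast path: read the stored counts instead of recomputing them.
        with cache_path.open(newline="", encoding="utf-8") as fh:
            rows = csv.reader(fh, delimiter="\t")
            return Counter({key: int(count) for key, count in rows})
    # Slow path: compute once, then store the result as plain text.
    freqs = Counter(compute())
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerows(sorted(freqs.items()))
    return freqs
```

A plain TSV file keeps the stored counts easy to inspect and to diff, at the cost of having to refresh them by hand whenever lists are added.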

LinguList commented 2 years ago

We can argue that the pysem library now offers a more consistent mapping. We would only need to add command-line functionality.