refine concept mapping algorithm and handling

LinguList commented 6 years ago

[ ] modify "map_concepts" to "map" or "link" in cli.py (underscore annoys too much)
[ ] check description for scores and add them to some tutorial
[ ] refine mapping for some inconsistent cases (see below)
[ ] don't convert everything to lowercase; keep case as an additional layer of similarity (for glosses that are otherwise identical)

The following is unexpected:

$ concepticon lookup aubergine
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
aubergine   1146    AUBERGINE   4

$ concepticon lookup "aubergine (noun)"
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
aubergine (noun)    1146    AUBERGINE   3

Our rule says: a gloss without POS information is penalized, but the top score is only obtained on identity:

$ concepticon lookup "THE AUBERGINE"
GLOSS   CONCEPTICON_ID  CONCEPTICON_GLOSS   SIMILARITY
THE AUBERGINE   1146    AUBERGINE   1

The scores need a better logic, and we should have a convincing scoring scheme.
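
To make the discussion concrete, here is a minimal sketch of what an explicit scoring scheme could look like. It is not the current pyconcepticon logic: the function names, the stop word list, the POS tags, and the rank scale (1 is best, 4 is weakest, which may not match the direction of the SIMILARITY column above) are all assumptions made for illustration.

```python
import re

# Hypothetical stop words and POS tags; not taken from pyconcepticon.
STOPWORDS = {"the", "a", "an", "to"}
POS_PATTERN = re.compile(r"\s*\((noun|verb|adjective|adverb)\)\s*$", re.IGNORECASE)


def normalize(gloss):
    """Return (core, pos, stripped) for a gloss.

    core     -- the gloss without a trailing POS tag and without stop words
    pos      -- the POS tag in lowercase, or None if the gloss has none
    stripped -- True if at least one stop word was removed
    """
    match = POS_PATTERN.search(gloss)
    pos = match.group(1).lower() if match else None
    words = POS_PATTERN.sub("", gloss).split()
    # Fall back to the original words if the gloss consists only of stop words.
    kept = [w for w in words if w.lower() not in STOPWORDS] or words
    return " ".join(kept), pos, len(kept) != len(words)


def score(gloss, concept_gloss, concept_pos):
    """Rank a match from 1 (best) to 4 (weakest); None means no match."""
    if gloss == concept_gloss:
        return 1  # full identity, including case
    core, pos, stripped = normalize(gloss)
    if core.lower() != concept_gloss.lower():
        return None
    if pos is not None and pos != concept_pos:
        return None  # the gloss names a POS that conflicts with the concept
    if stripped:
        return 4  # only matches after dropping stop words
    # A gloss without POS information is penalized relative to one with it.
    return 2 if pos is not None else 3


for gloss in ["AUBERGINE", "aubergine (noun)", "aubergine", "THE AUBERGINE"]:
    print(gloss, score(gloss, "AUBERGINE", "noun"), sep="\t")
```

With this ordering, identity wins, a compatible POS tag beats a bare gloss, and a match that needs stop word removal comes last, so the three lookups above would get consistent ranks.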

LinguList commented 6 years ago

I just noticed that computing self.frequencies in Concepticon takes an extremely long time, since it reads every list, and this slows down our automatic lookup. I would suggest either storing the frequencies explicitly in a text file, maybe in pyconcepticon/data/, and recomputing them once in a while, or dropping them completely (although frequencies are useful).
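
The first option could look roughly like the sketch below. It assumes the frequencies fit into a plain mapping from concept ID to count and uses JSON as the text format. `load_frequencies`, the `compute` callback, and the 30-day refresh interval are hypothetical names and values, not part of pyconcepticon.

```python
import json
import time
from pathlib import Path


def load_frequencies(cache_path, compute, max_age_days=30):
    """Return cached frequencies, recomputing them if the cache is missing or stale.

    cache_path   -- where the text file lives (e.g. somewhere under pyconcepticon/data/)
    compute      -- callable that does the slow work of reading every list
    max_age_days -- recompute once the cache file is older than this
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        age_days = (time.time() - cache_path.stat().st_mtime) / 86400
        if age_days <= max_age_days:
            return json.loads(cache_path.read_text(encoding="utf-8"))
    frequencies = compute()  # the slow step, now only run occasionally
    cache_path.write_text(json.dumps(frequencies, sort_keys=True), encoding="utf-8")
    return frequencies
```

Lookups would then only pay the full cost when the file is missing or older than the chosen interval. Note that JSON turns keys into strings, so concept IDs would need to be stored and compared as strings.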

LinguList commented 2 years ago

We can argue that the pysem library offers a more consistent mapping now. We would only need to add command-line functionality.

concepticon / concepticon-data

refine concept mapping algorithm and handling #382