Relatedness score of 0.0 for every Thai word pair

alexpulich commented 4 years ago

Hi! Thank you for your work.

I'm currently facing an issue with the /relatedness endpoint: it returns a score of 0.0 for every Thai word pair I try. It works fine for English, for example.

For instance, for the request http://api.conceptnet.io/relatedness?node1=/c/th/ไม้&node2=/c/th/ป่า I used to get a score of 0.375 (saved from previous experiments), but now the response is

{
  "@context": [
    "http://api.conceptnet.io/ld/conceptnet5.7/context.ld.json"
  ],
  "@id": "/relatedness?node1=/c/th/ไม้&node2=/c/th/ป่า",
  "value": 0.0
}

rspeer commented 4 years ago

Thanks for the bug report! I want to try to track this down, but I'm finding myself pretty confused that we ever returned non-zero results in Thai through the /relatedness web API.

We're rolling out ConceptNet 5.8 so I can understand that something might have been broken, but this is something that hasn't changed recently. The Web API uses a memory-constrained, "miniaturized" version of ConceptNet Numberbatch. It prunes the vocabulary and limits it to the top 10 languages in ConceptNet (en, fr, de, it, es, ru, pt, ja, zh, nl).

You should get better results, and any results at all in Thai, if you use the downloadable conceptnet-numberbatch embeddings, such as version 19.08 from: https://github.com/commonsense/conceptnet-numberbatch

But I still want to understand -- was relatedness in Thai actually working in the web API until recently?

alexpulich commented 4 years ago

Thank you for your advice, I will check it out.

As to reproducibility, I used a small function I wrote to request your API:

import requests

BASE_URL = 'http://api.conceptnet.io/relatedness?node1=/c/%s/%s&node2=/c/%s/%s'

def get_similarity(word1, word2, lang='th'):
    # Query the /relatedness endpoint for two terms in the same language.
    score = requests.get(BASE_URL % (lang, word1, lang, word2)).json()['value']
    print(f'score for {word1} and {word2} = {score}')
    # Zero and negative scores are returned as None.
    return score if score > 0 else None

Then I walked through the pairs in word similarity datasets for Thai and gathered the scores from your API into a dict, which I saved with pickle. Some responses had negative scores or a value of 0.0, but there were also positive scores, which I still have in my saved dict. The example above comes from those experiments, which I ran about a week or two ago.

rspeer commented 4 years ago

Following up on this: we found out why relatedness used to be returning occasional non-zero scores in languages that the API doesn't support, such as Thai, and it was because of one part of the out-of-vocabulary strategy that we used before 5.8.

The strategy was, if a word is not in the vocabulary of the embeddings, look up its neighbors in ConceptNet and use the average of their embeddings (when they exist) instead.
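To make that concrete, here is a minimal sketch of the fallback as described above. It is not the actual ConceptNet code: the function name `oov_vector`, the `embeddings` and `neighbors` dictionaries, and the toy vectors are all hypothetical. It also shows how a Thai term could pick up a non-zero vector from neighbors in languages the API does load.

```python
def oov_vector(term, embeddings, neighbors):
    """Return the vector for `term`, falling back to the mean of its neighbors.

    `embeddings` maps a term URI to a list of floats; `neighbors` maps a term
    URI to the URIs it is connected to in the graph. Both are hypothetical
    stand-ins for the real data structures.
    """
    if term in embeddings:
        return embeddings[term]
    # Keep only the neighbors that actually have an embedding.
    known = [embeddings[n] for n in neighbors.get(term, []) if n in embeddings]
    if not known:
        return None  # nothing to average
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy data: the Thai term has no embedding, but two of its neighbors do.
embeddings = {"/c/en/wood": [1.0, 0.0], "/c/en/tree": [0.0, 1.0]}
neighbors = {"/c/th/ไม้": ["/c/en/wood", "/c/en/tree", "/c/th/ป่า"]}

print(oov_vector("/c/th/ไม้", embeddings, neighbors))  # [0.5, 0.5]
```

A term with no embedded neighbors gets no vector at all in this sketch.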

This strategy was unwieldy to use at evaluation time, and made ConceptNet Numberbatch look worse when evaluated by others, who probably hadn't set up a ConceptNet database and therefore had to skip this step. So, starting in 5.8, we baked this into the embeddings, expanding their vocabulary to cover much more of the vocabulary of ConceptNet. But the fact that we can do that doesn't change that we can only support a certain number of languages in the API, due to memory constraints.

In summary:

- We didn't expect the relatedness API to give results in Thai.
- The results are probably not very good, because the API does not load any embeddings of Thai words.
- I highly recommend you directly use ConceptNet Numberbatch, which does support Thai, instead of the API, which doesn't.
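As a rough sketch of what using the embeddings directly could look like: this assumes the download is a text file in word2vec-style format (a header line with the row count and dimension, then one `label v1 v2 ...` line per term) with labels of the form `/c/<lang>/<term>`. The function names, the file name in the comment, and the sample values are illustrative assumptions, not real data.

```python
def load_vectors(lines, prefix="/c/th/"):
    """Read word2vec-style text lines, keeping only labels that start with `prefix`."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip("\n").split(" ")
        if i == 0 and len(parts) == 2:
            continue  # header line: "<number of rows> <dimension>"
        label, values = parts[0], parts[1:]
        if label.startswith(prefix):
            vectors[label] = [float(x) for x in values]
    return vectors


def relatedness(vectors, term1, term2):
    """Dot product of two term vectors; assumes the vectors have unit length."""
    if term1 not in vectors or term2 not in vectors:
        return None  # at least one term is out of vocabulary
    return sum(a * b for a, b in zip(vectors[term1], vectors[term2]))


# Made-up sample in the assumed format (not real Numberbatch values).
sample = [
    "3 2",
    "/c/th/ไม้ 0.6 0.8",
    "/c/th/ป่า 0.8 0.6",
    "/c/en/wood 1.0 0.0",
]
vectors = load_vectors(sample)
print(relatedness(vectors, "/c/th/ไม้", "/c/th/ป่า"))

# With a real download (file name is an assumption), something like:
#   import gzip
#   with gzip.open("numberbatch-19.08.txt.gz", "rt", encoding="utf-8") as f:
#       vectors = load_vectors(f)
```

Filtering by prefix while reading keeps memory use down when only one language is needed.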

If you need to be able to reproduce results from the 5.7 API independently of whether they were any good, I think the best approach would be to run your own copy of the API server, and check out the version5.7 branch of the ConceptNet code.

HeroadZ commented 4 years ago

@rspeer Could you tell me how you calculate relatedness based on ConceptNet Numberbatch? Is it just cosine similarity or dot product? Sorry if it's a stupid question.

rspeer commented 4 years ago

It's fine! Dot product should do the job, because the vectors are normalized to unit length anyway, unless they're the zero vector.
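A small illustration of why the two agree, using made-up 2-dimensional vectors rather than real Numberbatch data. Returning 0.0 for the cosine of a zero vector is a convention chosen for this sketch, not something the API is confirmed to do.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: cosine is undefined for a zero vector
    return dot(u, v) / (norm_u * norm_v)


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else v


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]
zero = [0.0, 0.0]

# For unit vectors both norms are 1, so the cosine reduces to the dot product.
print(math.isclose(dot(a, b), cosine(a, b)))  # True
# A zero vector gives a dot product of 0.0 with anything.
print(dot(a, zero))  # 0.0
```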

commonsense / conceptnet5

Relatedness score of 0.0 for every Thai word pair #288