digling / intelligibility

Comparability of Semantic Vectors Across Languages #6

Closed LinguList closed 7 months ago

LinguList commented 11 months ago

@justalingwist, I am asking myself how comparable these semantic vectors are across languages in the end. They are trained on separate data, not with multilingual training, so the dimensions are completely independent of each other.

So if you have semantic vectors V-Dt and V-Gr, with 100 dimensions each, how would you compare the two anyway? Cosine similarity would not work, since the dimensions are not guaranteed to correspond. Unlike in phonetics, where we have "substances" in the form of sounds / graphemes / letters, we have no such substance in vectors, which are purely language-internal.
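
To make the problem concrete, here is a minimal sketch (pure numpy, toy 100-dimensional vectors, not our actual data): rotating one of the two spaces leaves every language-internal similarity intact, but changes the cross-language cosine arbitrarily, so the raw number carries no information.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

v_dt = rng.normal(size=100)  # toy "Dutch" vector
v_gr = rng.normal(size=100)  # toy "German" vector

# A random orthogonal matrix: rotating the German space preserves all
# German-internal cosines, but nothing constrains it across the two spaces.
q, _ = np.linalg.qr(rng.normal(size=(100, 100)))

print(cosine(v_dt, v_gr))      # one arbitrary value
print(cosine(v_dt, q @ v_gr))  # a different, equally "valid" value
```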

So my question now is: if accuracy goes down, is that maybe because of the vector-comparison problem? And how can we overcome it?

LinguList commented 11 months ago

One could maybe even check with cognates whether the cosine distances of semantic vectors between Dt and Gr differ when comparing cognate words vs. non-cognate words. My guess is: they won't differ, due to the language-internal construction of the vectors.
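
A rough sketch of how the check could look (assuming we have already extracted, for each concept pair, a German vector, a Dutch vector, and a cognacy flag; the extraction itself is left out here):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def cognate_vs_noncognate(pairs):
    # pairs: list of (german_vector, dutch_vector, is_cognate) triples
    cognate = [cosine(g, d) for g, d, c in pairs if c == 1]
    noncognate = [cosine(g, d) for g, d, c in pairs if c == 0]
    return np.mean(cognate), np.mean(noncognate)
```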

LinguList commented 11 months ago

One could probably test this with cross-linguistic semantic vectors, but I don't know of any implementations that are ready to use here.

justalingwist commented 11 months ago

@LinguList What I’m using for the project at the moment are numberbatch vectors, and these are in fact multilingual vectors that are supposed to be comparable across languages. But I agree that we can check how comparable they are for cognates and non-cognates, that is, how comparable they really are for each word form.
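
For illustration, looking words up in the shared multilingual space can be as simple as this sketch (assuming the multilingual release, e.g. numberbatch-19.08.txt.gz in word2vec text format with /c/<lang>/<term> keys, and that both terms are in the vocabulary; file name and version may differ):

```python
from gensim.models import KeyedVectors

# Load the multilingual numberbatch vectors (word2vec text format).
vectors = KeyedVectors.load_word2vec_format(
    "numberbatch-19.08.txt.gz", binary=False)

# German and Dutch forms live in one shared space:
print(vectors.similarity("/c/de/hund", "/c/nl/hond"))
```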

justalingwist commented 11 months ago

And since they are multilingual and are supposed to capture similarities across languages, I’m not too worried. Also, since what we want to mirror here is human speech production and comprehension, having the exact same semantics for pairs of words wouldn’t make sense anyway. What we wanted to test here is how far I can get if I know the phonology and meaning in one language and apply this knowledge to a new, neighboring language. And I think with the current setup this is exactly what we do; we just shouldn’t expect perfect model results.

LinguList commented 11 months ago

Is there a publication on numberbatch? I would like to know HOW they actually construct multilingual vectors.

justalingwist commented 11 months ago

Yes, it’s here: https://ojs.aaai.org/index.php/AAAI/article/download/11164/11023

LinguList commented 11 months ago

I think what we must do is check how well these vectors account for translation. I don't find that the paper explains the embeddings well, but that aside, suppose we have:

  1. a list of 500 terms in German and Dutch, translational equivalents (aligned with basic concepts)
  2. a list of 500 cognates in German and Dutch

If the embeddings are also meaningful in a cross-lingual sense, we'd expect the similarities between the term pairs in 1 to be higher than in random pairings where we shuffle the list; the pairs in 2 should also score high, but not necessarily, due to semantic shift.
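
For point 1, a shuffle test along these lines should work (a sketch, assuming de_vecs and nl_vecs are row-aligned numpy arrays of shape (500, dim), row i of each holding the vectors of one translation pair):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_cosine(a, b):
    # mean cosine similarity over row-aligned vector pairs
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def shuffle_test(de_vecs, nl_vecs, n_perm=1000):
    observed = mean_cosine(de_vecs, nl_vecs)
    # baseline: mean cosine when the Dutch rows are randomly permuted
    baseline = [mean_cosine(de_vecs, rng.permutation(nl_vecs))
                for _ in range(n_perm)]
    return observed, float(np.mean(baseline))
```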

Can we easily conduct these experiments / tests if I provide you with the list of translational equivalents, which I happen to have in fact?

justalingwist commented 11 months ago

yes, please send the list!

LinguList commented 11 months ago

It is uploaded under data/comparative-wordlist.tsv.

LinguList commented 11 months ago

I added a shell script (requires the Python packages pyedictor and lingpy) that downloads the data and converts it. Sound classes are then added with a new script, prepare-wordlist.py, which creates the file in data.
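
The sound-class step uses lingpy's standard helpers, roughly like this (a sketch, not the actual prepare-wordlist.py; it assumes the input forms are IPA strings):

```python
from lingpy import ipa2tokens, tokens2class

form = "hʊnt"                          # e.g. German "Hund"
tokens = ipa2tokens(form)              # segment the IPA string
classes = tokens2class(tokens, "sca")  # map segments to SCA sound classes
print(tokens, classes)
```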

LinguList commented 11 months ago

Cognacy information is also available.

LinguList commented 11 months ago

1 means the two words are cognate, 0 means they aren't.

justalingwist commented 11 months ago

@LinguList I uploaded the Python code for computing semantic similarity across the word pairs in comparative-wordlist.tsv. Overall, the semantic similarity between the word pairs is pretty high. I would conclude from this that we can justify using the numberbatch vectors for our cross-language modeling approach.
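
The computation boils down to something like this sketch (assuming the gensim vectors object loaded above and, hypothetically, GERMAN and DUTCH columns in the TSV; the uploaded script may differ in the details):

```python
import csv

sims, misses = [], []
with open("data/comparative-wordlist.tsv", encoding="utf8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        de = "/c/de/" + row["GERMAN"].lower()
        nl = "/c/nl/" + row["DUTCH"].lower()
        if de in vectors and nl in vectors:
            sims.append(vectors.similarity(de, nl))
        else:
            misses.append((row["GERMAN"], row["DUTCH"]))

print(sum(sims) / len(sims), len(misses))
```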

LinguList commented 11 months ago

So we have about 10 cases where you cannot match the words in German and Dutch, but the rest is fine? That is okay then, although I wonder whether this is just due to ß in German, so that replacing ß with ss could even handle those terms?
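
The fallback could look like this (a hypothetical helper; whether it fixes the missing cases depends on how the numberbatch vocabulary is normalized):

```python
def lookup_german(vectors, word):
    # try the plain lowercased form first
    key = "/c/de/" + word.lower()
    if key in vectors:
        return vectors[key]
    # retry with ß replaced by ss, as suggested above
    alt = key.replace("ß", "ss")
    return vectors[alt] if alt in vectors else None
```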

LinguList commented 11 months ago

And could we output the data with the vectors to the data folder? Then we could load them more easily, and also separate the computation of the cosine similarities from the initial extraction of the vectors...
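
A sketch of how that could look (assuming a hypothetical data/vectors.tsv layout with language, form, and the vector components per line):

```python
import numpy as np

def save_vectors(path, entries):
    # entries: list of (language, form, numpy_vector) triples
    with open(path, "w", encoding="utf8") as f:
        for lang, form, vec in entries:
            f.write("\t".join([lang, form] + [str(x) for x in vec]) + "\n")

def load_vectors(path):
    table = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            lang, form, *values = line.rstrip("\n").split("\t")
            table[lang, form] = np.array([float(x) for x in values])
    return table
```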