bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb

Compare embeddings #29

Closed. loretoparisi closed this issue 5 years ago.

loretoparisi commented 5 years ago

Suppose I have the embedding of come:

import numpy as np
from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", dim=100, vs=100000)
bpemb_en.embed("come").shape
(1, 100)

and I want to compare it with the BPE embedding of, let's say, home:

def cosine_sim(a, b):
    '''Naive cosine similarity between two vectors.'''
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

l1 = bpemb_en.embed("home")
l2 = bpemb_en.embed("come")
>>> cosine_sim(l1[0], l2[0])
0.48083022

and it works. But suppose the BPE produces a different split, and therefore an embedding with a different shape, like:


>>> bpemb_en.encode("c-c-c-c-come")
['▁c', '-', 'c', '-', 'c', '-', 'c', '-', 'come']
>>> bpemb_en.embed("c-c-c-c-come").shape
(9, 100)

How can I compare these two when the shapes differ? I have tried np.mean and np.max, but neither works well. I'm also trying a simple PCA to get the principal components along each axis, like:


from sklearn.decomposition import PCA

def embedding(s):
    # Use only the first subword embedding of s
    return bpemb_en.embed(s)[0]

def doPCA(pairs, embedding, num_components=10):
    matrix = []
    for a, b in pairs:
        center = (embedding(a) + embedding(b)) / 2
        matrix.append(embedding(a) - center)
        matrix.append(embedding(b) - center)
    matrix = np.array(matrix)
    pca = PCA(n_components=num_components)
    pca.fit(matrix)
    # bar(range(num_components), pca.explained_variance_ratio_)
    return pca

pairs = [('come', 'c-c-c-c-come')]
pca = doPCA(pairs, embedding, num_components=1)
l2 = bpemb_en.embed("c-c-c-c-come")  # multi-subword embedding
>>> pca.transform(l1)
array([[-0.242803]], dtype=float32)
>>> pca.transform(l2)
array([[ 0.112916],
       [ 0.402301],
       [-0.147132],
       [ 0.402301],
       [-0.147132],
       [ 0.402301],
       [-0.390385]], dtype=float32)

etc. I wonder whether average pooling (like the one in BERT) could be the right approach.
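For concreteness, this is the kind of average pooling I have in mind, as a minimal sketch (the mean over the subword axis is my own choice here, not something bpemb provides):

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", dim=100, vs=100000)

def mean_pooled(s):
    # Average the (n_subwords, 100) embedding matrix into a single (100,) vector.
    return bpemb_en.embed(s).mean(axis=0)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Both inputs now map to vectors of the same shape, regardless of the BPE split.
print(cosine_sim(mean_pooled("come"), mean_pooled("c-c-c-c-come")))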

bheinzerling commented 5 years ago

This is a common problem in subword-based approaches. What composition function is best for getting word embeddings from subword embeddings? So far, no one has found a good solution that works well in most or all cases.

Just to show that this is an ongoing research topic, here is a recent paper which compares addition, positional embeddings, and attention as composition functions: https://arxiv.org/abs/1904.07994
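Not the models from that paper, but just to sketch what a composition function over byte-pair embeddings looks like in code, here is a toy attention-style pooling (using the mean of the subword vectors as the query is an arbitrary choice for illustration):

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", dim=100, vs=100000)

def compose_attention(subword_vecs):
    # Toy attention: score each subword against the mean of all subwords,
    # softmax the scores, and return the weighted average.
    query = subword_vecs.mean(axis=0)
    scores = subword_vecs @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ subword_vecs

vecs = bpemb_en.embed("c-c-c-c-come")   # shape (n_subwords, 100)
word_vec = compose_attention(vecs)      # shape (100,)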

By the way, in BERT there is no average pooling. Instead, the authors simply pick the state corresponding to the first subword in a given word. This works well because they're not just looking up subword embeddings, but running BERT over the entire subword sequence, so that the state corresponding to one subword is influenced by the subwords around it. You can do something similar by running an LSTM over byte-pair embeddings.
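For example, a minimal PyTorch sketch of that last idea (the LSTM here is untrained and its hyperparameters are arbitrary; in practice you would train it on your downstream task):

import torch
import torch.nn as nn
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", dim=100, vs=100000)

# Run a BiLSTM over the byte-pair embeddings so that each state is
# influenced by the surrounding subwords.
lstm = nn.LSTM(input_size=100, hidden_size=100, bidirectional=True, batch_first=True)

vecs = torch.tensor(bpemb_en.embed("c-c-c-c-come")).unsqueeze(0)  # (1, n_subwords, 100)
states, _ = lstm(vecs)                                            # (1, n_subwords, 200)

# Analogous to BERT's heuristic: take the state of the first subword
# as the representation of the whole word.
word_vec = states[0, 0]                                           # (200,)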

To sum up, the best answer I can give is: you have to try and see what works best for your use case.

loretoparisi commented 5 years ago

@bheinzerling thank you for the reference; yes, it seems this is still an unsolved problem. The recent work Subword-based Compact Reconstruction of Word Embeddings (https://www.aclweb.org/anthology/N19-1353) might also be of interest; it points out some important considerations about subword reconstruction and quality degradation across benchmark datasets.