This is a common problem in subword-based approaches. What composition function is best for getting word embeddings from subword embeddings? So far, no one has found a good solution that works well in most or all cases.
Just to show that this is an ongoing research topic, here is a recent paper which compares addition, positional embeddings, and attention as composition functions: https://arxiv.org/abs/1904.07994
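For reference, the simplest composition functions (summing or averaging the subword vectors) can be sketched like this; the BPEmb parameters below (`vs=50000`, `dim=100`) are just an illustrative choice, not a recommendation:

```python
from bpemb import BPEmb

# Illustrative model: English BPEmb with a 50k subword vocabulary, 100-dim vectors.
bpemb_en = BPEmb(lang="en", vs=50000, dim=100)

def compose_word_embedding(word, how="sum"):
    """Compose one fixed-size word vector from the word's subword embeddings."""
    subword_vecs = bpemb_en.embed(word)  # shape: (n_subwords, dim)
    if how == "sum":
        return subword_vecs.sum(axis=0)
    if how == "mean":
        return subword_vecs.mean(axis=0)
    raise ValueError(f"unknown composition: {how}")
```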
By the way, in BERT there is no average pooling. Instead, the authors simply pick the state corresponding to the first subword in a given word. This works well because they're not just looking up subword embeddings, but running BERT over the entire subword sequence, so the state corresponding to one subword is influenced by the subwords around it. You can do something similar by running an LSTM over byte-pair embeddings.
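A rough sketch of that LSTM idea (not BERT itself): contextualize the byte-pair embeddings with an LSTM, then keep the state corresponding to the first subword of each word. The tensor sizes and word-boundary indices below are made up for illustration:

```python
import torch
import torch.nn as nn

dim = 100
lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

# Placeholder input standing in for a (1, n_subwords, dim) batch of BPE embeddings.
subword_embs = torch.randn(1, 7, dim)
# Hypothetical indices of the first subword of each word in the sequence.
first_subword_idx = [0, 2, 5]

states, _ = lstm(subword_embs)              # (1, n_subwords, dim), contextualized
word_states = states[0, first_subword_idx]  # one context-aware vector per word
```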
To sum up, the best answer I can give is: you have to try and see what works best for your use case.
@bheinzerling thank you for the reference, yes it seems that so far it is an unsolved problem. The recent work Subword-based Compact Reconstruction of Word Embeddings (https://www.aclweb.org/anthology/N19-1353) may also be interesting; it points out some important considerations about subword reconstruction and quality degradation across benchmark datasets.
Suppose I have the embedding of `come` and I want to compare it with the BPE embeddings of, let's say, `home`: this works, but suppose BPE produces a different split, so the embedding has a different shape. How can I compare embeddings with different shapes? I have tried `np.mean` and `np.max`, but they do not work well; I am also trying a simple PCA to get the principal components of each axis. My question is whether an average pooling (like the one in BERT) could be the right approach.