lparam opened this issue 6 years ago
Hello. Thanks again for the great work! I'm also interested in training BPEmb embeddings on my custom data. Is there a way, or an example, showing how to do this?
Best Regards
+1
Most of my original training script deals with training many different embeddings for all languages on a cluster (not sure how much sense it makes to share this), but the basic procedure is quite simple:
sentencepiece_dir=/install/sentencepiece/and/set/this/path
glove_dir=/install/glove/and/set/this/path
corpus=corpus.txt
corpus_preproc=corpus_preproc.txt
vocab_size=100000
emb_dim=100
model_type=bpe
model_prefix=${corpus_preproc}.${model_type}.${vocab_size}
emb_out=$model_prefix.d${emb_dim}
# preprocessing
# you probably want to lowercase everything and replace all digits with 0
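# for example (not part of the original script), lowercasing and digit replacement could look like:
#   tr '[:upper:]' '[:lower:]' < $corpus | sed 's/[0-9]/0/g'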
# the preprocessing I used is quite specific to Wikipedia, depending on your corpus you can do something much simpler
# remove wikipedia section header === and article title ''' markers, silly sentence split on " " and remove initial whitespace
sed "s/===\+/\n/g;s/'''//g;s/ /\n/g" $corpus | perl -C -pe 's/\x{200B}|\x{200C}|\x{200D}|\x{200E}|\x{202C}|\x{96}//g' | tr -s [[:blank:]] " " | sed -re 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g;s#(https?://[^">< ]+)#🔗#g;s/[0-9]/0/g;s/^ \+//' | grep ".\{100\}" | sed "s/^ //" > $corpus_preproc
# train SentencePiece model
$sentencepiece_dir/bin/spm_train --split_by_whitespace true --input $corpus_preproc --model_prefix $model_prefix --vocab_size $vocab_size --model_type $model_type
# encode preprocessed corpus with the trained SentencePiece model
model_file=${model_prefix}.model
corpus_encoded=corpus_encoded.txt
# encoding to numerical IDs (--output_format=id) saves you headaches if your corpus contains weird whitespace characters that might get treated differently between SentencePiece and Glove. You can leave this out if your corpus is quite clean.
cat $corpus_preproc | $sentencepiece_dir/bin/spm_encode --model $model_file --output $corpus_encoded --extra_options=bos:eos # --output_format=id
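# each line of $corpus_encoded now holds the space-separated pieces of the corresponding input line
# (or their numeric IDs if you use --output_format=id), wrapped in <s> ... </s> due to --extra_options=bos:eos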
# train BPE embeddings with GloVe
$glove_dir/run.sh $corpus_encoded $emb_out $emb_dim
This will give you BPE embeddings in GloVe format in ${emb_out}.glove.txt.
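As a quick sanity check you can load the result with gensim; a minimal sketch, assuming the GloVe output was written with a word2vec-style header (the training script further down in this thread passes -write-header 1; without a header, convert the file first with gensim's glove2word2vec script). The path is hypothetical, following the naming above:

from gensim.models import KeyedVectors

# i.e. ${emb_out}.glove.txt with the settings above
emb = KeyedVectors.load_word2vec_format("corpus_preproc.txt.bpe.100000.d100.glove.txt", binary=False)
print(emb.vector_size)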
I copy&pasted this from my actual scripts, let me know if this works for you.
Finally, the embeddings in GloVe format are in a different order than the subwords in the BPE vocabulary, so the last step is to reorder them. If the above works for you, I can think of a way to properly add this to the repo (not just as a comment) and maybe turn it into a push-button solution.
Thank you very much!
Hello, I managed to train my own embeddings with GloVe based on your SentencePiece model, and then tried to load them into BPEmb as you described in #23, using:
from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file
bpemb = BPEmb(lang='en')
bpemb.spm = sentencepiece_load('/some/folder/en.wiki.bpe.vs200000.model')
bpemb.emb = load_word2vec_file('/some/folder/my_byte_pair_emb.w2v.bin')
but I still haven't reordered the vectors. Could you give me some insight into how to do this?
Best regards
Assuming you have a SentencePiece .vocab file for your model, let's first write a helper function for loading this:
def get_vocab(vocab_file, vocab_size=None):
    with vocab_file.open(encoding="utf8") as f:
        # read lines, ignoring fun characters such as 'LINE SEPARATOR' (U+2028)
        # which Python treats as line breaks when reading files
        # with the usual 'for line in f' pattern
        vocab_lines = f.read().split("\n")[:-1]
    if vocab_size is not None:
        assert len(vocab_lines) == vocab_size
    vocab, ranks = zip(*map(lambda l: l.split("\t"), vocab_lines))
    return vocab
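A quick usage sketch (hypothetical path, following the model_prefix naming from the recipe above; note that get_vocab expects a pathlib.Path, since it calls vocab_file.open):

from pathlib import Path

vocab = get_vocab(Path("corpus_preproc.txt.bpe.100000.vocab"), 100000)
# with default SentencePiece settings the first entries are <unk>, <s>, </s>
print(vocab[:5])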
Now the function for converting from GloVe order embeddings to the proper order:
from gensim.models import keyedvectors
from dougu import to_from_idx # https://github.com/bheinzerling/dougu/blob/d90e6c0ba92e61378c3c03df78ce5ba020f65ff8/dougu/iters.py#L70
import numpy as np

def convert_emb(glove_order_vocab_file, glove_order_emb_file):
    # the SentencePiece .vocab file (tab-separated) defines the proper order
    glove_order_vocab = get_vocab(glove_order_vocab_file)
    piece2id, id2piece = to_from_idx(glove_order_vocab)
    glove_order_emb = keyedvectors.KeyedVectors.load_word2vec_format(glove_order_emb_file)
    v = glove_order_emb.vectors
    # sample embeddings for symbols that didn't occur in the training
    # data from a normal distribution with the same mean and variance
    new_v = v.std() * np.random.randn(len(glove_order_vocab), v.shape[1]) + v.mean()
    new_vocab = {}
    # go through all entries (pieces) in the vocabulary with their corresponding id
    for id, piece in id2piece.items():
        try:
            new_v[id] = glove_order_emb[str(id)]  # str(id) assumes you used '--output_format=id', as described here https://github.com/bheinzerling/bpemb/issues/5#issuecomment-481616023
        except KeyError:
            pass
        # gensim sorts embeddings by -count when saving
        # set count to -id to preserve SentencePiece order
        assert piece not in new_vocab
        new_vocab[piece] = keyedvectors.Vocab(count=-id, index=id)
    # build a new KeyedVectors instance holding the reordered embeddings
    proper_order_emb = keyedvectors.KeyedVectors(v.shape[1])
    proper_order_emb.index2word = [id2piece[i] for i in range(len(id2piece))]
    proper_order_emb.vocab = new_vocab
    proper_order_emb.vectors = new_v
    return proper_order_emb
Copied this together from my actual scripts, let me know if this works for you.
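A note on to_from_idx: it comes from the author's dougu utility library linked in the import. If you don't want that dependency, a minimal stand-in consistent with how it is used above could look like this (an assumption based on its usage here, not the actual dougu implementation), followed by a hypothetical end-to-end call using the file names from the recipe earlier in the thread:

from pathlib import Path

def to_from_idx(items):
    # piece -> id and id -> piece mappings, in SentencePiece vocabulary order
    piece2id = {piece: i for i, piece in enumerate(items)}
    id2piece = dict(enumerate(items))
    return piece2id, id2piece

# SentencePiece .vocab file and the GloVe-format embeddings trained on the ID-encoded corpus (hypothetical paths)
emb = convert_emb(Path("corpus_preproc.txt.bpe.100000.vocab"),
                  "corpus_preproc.txt.bpe.100000.d100.glove.txt")
# save in word2vec format so it can be loaded as shown above (keyedvectors.Vocab requires gensim 3.x)
emb.save_word2vec_format("my_byte_pair_emb.w2v.bin", binary=True)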
@bheinzerling It would be awesome if the training routine could be added here (I'm currently training BPEmb models for historical texts).
Currently, I'm using the default parameters from the GloVe demo script (I only adjusted the dimension size to 300) 🤗
@stefan-it The main difference from the demo script is setting VOCAB_MIN_COUNT=0, which creates embeddings for all byte-pair symbols, not just the frequent ones.
#! /usr/bin/env bash
set -eou pipefail
# set this to something else if you want to keep GloVe co-occurrence files permanently,
# say, to create embeddings of the same corpus with different dimensions
TMP=/tmp
mkdir -p $TMP
# need to set this
BUILDDIR=/SET/THIS/TO/PATH/OF/glove/build
# set this to something appropriate for your system
NUM_THREADS=24
# path of single plain text file containing the byte-pair encoded corpus
CORPUS=$1
# where the GloVe files should be saved
OUT=$2
# GloVe embedding dim
VECTOR_SIZE=$3
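# i.e. this script is invoked as: <script> CORPUS OUT VECTOR_SIZE,
# matching the "$glove_dir/run.sh $corpus_encoded $emb_out $emb_dim" call earlier in this thread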
FNAME=$(echo $CORPUS | sed "s#/#_#g")
SAVE_FILE=$OUT.glove
VERBOSE=2
MEMORY=64.0
# we want embeddings for *all* BPE symbols
VOCAB_MIN_COUNT=0
MAX_ITER=50
WINDOW_SIZE=15
BINARY=0
X_MAX=10
# this part is probably not necessary unless you create lots of embeddings
VOCAB_FILE=$TMP/$FNAME.vocab.txt
COOCCURRENCE_FILE=$TMP/$FNAME.cooccurrence.bin
COOCCURRENCE_SHUF_FILE=$TMP/$FNAME.cooccurrence.shuf.bin
# random filenames for overflow and tempshuf files to prevent naming clashes
OVERFLOW=$TMP/${FNAME}.overflow_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
TEMPSHUF=$TMP/${FNAME}.tempshuf_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
# create vocab and cooccurrence files only once
if [ ! -f $VOCAB_FILE ]; then
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
fi
if [ ! -f $COOCCURRENCE_FILE ]; then
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE -overflow-file $OVERFLOW < $CORPUS > $COOCCURRENCE_FILE
if [ -f $OVERFLOW ]; then
rm $OVERFLOW
fi
fi
if [ ! -f $COOCCURRENCE_SHUF_FILE ]; then
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
if [ -f $TEMPSHUF ]; then
rm $TEMPSHUF
fi
fi
# print the command we're running
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03"
# the actual command
# GloVe will cause a segmentation fault for some combinations of large vocabulary sizes and large vector sizes.
# In those cases, changing alpha and eta slightly fixes the problem ¯\_(ツ)_/¯
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03
# delete the <unk> embedding, assumes that <unk> doesn't occur as part of some BPE symbol
sed -i "/<unk>/d" ${SAVE_FILE}.txt
For those interested: I created a Python script that trains a SentencePiece model on a training corpus, then segments the corpus with it and trains BPE embeddings. The end result is an embedding space that is aligned with the SentencePiece model. It doesn't use GloVe, though.
See here: https://github.com/stephantul/piecelearn
@bheinzerling I want to use BPEmb, but in your training script you used SentencePiece for training and encoding. How can I use a BPEmb model for data preprocessing?
@bheinzerling Could you provide the training script? I want to train on my own data.