bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Training script #5

Open lparam opened 6 years ago

lparam commented 6 years ago

@bheinzerling Could you provide the training script? I want to train on my own data.

alejandrojcastaneira commented 5 years ago

Hello. Thanks again for the great work! I'm also interested in training BPEmb embeddings on my own data. Is there a way, or an example, of how to do this?

Best Regards

Danil328 commented 5 years ago

+1

bheinzerling commented 5 years ago

Most of my original training script deals with training many different embeddings for all languages on a cluster (not sure how much sense it makes to share this), but the basic procedure is quite simple:

  1. Preprocess corpus.
  2. Learn BPE model on corpus, using SentencePiece.
  3. Encode corpus with BPE model, again using SentencePiece.
  4. Learn embeddings on encoded corpus, using GloVe.
sentencepiece_dir=/install/sentencepiece/and/set/this/path
glove_dir=/install/glove/and/set/this/path

corpus=corpus.txt
corpus_preproc=corpus_preproc.txt
vocab_size=100000
emb_dim=100
model_type=bpe
model_prefix=${corpus_preproc}.${model_type}.${vocab_size}
emb_out=${model_prefix}.d${emb_dim}

# preprocessing
# you probably want to lowercase everything and replace all digits with 0
# the preprocessing I used is quite specific to Wikipedia; depending on your corpus you can do something much simpler

# remove Wikipedia section header (===) and article title (''') markers, do a crude sentence split on double spaces, and strip leading whitespace
sed "s/===\+/\n/g;s/'''//g;s/  /\n/g" $corpus | perl -C -pe 's/\x{200B}|\x{200C}|\x{200D}|\x{200E}|\x{202C}|\x{96}//g' | tr -s [[:blank:]] " " | sed -re 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g;s#(https?://[^">< ]+)#🔗#g;s/[0-9]/0/g;s/^ \+//'  | grep ".\{100\}" | sed "s/^ //" > $corpus_preproc

# train SentencePiece model
$sentencepiece_dir/bin/spm_train --split_by_whitespace true --input $corpus_preproc --model_prefix $model_prefix --vocab_size $vocab_size --model_type $model_type

# encode preprocessed corpus with the trained SentencePiece model
model_file=${model_prefix}.model
corpus_encoded=corpus_encoded.txt
# encoding to numerical IDs (--output_format=id) saves you headaches if your corpus contains
# weird whitespace characters that might get treated differently by SentencePiece and GloVe.
# You can leave this out if your corpus is quite clean.
cat $corpus_preproc | $sentencepiece_dir/bin/spm_encode --model $model_file --output $corpus_encoded --extra_options=bos:eos # --output_format=id

# train BPE embeddings with GloVe
$glove_dir/run.sh $corpus_encoded $emb_out $emb_dim

This will give you BPE embeddings in GloVe format in ${emb_out}.glove.txt.

I copy&pasted this from my actual scripts, let me know if this works for you.

Finally, the embeddings in GloVe format are in a different order than the subwords in the BPE vocabulary, so the last step is to reorder them. If the above works for you, I can think of a way to properly add this to the repo (not just as a comment) and maybe turn it into a push-button solution.

Danil328 commented 5 years ago

Thank you very much!

alejandrojcastaneira commented 5 years ago

Hello, I managed to train my own embeddings with GloVe based on your SentencePiece model. I then tried to load them into BPEmb as you described in #23, using:

from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file

bpemb = BPEmb(lang='en')
bpemb.spm = sentencepiece_load('/some/folder/en.wiki.bpe.vs200000.model')
bpemb.emb = load_word2vec_file('/some/folder/my_byte_pair_emb.w2v.bin')

but I still haven't reordered the vectors. Could you give me some insight on this?

Best regards

bheinzerling commented 5 years ago

Assuming you have a SentencePiece .vocab file for your model, let's first write a helper function for loading this:

def get_vocab(vocab_file, vocab_size=None):
    # vocab_file: pathlib.Path of the SentencePiece .vocab file
    with vocab_file.open(encoding="utf8") as f:
        # read lines manually, ignoring fun characters such as 'LINE SEPARATOR' (U+2028)
        # which Python treats as line breaks when reading files
        # with the usual 'for line in f' pattern
        vocab_lines = f.read().split("\n")[:-1]
    if vocab_size is not None:
        assert len(vocab_lines) == vocab_size
    # each line is "<piece>\t<score>"
    vocab, ranks = zip(*map(lambda l: l.split("\t"), vocab_lines))
    return vocab
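
For illustration, a hypothetical call; the file name and vocabulary size are placeholders matching the training commands above (spm_train writes <model_prefix>.vocab next to <model_prefix>.model):

from pathlib import Path

vocab = get_vocab(Path("corpus_preproc.txt.bpe.100000.vocab"), vocab_size=100000)
# the first entries are typically the SentencePiece control symbols, e.g. <unk>, <s>, </s>
print(vocab[:5])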

Now the function for converting the embeddings from GloVe order to the proper SentencePiece order:

from gensim.models import keyedvectors
from dougu import to_from_idx  # https://github.com/bheinzerling/dougu/blob/d90e6c0ba92e61378c3c03df78ce5ba020f65ff8/dougu/iters.py#L70
import numpy as np

def convert_emb(glove_order_vocab_file, glove_order_emb_file):
    glove_order_vocab = get_vocab(glove_order_vocab_file)
    piece2id, id2piece = to_from_idx(glove_order_vocab)
    glove_order_emb = keyedvectors.KeyedVectors.load_word2vec_format(glove_order_emb_file)
    v = glove_order_emb.vectors
    # sample embeddings for symbols that didn't occur in the training
    # data from a normal distribution with the same mean and variance
    new_v = v.std() * np.random.randn(len(glove_order_vocab), v.shape[1]) + v.mean()
    new_vocab = {}
    # go through all entries (pieces) in the vocabulary with their corresponding id
    for id, piece in id2piece.items():
        try:
            new_v[id] = glove_order_emb[str(id)]  # str(id) assumes you used '--output_format=id', as described here https://github.com/bheinzerling/bpemb/issues/5#issuecomment-481616023
        except KeyError:
            # piece did not occur in the training data, keep the sampled embedding
            pass
        # gensim sorts embeddings by -count when saving
        # set count to -id to preserve sentencepiece order
        assert piece not in new_vocab
        new_vocab[piece] = keyedvectors.Vocab(count=-id, index=id)

    # reuse the loaded KeyedVectors object and overwrite its contents
    proper_order_emb = glove_order_emb
    proper_order_emb.index2word = [id2piece[i] for i in range(len(id2piece))]
    proper_order_emb.vocab = new_vocab
    proper_order_emb.vectors = new_v
    return proper_order_emb
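
Note that this targets the gensim 3.x KeyedVectors API (keyedvectors.Vocab was removed in gensim 4). A hypothetical usage sketch with placeholder file names, saving the reordered embeddings so they can be loaded with load_word2vec_file as in the snippet above:

from pathlib import Path

# placeholder file names: the .vocab file comes from spm_train,
# the .glove.txt file from the GloVe step (trained with -write-header 1)
emb = convert_emb(Path("corpus_preproc.txt.bpe.100000.vocab"),
                  "corpus_preproc.txt.bpe.100000.d100.glove.txt")
emb.save_word2vec_format("my_byte_pair_emb.w2v.bin", binary=True)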

Copied this together from my actual scripts, let me know if this works for you.

stefan-it commented 5 years ago

@bheinzerling It would be awesome if the training routine could be added here (I'm currently training BPEmb models for historic texts).

Currently, I'm using the default parameters from the GloVe demo script (I only adjusted the dimension size to 300) 🤗

bheinzerling commented 5 years ago

@stefan-it The main difference to the demo script is setting VOCAB_MIN_COUNT=0 which creates embeddings for all byte-pair symbols, not just frequent ones.

#! /usr/bin/env bash
set -eou pipefail

# set this to something else if you want to keep GloVe co-occurrence files permanently,
# say, to create embeddings of the same corpus with different dimensions
TMP=/tmp
mkdir -p $TMP

# need to set this
BUILDDIR=/SET/THIS/TO/PATH/OF/glove/build

# set this to something appropriate for your system
NUM_THREADS=24

# path of single plain text file containing the byte-pair encoded corpus
CORPUS=$1
# where the GloVe files should be saved
OUT=$2
# GloVe embedding dim
VECTOR_SIZE=$3

FNAME=$(echo $CORPUS | sed "s#/#_#g")
SAVE_FILE=$OUT.glove
VERBOSE=2
MEMORY=64.0

# we want embeddings for *all* BPE symbols
VOCAB_MIN_COUNT=0

MAX_ITER=50
WINDOW_SIZE=15
BINARY=0
X_MAX=10

# this part is probably not necessary unless you create lots of embeddings
VOCAB_FILE=$TMP/$FNAME.vocab.txt
COOCCURRENCE_FILE=$TMP/$FNAME.cooccurrence.bin
COOCCURRENCE_SHUF_FILE=$TMP/$FNAME.cooccurrence.shuf.bin
# random filenames for overflow and tempshuf files to prevent naming clashes
OVERFLOW=$TMP/${FNAME}.overflow_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
TEMPSHUF=$TMP/${FNAME}.tempshuf_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
# create vocab and cooccurrence files only once
if [ ! -f $VOCAB_FILE ]; then
    echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
    $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
fi
if [ ! -f $COOCCURRENCE_FILE ]; then
    echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
    $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE -overflow-file $OVERFLOW < $CORPUS > $COOCCURRENCE_FILE
    if [ -f $OVERFLOW ]; then
        rm $OVERFLOW
    fi
fi
if [ ! -f $COOCCURRENCE_SHUF_FILE ]; then
    echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
    if [ -f $TEMPSHUF ]; then
        rm $TEMPSHUF
    fi
fi

# print the command we're running
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03"

# the actual command
# GloVe will cause a segmentation fault for some combinations of large vocabulary sizes and large vector sizes.
# In those cases, changing alpha and eta slightly fixes the problem ¯\_(ツ)_/¯
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03

# delete the <unk> embedding, assumes that <unk> doesn't occur as part of some BPE symbol
sed -i "/<unk>/d" ${SAVE_FILE}.txt

stephantul commented 4 years ago

For those interested: I created a Python script that trains a SentencePiece model on a training corpus, then segments the corpus with it, and trains BPE embeddings on the result. The end result is an embedding space that is aligned with the SentencePiece model. It doesn't use GloVe, though.

See here: https://github.com/stephantul/piecelearn
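
For readers who want a similar GloVe-free setup, here is a rough, hypothetical sketch of that kind of pipeline using the sentencepiece Python package and gensim's Word2Vec; all file names, the vocabulary size, and the dimensions are placeholders, and the actual piecelearn code may differ:

import sentencepiece as spm
from gensim.models import Word2Vec

# 1. learn a BPE model on the raw corpus (placeholder paths and sizes)
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=corpus.bpe.10000 "
    "--vocab_size=10000 --model_type=bpe"
)

# 2. segment the corpus into subword pieces with the trained model
sp = spm.SentencePieceProcessor()
sp.Load("corpus.bpe.10000.model")
with open("corpus.txt", encoding="utf8") as f:
    segmented = [sp.EncodeAsPieces(line.strip()) for line in f]

# 3. train subword embeddings on the segmented corpus
#    min_count=1 so that every BPE symbol gets a vector
#    (vector_size= is the gensim 4 name; gensim 3.x calls it size=)
model = Word2Vec(sentences=segmented, vector_size=100, window=5, min_count=1, sg=1)
model.wv.save_word2vec_format("corpus.bpe.10000.d100.w2v.bin", binary=True)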

shantanu778 commented 4 years ago

@bheinzerling I want to use BPEmb, but in your training script you used SentencePiece for training and encoding. How can I use the BPEmb model for data preprocessing?
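
For reference, the pretrained models can be used for subword preprocessing without running any of the training steps above; a minimal sketch, assuming the encode/encode_ids/embed methods and the vs/dim arguments documented in the bpemb README:

from bpemb import BPEmb

# download/load the pretrained English BPE model and embeddings
# (200k subword vocabulary, 100-dimensional vectors)
bpemb_en = BPEmb(lang="en", vs=200000, dim=100)

pieces = bpemb_en.encode("Stratford is a town")    # subword pieces
ids = bpemb_en.encode_ids("Stratford is a town")   # corresponding vocabulary ids
vectors = bpemb_en.embed("Stratford is a town")    # one embedding per subword piece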