bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

adding special tokens to a BPEmb model #53

Closed · tannonk closed this issue 3 years ago

tannonk commented 3 years ago

Hi,

Thanks for this excellent resource! I've been using BPEmb embeddings in my models since learning about them recently and have found them to work quite well. I'm currently trying to figure out how to use them most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc.

This might be an obvious question, but can you think of a way to extend the vocabulary by adding special tokens to a pre-trained sentencepiece model, or is this out of the question? If so, perhaps future iterations of BPEmb could reserve a certain number of slots for arbitrary special tokens.

Thanks in advance!

bheinzerling commented 3 years ago

Hi,

Unfortunately, adding special tokens to a pretrained sentencepiece model isn't supported by sentencepiece. It's possible to specify user-defined symbols (https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md), but that requires training a sentencepiece model and corresponding embeddings from scratch. This repository shows how to train your own models and embeddings: https://github.com/stephantul/piecelearn
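For reference, training from scratch with user-defined symbols would look roughly like the sketch below; the corpus path, model prefix, vocab size, and symbol list are placeholders for your own setup.

import sentencepiece as spm

# train a new BPE model whose vocabulary reserves the given symbols;
# 'corpus.txt' and the vocab size are placeholders
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='custom.bpe.vs10000',
    vocab_size=10000,
    model_type='bpe',
    user_defined_symbols=['<name>', '<digit>'],
)

The embeddings would then have to be trained on text segmented by this new model, as done in the piecelearn repository linked above.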

tannonk commented 3 years ago

Ok, thanks for the clarification!

tannonk commented 3 years ago

Hey,

Sorry to bother you with this once again, but I figured out how to extend a pretrained sentencepiece model with user-defined symbols, following the hints at https://github.com/google/sentencepiece/issues/426.

e.g.

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

model_file = 'en.wiki.bpe.vs10000.model'
symbols = ['<endtitle>', '<name>', '<url>', '<digit>', '<email>', '<loc>', '<greeting>', '<salutation>', '[CLS]', '[SEP]']

# load the pretrained model into a ModelProto protobuf message
mp = sp_pb2_model.ModelProto()
with open(model_file, 'rb') as f:
    mp.ParseFromString(f.read())

print(f'Original model pieces: {len(mp.pieces)}')

for i, sym in enumerate(symbols, 1):
    new_sym = mp.SentencePiece()
    new_sym.piece = sym
    new_sym.score = 0.0  # default score for USER_DEFINED
    new_sym.type = 4  # type value for USER_DEFINED
    mp.pieces.insert(2 + i, new_sym)  # position after the default control symbols ("<unk>", "<s>", "</s>")
    print(f'added {new_sym}...')

print(f'New model pieces: {len(mp.pieces)}')

outfile = 'en.ext.wiki.bpe.vs10000.model'

with open(outfile, 'wb') as f:
    f.write(mp.SerializeToString())

The extended sentencepiece model encodes a string containing the special tokens as expected:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='en.ext.wiki.bpe.vs10000.model')
sp.encode_as_pieces('[CLS] this is a test <name> <digit> .')
>>> ['▁', '[CLS]', '▁this', '▁is', '▁a', '▁test', '▁', '<name>', '▁', '<digit>', '▁.']
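As a quick sanity check, the new symbols (using the symbols list from above) should map to ids directly after the default control symbols:

for sym in symbols:
    print(sym, sp.piece_to_id(sym))
# given the insertion positions above, '<endtitle>' should come out
# as id 3, '<name>' as 4, and so on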

Depending on how the embedding models are saved, it would then be possible to load a BPEmb model with Gensim and continue training on a small in-domain corpus. But from what I can tell, the models available at https://nlp.h-its.org/bpemb/#download do not support continued training. Are the full models available anywhere?
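In the meantime, one way to pair the extended tokenizer with the published vectors would be to load them as KeyedVectors and give the new symbols randomly initialized rows; this is only a sketch assuming gensim 4.x, and the new vectors would still need fine-tuning to be useful.

import numpy as np
from gensim.models import KeyedVectors

# load the pretrained BPEmb vectors from the download page
kv = KeyedVectors.load_word2vec_format('en.wiki.bpe.vs10000.d100.w2v.bin', binary=True)

# add a randomly initialized vector for each new user-defined symbol
new_vecs = np.random.normal(scale=0.1, size=(len(symbols), kv.vector_size))
kv.add_vectors(symbols, new_vecs)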

I just came across #36; I suppose that answers it!

Thanks!

bheinzerling commented 3 years ago

Thanks for posting this, good to know!

leminhyen2 commented 3 years ago

@tannonk What is the mp module that you use here to call mp.ParseFromString and mp.SentencePiece()?

tannonk commented 3 years ago

@leminhyen2, mp is an instance of SentencePiece's ModelProto protobuf message, created from sentencepiece_model_pb2 as shown at the top of the snippet. Check out this post from the original author, which uses it simply as m: https://github.com/google/sentencepiece/issues/121#issuecomment-400362011

leminhyen2 commented 3 years ago

@tannonk Thank you, that was very helpful

heyuanYao-pku commented 4 months ago

@tannonk Saved my day, really helpful! I finally found that the type field must be set; otherwise the tokenizer always ignores the new symbol.