Closed tannonk closed 3 years ago
Hi,
Unfortunately, adding special tokens to a pretrained sentencepiece model isn't supported by sentencepiece. It's possible to specify user-defined symbols (https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md), but that requires training a sentencepiece model and corresponding embeddings from scratch. This repository shows how to train your own models and embeddings: https://github.com/stephantul/piecelearn
Ok, thanks for the clarification!
Hey,
Sorry to bother you with this once again, but I figured out how to extend a pretrained sentencepiece model with user-defined symbols following the hints at https://github.com/google/sentencepiece/issues/426.
e.g.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

model_file = 'en.wiki.bpe.vs10000.model'
symbols = ['&lt;endtitle&gt;', '<name>', '<url>', '<digit>', '<email>', '<loc>', '<greeting>', '<salutation>', '[CLS]', '[SEP]']

mp = sp_pb2.ModelProto()
mp.ParseFromString(open(model_file, 'rb').read())
print(f'Original model pieces: {len(mp.pieces)}')

for i, sym in enumerate(symbols, 1):
    new_sym = mp.SentencePiece()
    new_sym.piece = sym
    new_sym.score = 0.0  # default score for USER_DEFINED
    new_sym.type = 4     # type value for USER_DEFINED
    mp.pieces.insert(2 + i, new_sym)  # position after default control symbols ("<unk>", "<s>", "</s>")
    print(f'added {new_sym}...')

print(f'New model pieces: {len(mp.pieces)}')

outfile = 'en.ext.wiki.bpe.vs10000.model'
with open(outfile, 'wb') as f:
    f.write(mp.SerializeToString())
The newly extended sentencepiece model encodes a string containing special tokens as expected:
sp = spm.SentencePieceProcessor(model_file='en.ext.wiki.bpe.vs10000.model')
sp.encode_as_pieces('[CLS] this is a test <name> <digit> .')
>>> ['▁', '[CLS]', '▁this', '▁is', '▁a', '▁test', '▁', '<name>', '▁', '<digit>', '▁.']
Depending on how the embedding models are saved, it would then be possible to load a BPEmb model with Gensim and continue training on a small in-domain corpus. But from what I can tell, the models available at https://nlp.h-its.org/bpemb/#download do not support continued training. Are the full models available anywhere?
I just came across #36, I suppose that answers it!
Thanks!
Thanks for posting this, good to know!
@tannonk What is the mp module that you use here to call mp.ParseFromString and mp.SentencePiece()?
@leminhyen2, good point. The mp here is SentencePiece's ModelProto() object. Check out this post from the original author, which includes it simply as m: https://github.com/google/sentencepiece/issues/121#issuecomment-400362011
@tannonk Thank you, that was very helpful
@tannonk Saved my day, really helpful! I finally found that the type must be set; otherwise the tokenizer will always ignore my symbol.
Hi,
Thanks for this excellent resource! I've been using BPEmbs in my models since learning about them recently and have found them to work quite well. I'm currently trying to figure out how to use them most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc. This might be an obvious question, but can you think of a way to extend the vocabulary by adding special tokens to a pre-trained sentencepiece model, or is this out of the question? If so, perhaps it would be possible to allow for a certain number of arbitrary special tokens in future iterations of BPEmbs.
Thanks in advance!