Separius / BERT-keras

Keras implementation of BERT with pre-trained weights
GNU General Public License v3.0

unknown field name "training_sentence_size" in TrainerSpec. #22

Closed. ShabRa1365 closed this issue 5 years ago.

ShabRa1365 commented 5 years ago

Hi, I'm just running the first cell of tutorial.ipynb:

from data.vocab import SentencePieceTextEncoder  # you could also import OpenAITextEncoder

sentence_piece_encoder = SentencePieceTextEncoder(text_corpus_address='/openai/model/params_shapes.json', model_name='tutorial', vocab_size=20)

and I'm getting this error from vocab.py:

File "/Users/shabnamrashtchi/Dropbox/Deep leanring 2019_2020 reserch/Embedinng/BERT-keras-2/data/vocab.py", line 69, in init model_type=spm_model_type.lower()))

OSError: Not found: unknown field name "training_sentence_size" in TrainerSpec.

Separius commented 5 years ago

duplicate of #18

ShabRa1365 commented 5 years ago

I reviewed #18 and removed training_sentence_size from vocab.py, which now looks like this:

class SentencePieceTextEncoder(TextEncoder):
    def __init__(self, text_corpus_address: Optional[str], model_name: str = 'spm',
                 vocab_size: int = 30000, spm_model_type: str = 'unigram') -> None:
        super().__init__(vocab_size)
        if not os.path.exists('{}.model'.format(model_name)):
            if spm_model_type.lower() not in ('unigram', 'bpe', 'char', 'word'):
                raise ValueError(
                    '{} is not a valid model_type for sentence piece, '
                    'valid options are: unigram, bpe, char, word'.format(spm_model_type))
            spm.SentencePieceTrainer.Train(
                '--input={input} --model_prefix={model_name} --vocab_size={vocab_size} '
                '--character_coverage={coverage} --model_type={model_type} '
                '--pad_id=-1 --unk_id=0 --bos_id=-1 --eos_id=-1 --input_sentence_size=100000000 '
            )
        self.sp = spm.SentencePieceProcessor()
        self.sp.load('{}.model'.format(model_name))

    def encode(self, sent: str) -> List[int]:
        return self.sp.encode_as_ids(sent)

I get the following error when running the tutorial:

File "/Users/.../BERT-keras-2/data/vocab.py", line 81, in init '--input={input} --model_prefix={model_name} --vocab_size={vocab_size} '

File "", line unknown SyntaxError: Invalid argument: cannot parse "{vocab_size}" as int32.

Separius commented 5 years ago

try this:

spm.SentencePieceTrainer.Train(
            '--input={input} --model_prefix={model_name} --vocab_size={vocab_size} '
            '--character_coverage={coverage} --model_type={model_type} '
            '--pad_id=-1 --unk_id=0 --bos_id=-1 --eos_id=-1'.format(
                input=text_corpus_address, model_name=model_name,
                vocab_size=vocab_size, coverage=1, model_type=spm_model_type.lower()))
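
The difference from the snippet above is that the trailing .format(...) call and its keyword arguments are kept, so the {input}, {vocab_size}, and other placeholders are substituted with real values before sentencepiece parses the flag string; passing the literal text "{vocab_size}" is what caused the "cannot parse as int32" error. As a rough usage sketch once that change is in place (corpus.txt here is just a hypothetical plain-text corpus with one sentence per line, and the tiny vocab_size is only meant for a toy example):

from data.vocab import SentencePieceTextEncoder

# trains the SentencePiece model (or loads it if tutorial.model already exists)
sentence_piece_encoder = SentencePieceTextEncoder(text_corpus_address='corpus.txt', model_name='tutorial', vocab_size=20)

# encode a sentence as subword ids
print(sentence_piece_encoder.encode('this is a test sentence'))
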
ShabRa1365 commented 5 years ago

Thanks a lot, it resolved the issue.