NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Can't import BERT model for NLP. #573

Closed: dmitrytyrin closed this issue 4 years ago

dmitrytyrin commented 4 years ago

Dear team, I'm trying to predict punctuation following this tutorial: https://nvidia.github.io/NeMo/nlp/punctuation.html. I can't create a tokenizer from the pretrained "bert-base-multilingual-uncased" model:

```python
tokenizer = nemo.collections.nlp.data.NemoBertTokenizer(pretrained_model="bert-base-multilingual-uncased")
```

This gives me the error:

```
OSError: Model name 'bert-base-multilingual-uncased' was not found in tokenizers model name list
(bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased).
We assumed 'bert-base-multilingual-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
```

Note that 'bert-base-multilingual-uncased' does appear in that list, so I don't understand why the lookup fails.

I manually downloaded BERT from https://github.com/google-research/bert/blob/master/multilingual.md and tried passing the absolute path of the folder as pretrained_model_name (the folder "multilingual_L-12_H-768_A-12" contains config.json, model.data, model.index, model.meta and vocab.txt). The tokenizer then gives me:

```
ValueError: Bert_derivative value {bert_derivative} is not currently supported Please choose from the following list: {TOKENIZERS.keys()}
```

(The braces are verbatim; the placeholders apparently weren't substituted into the raised message.)

How can I solve my issue?
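
Aside: the checkpoint downloaded from the Google Research repo is in TensorFlow format, while the HuggingFace-style loading path that NeMo relies on expects PyTorch weights plus vocab.txt. A minimal conversion sketch, assuming a transformers 2.x-era API with TensorFlow installed; the paths follow the folder listing above and the output directory name is made up:

```python
# Sketch only: convert the Google Research TF checkpoint to a
# HuggingFace-style PyTorch checkpoint. Requires `transformers`
# (2.x-era API, hedged) and `tensorflow` to be installed.
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("multilingual_L-12_H-768_A-12/config.json")
model = BertForPreTraining(config)

# "model" is the checkpoint prefix shared by model.data / model.index / model.meta.
load_tf_weights_in_bert(model, config, "multilingual_L-12_H-768_A-12/model")

# Hypothetical output directory; copy vocab.txt next to the saved weights
# so that loading the tokenizer by path can find it.
model.save_pretrained("multilingual_L-12_H-768_A-12-pytorch")
```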

ekmb commented 4 years ago

@dmitrytyrin could you try the punctuation notebook and use `PRETRAINED_BERT_MODEL = "bert-base-multilingual-uncased"`?
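
For concreteness, a minimal sketch of what that suggestion amounts to, reusing the tokenizer call from this issue; the import alias is an assumption, and everything else in the notebook stays unchanged:

```python
import nemo.collections.nlp as nemo_nlp  # alias is an assumption

# Variable name from the punctuation notebook, set to the model from this thread.
PRETRAINED_BERT_MODEL = "bert-base-multilingual-uncased"

# Same tokenizer construction as in the report, driven by the notebook variable.
tokenizer = nemo_nlp.data.NemoBertTokenizer(pretrained_model=PRETRAINED_BERT_MODEL)
```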

dmitrytyrin commented 4 years ago

@ekmb, thank you, it works!

How can I run inference only, using the pretrained BERT? And how can I get punct_label_ids/capit_label_ids if I don't have a train_data_layer?

ekmb commented 4 years ago

You need to train the model before running inference: besides the pretrained BERT part, there are two token classification heads that have to be trained.
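
For context on the label-ids question above: punct_label_ids and capit_label_ids are the label-to-index maps that the training data layer builds from the training set. A hedged sketch of reconstructing them by hand, assuming the label conventions of the punctuation tutorial (the exact label sets are an assumption, and as noted above, the two classification heads still need trained weights):

```python
# Assumed label sets, matching the punctuation tutorial's conventions:
# 'O' = no punctuation / lower-case word, 'U' = capitalized word.
PUNCT_LABELS = ["O", ",", ".", "?"]
CAPIT_LABELS = ["O", "U"]

# Label-to-id maps in the shape the training data layer would produce.
punct_label_ids = {label: idx for idx, label in enumerate(PUNCT_LABELS)}
capit_label_ids = {label: idx for idx, label in enumerate(CAPIT_LABELS)}
```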