A minor bug about tokenizer name filter

NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

Apache License 2.0


A minor bug about tokenizer name filter #1004

Closed: chychen closed this issue 3 years ago

chychen commented 4 years ago

https://github.com/NVIDIA/NeMo/blob/6452ae3b51b969e6b778947ddaacb7c91d2780f7/nemo/collections/nlp/data/tokenizers/bert_tokenizer.py#L77

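There are many model name formats here, and I don't think the filter above is general enough to cover every format on https://huggingface.co/models. For example, hfl/chinese-bert-wwm-ext and hfl/chinese-bert-wwm both fail to fit the .split('-')[0] rule, because the text before the first hyphen is hfl/chinese rather than a model type such as bert.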
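To make the mismatch concrete, here is a minimal sketch. It assumes the filter works as described in this issue (taking the text before the first hyphen) and does not reproduce the actual NeMo code. The names prefix_rule, guess_model_type and KNOWN_MODEL_TYPES are hypothetical and exist only for illustration; the keyword lookup is one possible alternative, not the fix that was eventually applied.

```python
from typing import Optional

# Illustrative list of model-type keywords (an assumption, not NeMo's list).
KNOWN_MODEL_TYPES = ("distilbert", "roberta", "albert", "bert")


def prefix_rule(pretrained_model_name: str) -> str:
    """The rule quoted in this issue: keep the text before the first hyphen."""
    return pretrained_model_name.split("-")[0]


def guess_model_type(pretrained_model_name: str) -> Optional[str]:
    """Hypothetical alternative: look for a known keyword among the name's tokens."""
    # Drop an optional "organization/" namespace, then split on hyphens.
    base_name = pretrained_model_name.split("/")[-1].lower()
    tokens = base_name.split("-")
    for model_type in KNOWN_MODEL_TYPES:
        if model_type in tokens:
            return model_type
    return None  # No known keyword found in the name.


for name in ("bert-base-uncased", "hfl/chinese-bert-wwm", "hfl/chinese-bert-wwm-ext"):
    print(name, "->", prefix_rule(name), "|", guess_model_type(name))
```

With the prefix rule, both hfl names yield hfl/chinese, while the token lookup still finds bert. A keyword lookup has its own limits (any name without a listed keyword returns None), so it is only one way to loosen the filter.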

chychen commented 3 years ago

It seems this has been fixed.