Loading BERT external vocab file error

Description

When calling get_model(name, **kwargs), we can specify either the --dataset_name, which will load the vocab for that dataset, or we can pass the --vocab for custom BERT vocabulary. An error occurs when calling a BERT model such as bert_12_768_12 with --vocab and without passing --dataset_name.

Error Message

Traceback (most recent call last):
  File "finetune_ner.py", line 260, in <module>
    main(parse_args())
  File "finetune_ner.py", line 121, in main
    bert_model.load_parameters(config.parameters, ctx=ctx, ignore_extra=True)
  File "/env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet/gluon/block.py", line 410, in load_parameters
    params[name]._load_init(loaded[name], ctx, cast_dtype=cast_dtype, dtype_source=dtype_source)
  File "/env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet/gluon/parameter.py", line 279, in _load_init
    self.name, str(self.shape), str(data.shape))
AssertionError: Failed loading Parameter 'bertmodel0_word_embed_embedding0_weight' from saved params: shape incompatible expected (98, 768) vs saved (28996, 768)

Note that I did make minor modification to finetune_ner.py script, but that is irrelevant. The correct vocabulary size in this example is 28996 and the 98 is the length of the filename passed in the vocab argument in the function _load_vocab(dataset_name, vocab, root, cls=None)

To Reproduce

To reproduce, you need to convert a BERT to Gluon. In my case I convereted Clinical BERT TensorFlow model to Gluon using convert_tf_model.py. Then try loading a model using get_model(name, **kwargs). See instructions below.

Steps to reproduce

Download Clinical BERT. See instructions here
Convert any of the Clinical BERT models to Gluon using convert_tf_model.py
Call get_model(name, **kwargs) without specifying --dataset_name and pass the converted vocabulary using --vocab argument.

    params = {
        'dataset_name': None,
        'vocab': 'path/to/filename.vocab',
        'pretrained': False,
        'ctx': ctx,
        'use_pooler': False,
        'use_decoder': False,
        'use_classifier': False,
        'dropout': 0.1,
        'embed_dropout': 0.1
    }

    bert_model, text_vocab = gluonnlp.model.get_model('bert_12_768_12', **params)

What have you tried to solve it?

_load_vocab(dataset_name, vocab, root, cls=None) returns the vocab variable (str) if --dataset_name is not set, hence the vocab length 98 you see in the error message above. To resolve the issue, I modified the function _load_vocab(dataset_name, vocab, root, cls=None) and added the following lines after line 273

with open(vocab, 'r') as fh:
    vocab = gluonnlp.Vocab().from_json(fh.read())
    return vocab

Environment

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                64
    On-line CPU(s) list:   0-63
    Thread(s) per core:    2
    Core(s) per socket:    16
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 79
    Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    Stepping:              1
    CPU MHz:               2089.316
    CPU max MHz:           3000.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              4600.15
    Hypervisor vendor:     Xen
    Virtualization type:   full
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              46080K
    NUMA node0 CPU(s):     0-15,32-47
    NUMA node1 CPU(s):     16-31,48-63
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
    ----------Python Info----------
    Version      : 3.5.2
    Compiler     : GCC 5.4.0 20160609
    Build        : ('default', 'Nov 12 2018 13:43:14')
    Arch         : ('64bit', 'ELF')
    ------------Pip Info-----------
    Version      : 19.2.2
    Directory    : /env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/pip
    ----------MXNet Info-----------
    Version      : 1.5.0
    Directory    : /env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet
    Num GPUs     : 8
    Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
    ----------System Info----------
    Platform     : Linux-4.4.0-1090-aws-x86_64-with-Ubuntu-16.04-xenial
    system       : Linux
    node         : ip-172-31-30-122
    release      : 4.4.0-1090-aws
    version      : #101-Ubuntu SMP Fri Aug 2 15:21:01 UTC 2019
    ----------Hardware Info----------
    machine      : x86_64
    processor    : x86_64
    ----------Network Test----------
    Setting timeout: 10
    Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.5163 sec.
    Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0498 sec.
    Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0004 sec, LOAD: 0.0210 sec.
    Timing for D2L: http://d2l.ai, DNS: 0.0004 sec, LOAD: 0.0183 sec.
    Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0003 sec, LOAD: 0.0373 sec.
    Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0003 sec, LOAD: 0.1360 sec.
    Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0003 sec, LOAD: 0.0211 sec.
    Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0003 sec, LOAD: 0.3963 sec.

dmlc / gluon-nlp