dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Loading BERT external vocab file error #902

Closed mohammedkhalilia closed 5 years ago

mohammedkhalilia commented 5 years ago

Description

When calling get_model(name, **kwargs), we can either specify --dataset_name, which loads the vocab for that dataset, or pass --vocab to supply a custom BERT vocabulary. An error occurs when loading the parameters of a BERT model such as bert_12_768_12 that was created with --vocab and without --dataset_name.

Error Message

Traceback (most recent call last):
  File "finetune_ner.py", line 260, in <module>
    main(parse_args())
  File "finetune_ner.py", line 121, in main
    bert_model.load_parameters(config.parameters, ctx=ctx, ignore_extra=True)
  File "/env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet/gluon/block.py", line 410, in load_parameters
    params[name]._load_init(loaded[name], ctx, cast_dtype=cast_dtype, dtype_source=dtype_source)
  File "/env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet/gluon/parameter.py", line 279, in _load_init
    self.name, str(self.shape), str(data.shape))
AssertionError: Failed loading Parameter 'bertmodel0_word_embed_embedding0_weight' from saved params: shape incompatible expected (98, 768) vs saved (28996, 768)

Note that I made minor modifications to the finetune_ner.py script, but they are irrelevant here. The correct vocabulary size in this example is 28996; the 98 is the length of the filename string passed as the vocab argument to the function _load_vocab(dataset_name, vocab, root, cls=None).
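To illustrate where a number like 98 can come from, here is a minimal sketch. It assumes the embedding size is derived by calling len() on whatever is passed as the vocabulary; embedding_rows is a hypothetical stand-in, not a gluonnlp function, and the path and token list are placeholders.

```python
# Hypothetical stand-in for how the embedding row count might be derived:
# it only calls len() on whatever it receives as the vocabulary.
def embedding_rows(vocab):
    return len(vocab)

vocab_path = "path/to/filename.vocab"  # placeholder path, a plain str
tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "patient"]  # toy vocabulary

rows_from_path = embedding_rows(vocab_path)  # counts characters in the string
rows_from_tokens = embedding_rows(tokens)    # counts vocabulary entries
print(rows_from_path, rows_from_tokens)
```

A str has a length too, so nothing fails at construction time; the mismatch only surfaces later, when the saved parameters are loaded.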

To Reproduce

To reproduce, you need to convert a BERT model to Gluon. In my case I converted the Clinical BERT TensorFlow model to Gluon using convert_tf_model.py. Then try loading a model using get_model(name, **kwargs). See the instructions below.

Steps to reproduce

  1. Download Clinical BERT. See instructions here
  2. Convert any of the Clinical BERT models to Gluon using convert_tf_model.py
  3. Call get_model(name, **kwargs) without specifying --dataset_name and pass the converted vocabulary using the --vocab argument.
    params = {
        'dataset_name': None,
        'vocab': 'path/to/filename.vocab',
        'pretrained': False,
        'ctx': ctx,
        'use_pooler': False,
        'use_decoder': False,
        'use_classifier': False,
        'dropout': 0.1,
        'embed_dropout': 0.1
    }

    bert_model, text_vocab = gluonnlp.model.get_model('bert_12_768_12', **params)

What have you tried to solve it?

_load_vocab(dataset_name, vocab, root, cls=None) returns the vocab variable (a str) unchanged if --dataset_name is not set, hence the vocab length of 98 you see in the error message above. To resolve the issue, I modified _load_vocab(dataset_name, vocab, root, cls=None) and added the following lines after line 273:

with open(vocab, 'r') as fh:
    vocab = gluonnlp.Vocab().from_json(fh.read())
    return vocab
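A slightly more defensive variant of the same idea could accept either a path or an already constructed vocabulary object. This is only a sketch: resolve_vocab and its from_json parameter are hypothetical names, and json.loads stands in for the real loader so the example runs without gluonnlp.

```python
import json
import os
import tempfile


def resolve_vocab(vocab, from_json):
    """Return a vocabulary object, loading it from disk if `vocab` is a path.

    `from_json` is any callable that turns a JSON string into a vocabulary.
    """
    if isinstance(vocab, str):
        # A str is treated as a path to a JSON-serialized vocabulary.
        with open(vocab, "r", encoding="utf-8") as fh:
            return from_json(fh.read())
    # Anything else is assumed to be a ready-to-use vocabulary object.
    return vocab


# Demo with a toy JSON file; json.loads is a placeholder loader.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.vocab")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"idx_to_token": ["[PAD]", "[UNK]"]}, fh)
    loaded = resolve_vocab(path, json.loads)

passthrough = resolve_vocab(["[PAD]"], json.loads)
print(loaded, passthrough)
```

In real use the loader would presumably be the library's own JSON constructor for vocabularies rather than json.loads.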

Environment

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                64
    On-line CPU(s) list:   0-63
    Thread(s) per core:    2
    Core(s) per socket:    16
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 79
    Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    Stepping:              1
    CPU MHz:               2089.316
    CPU max MHz:           3000.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              4600.15
    Hypervisor vendor:     Xen
    Virtualization type:   full
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              46080K
    NUMA node0 CPU(s):     0-15,32-47
    NUMA node1 CPU(s):     16-31,48-63
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
    ----------Python Info----------
    Version      : 3.5.2
    Compiler     : GCC 5.4.0 20160609
    Build        : ('default', 'Nov 12 2018 13:43:14')
    Arch         : ('64bit', 'ELF')
    ------------Pip Info-----------
    Version      : 19.2.2
    Directory    : /env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/pip
    ----------MXNet Info-----------
    Version      : 1.5.0
    Directory    : /env/mx_1.5_gnlp_0.8/lib/python3.5/site-packages/mxnet
    Num GPUs     : 8
    Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
    ----------System Info----------
    Platform     : Linux-4.4.0-1090-aws-x86_64-with-Ubuntu-16.04-xenial
    system       : Linux
    node         : ip-172-31-30-122
    release      : 4.4.0-1090-aws
    version      : #101-Ubuntu SMP Fri Aug 2 15:21:01 UTC 2019
    ----------Hardware Info----------
    machine      : x86_64
    processor    : x86_64
    ----------Network Test----------
    Setting timeout: 10
    Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.5163 sec.
    Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0498 sec.
    Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0004 sec, LOAD: 0.0210 sec.
    Timing for D2L: http://d2l.ai, DNS: 0.0004 sec, LOAD: 0.0183 sec.
    Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0003 sec, LOAD: 0.0373 sec.
    Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0003 sec, LOAD: 0.1360 sec.
    Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0003 sec, LOAD: 0.0211 sec.
    Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0003 sec, LOAD: 0.3963 sec.

leezu commented 5 years ago

Thanks for reporting this issue @mohammedkhalilia. I think the problem is that get_model expects a gluonnlp.Vocab object for its vocab parameter. However, you passed the name of the JSON file from which the gluonnlp.Vocab object can be constructed. The correct usage would be to run

with open(vocab, 'r') as fh:
    vocab = gluonnlp.Vocab.from_json(fh.read())
params = {
    'dataset_name': None,
    'vocab': vocab,
    'pretrained': False,
    'ctx': ctx,
    'use_pooler': False,
    'use_decoder': False,
    'use_classifier': False,
    'dropout': 0.1,
    'embed_dropout': 0.1
}
bert_model, text_vocab = gluonnlp.model.get_model('bert_12_768_12', **params)

Does that work for you? Closing the issue, but please reopen in case of any problems.

Also note the use of gluonnlp.Vocab.from_json(fh.read()) instead of gluonnlp.Vocab().from_json(fh.read()).
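The difference between the two spellings can be shown with a toy class. This is a sketch that assumes from_json is defined as a classmethod, as the class-level call suggests; ToyVocab is hypothetical and only mimics the calling convention, not the real gluonnlp.Vocab.

```python
import json


class ToyVocab:
    """Hypothetical stand-in for a vocabulary class with a JSON constructor."""

    def __init__(self, tokens=None):
        self.idx_to_token = list(tokens or [])

    @classmethod
    def from_json(cls, json_str):
        # Builds a new instance from the serialized token list.
        return cls(json.loads(json_str)["idx_to_token"])

    def __len__(self):
        return len(self.idx_to_token)


# Called on the class itself: no throwaway instance is constructed first.
vocab = ToyVocab.from_json('{"idx_to_token": ["[PAD]", "[UNK]", "[CLS]"]}')
print(len(vocab))
```

With a classmethod, calling through an instance would build an empty object only to discard it, so the class-level call is the cleaner form.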