dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[TVM] TVM Integration Issue after changing to Boolean Mask. #1425

Closed: sxjscience closed this issue 3 years ago

sxjscience commented 3 years ago

Description

I'm changing the mask to use the boolean type in https://github.com/dmlc/gluon-nlp/pull/1405 so that AMP passes. However, the change causes issues in the TVM integration. I created this issue to track the error and will skip the TVM test for now.

test_models.py:145: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../tvm/python/tvm/relay/frontend/mxnet.py:2869: in from_mxnet
    func = _from_mxnet_impl(symbol, shape, dtype, params, mod)
../../tvm/python/tvm/relay/frontend/mxnet.py:2792: in _from_mxnet_impl
    res = _convert_map[op_name](*op_params)
../../tvm/python/tvm/relay/frontend/mxnet.py:793: in _mx_batch_dot
    a_shape = _infer_type(a).checked_type.shape
../../tvm/python/tvm/relay/frontend/common.py:482: in infer_type
    new_mod = _transform.InferType()(new_mod)
../../tvm/python/tvm/ir/transform.py:127: in __call__
    return _ffi_transform_api.RunPass(self, mod)
tvm/_ffi/_cython/./packed_func.pxi:321: in tvm._ffi._cy3.core.PackedFuncBase.__call__
    ???
tvm/_ffi/_cython/./packed_func.pxi:256: in tvm._ffi._cy3.core.FuncCall
    ???
tvm/_ffi/_cython/./packed_func.pxi:245: in tvm._ffi._cy3.core.FuncCall3
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   tvm._ffi.base.TVMError: Traceback (most recent call last):
E     [bt] (7) /home/ubuntu/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7fd9125f2095]
E     [bt] (6) /home/ubuntu/tvm/build/libtvm.so(+0x6fc086) [0x7fd911bbc086]
E     [bt] (5) /home/ubuntu/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x1ee) [0x7fd911bbb85e]
E     [bt] (4) /home/ubuntu/tvm/build/libtvm.so(+0xf662b8) [0x7fd9124262b8]
E     [bt] (3) /home/ubuntu/tvm/build/libtvm.so(+0xf65495) [0x7fd912425495]
E     [bt] (2) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)+0x67) [0x7fd912424947]
E     [bt] (1) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeSolver::Solve()+0xc37) [0x7fd9122b5d67]
E     [bt] (0) /home/ubuntu/tvm/build/libtvm.so(+0xdf21c2) [0x7fd9122b21c2]
E     [bt] (8) /home/ubuntu/tvm/build/libtvm.so(+0x6fc086) [0x7fd911bbc086]
E     [bt] (7) /home/ubuntu/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x1ee) [0x7fd911bbb85e]
E     [bt] (6) /home/ubuntu/tvm/build/libtvm.so(+0xf662b8) [0x7fd9124262b8]
E     [bt] (5) /home/ubuntu/tvm/build/libtvm.so(+0xf65495) [0x7fd912425495]
E     [bt] (4) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)+0x67) [0x7fd912424947]
E     [bt] (3) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeSolver::Solve()+0x375) [0x7fd9122b54a5]
E     [bt] (2) /home/ubuntu/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<bool (tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>::AssignTypedLambda<bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>(bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x63b) [0x7fd911c0f36b]
E     [bt] (1) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::BroadcastRel(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)+0x350) [0x7fd91223f330]
E     [bt] (0) /home/ubuntu/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7fd911a89ba2]
E     File "../src/relay/analysis/type_solver.cc", line 621
E   TVMError: 
E   ---------------------------------------------------------------
E   An internal invariant was violated during the execution of TVM.
E   Please read TVM's error reporting guidelines.
E   More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
E   ---------------------------------------------------------------
E     Check failed: false == false: [21:48:08] ../src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. bool) :
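
For reference, the failing check is easy to trigger in isolation. Below is a minimal sketch (my own construction, not the failing ALBERT graph itself): Relay's broadcast type relation requires both operands of an elementwise op to share a dtype, which is exactly what a float32 tensor combined with a boolean mask violates.

```python
# Minimal repro sketch (hypothetical, not extracted from the failing model):
# Relay's BroadcastRel requires matching operand dtypes, so multiplying a
# float32 tensor by a bool mask fails type inference with the same check.
import tvm
from tvm import relay

x = relay.var("x", shape=(2, 4), dtype="float32")
mask = relay.var("mask", shape=(2, 4), dtype="bool")
mod = tvm.IRModule.from_expr(relay.Function([x, mask], relay.multiply(x, mask)))
# Raises TVMError: Check failed: t0->dtype == t1->dtype (float32 vs. bool)
relay.transform.InferType()(mod)
```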
Zha0q1 commented 3 years ago

On my EC2 instance I commented out 'google_albert_base_v2' and the rest of the models seemed to work fine.

I created this PR to re-enable the tests: https://github.com/dmlc/gluon-nlp/pull/1437
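
For reference, a hypothetical sketch of what re-enabling looks like (my assumption about the shape of the change, not a diff from the PR): parametrize the TVM test only over the models that currently convert, and leave google_albert_base_v2 out until this issue is fixed.

```python
import pytest

# Hypothetical sketch: keep the TVM test running for the models that pass,
# with the failing one commented out until this issue is resolved.
TVM_TEST_MODELS = [
    'google_en_cased_bert_base',
    'google_electra_small',
    'fairseq_bart_base',
    # 'google_albert_base_v2',  # fails, see this issue
]

@pytest.mark.parametrize('model_name', TVM_TEST_MODELS)
def test_tvm_integration(model_name):
    ...  # body as in tests/test_models.py shown below
```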

Zha0q1 commented 3 years ago

The error I got:

==================================================== FAILURES ====================================================
____________________________ test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] _____________________________
model_name = 'google_albert_base_v2', batch_size = 1, seq_length = 4, layout = 'TN', ctx = cpu(0)
    @pytest.mark.serial
    @pytest.mark.seed(123)
    @pytest.mark.parametrize('model_name',
                             ['google_albert_base_v2'])
    @pytest.mark.parametrize('batch_size,seq_length', [(1, 4)])
    @pytest.mark.parametrize('layout', ['TN'])
    # @pytest.mark.skipif(not tvm_enabled(),
    #                    reason='TVM is not supported. So this test is skipped.')
    # @pytest.mark.skip('TVM issue https://github.com/dmlc/gluon-nlp/issues/1425.')
    def test_tvm_integration(model_name, batch_size, seq_length, layout, ctx):
        tvm = try_import_tvm()
        from tvm import relay
        from tvm.contrib import graph_runtime
        tvm_recommended_flags = get_ec2_tvm_flags()
        if ctx.device_type == 'gpu':
            flags = tvm_recommended_flags['g4']
        elif ctx.device_type == 'cpu':
            flags = tvm_recommended_flags['c4']
            if model_name != 'google_albert_base_v2':
                # Skip all other tests
                return
        else:
            raise NotImplementedError
        with tempfile.TemporaryDirectory() as root, ctx:
            model_cls, cfg, tokenizer, backbone_param_path, _ = get_backbone(model_name, root=root)
            cfg.defrost()
            cfg.MODEL.layout = layout
            cfg.freeze()
            model = model_cls.from_cfg(cfg)
            model.load_parameters(backbone_param_path)
            model.hybridize()
            if layout == 'NT':
                token_ids = mx.np.random.randint(0, cfg.MODEL.vocab_size, (batch_size, seq_length),
                                                 dtype=np.int32)
                token_types = mx.np.random.randint(0, 2, (batch_size, seq_length), dtype=np.int32)
                valid_length = mx.np.random.randint(seq_length // 2, seq_length, (batch_size,),
                                                    dtype=np.int32)
            else:
                token_ids = mx.np.random.randint(0, cfg.MODEL.vocab_size, (seq_length, batch_size),
                                                 dtype=np.int32)
                token_types = mx.np.random.randint(0, 2, (seq_length, batch_size), dtype=np.int32)
                valid_length = mx.np.random.randint(seq_length // 2, seq_length, (batch_size,),
                                                    dtype=np.int32)
            if 'bart' in model_name:
                mx_out = model(token_ids, valid_length, token_ids, valid_length)
                shape_dict = {
                    'data0': token_ids.shape,
                    'data1': valid_length.shape,
                    'data2': token_ids.shape,
                    'data3': valid_length.shape,
                }
                dtype_dict = {
                    'data0': token_ids.dtype.name,
                    'data1': valid_length.dtype.name,
                    'data2': token_ids.dtype.name,
                    'data3': valid_length.dtype.name,
                }
            elif 'roberta' in model_name or 'xlmr' in model_name:
                mx_out = model(token_ids, valid_length)
                shape_dict = {
                    'data0': token_ids.shape,
                    'data1': valid_length.shape,
                }
                dtype_dict = {
                    'data0': token_ids.dtype.name,
                    'data1': valid_length.dtype.name,
                }
            else:
                mx_out = model(token_ids, token_types, valid_length)
                shape_dict = {
                    'data0': token_ids.shape,
                    'data1': token_types.shape,
                    'data2': valid_length.shape
                }
                dtype_dict = {
                    'data0': token_ids.dtype.name,
                    'data1': token_types.dtype.name,
                    'data2': valid_length.dtype.name
                }
            sym = model._cached_graph[1]
            params = {}
            for k, v in model.collect_params().items():
                params[v._var_name] = tvm.nd.array(v.data().asnumpy())
>           mod, params = relay.frontend.from_mxnet(sym, shape=shape_dict, dtype=dtype_dict, arg_params=params)
tests/test_models.py:143: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../tvm/python/tvm/relay/frontend/mxnet.py:2869: in from_mxnet
    func = _from_mxnet_impl(symbol, shape, dtype, params, mod)
../tvm/python/tvm/relay/frontend/mxnet.py:2792: in _from_mxnet_impl
    res = _convert_map[op_name](*op_params)
../tvm/python/tvm/relay/frontend/mxnet.py:793: in _mx_batch_dot
    a_shape = _infer_type(a).checked_type.shape
../tvm/python/tvm/relay/frontend/common.py:482: in infer_type
    new_mod = _transform.InferType()(new_mod)
../tvm/python/tvm/ir/transform.py:127: in __call__
    return _ffi_transform_api.RunPass(self, mod)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <tvm.runtime.packed_func.PackedFunc object at 0x7fc8404d4be0>
args = (Run Module pass: InferType at the optimization level 0, #[version = "0.0.5"]
def @main(%data2: Tensor[(1), int32], %v...=", 
    "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAAAAAAAAAgAQAEAAAAAAAAAAEAAAA="
  ], 
  "attrs": {"tvm_version": "0.8.dev0"}
})
temp_args = [], values = <tvm._ffi._ctypes.packed_func.TVMValue_Array_2 object at 0x7fc834ee1680>
tcodes = <mxnet._ffi._ctypes.function.c_int_Array_2 object at 0x7fc834ee1830>
    def __call__(self, *args):
        """Call the function with positional arguments
        args : list
           The positional arguments to the function call.
        """
        temp_args = []
        values, tcodes, num_args = _make_tvm_args(args, temp_args)
        ret_val = TVMValue()
        ret_tcode = ctypes.c_int()
        if (
            _LIB.TVMFuncCall(
                self.handle,
                values,
                tcodes,
                ctypes.c_int(num_args),
                ctypes.byref(ret_val),
                ctypes.byref(ret_tcode),
            )
            != 0
        ):
>           raise get_last_ffi_error()
E           tvm._ffi.base.TVMError: Traceback (most recent call last):
E             [bt] (7) /home/ubuntu/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7fc842ca3595]
E             [bt] (6) /home/ubuntu/tvm/build/libtvm.so(+0x7007c2) [0x7fc84225f7c2]
E             [bt] (5) /home/ubuntu/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x1b7) [0x7fc84225f077]
E             [bt] (4) /home/ubuntu/tvm/build/libtvm.so(+0xfcee2f) [0x7fc842b2de2f]
E             [bt] (3) /home/ubuntu/tvm/build/libtvm.so(+0xfce085) [0x7fc842b2d085]
E             [bt] (2) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)+0x67) [0x7fc842b2c637]
E             [bt] (1) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeSolver::Solve()+0xd39) [0x7fc8429b3269]
E             [bt] (0) /home/ubuntu/tvm/build/libtvm.so(+0xe50402) [0x7fc8429af402]
E             [bt] (8) /home/ubuntu/tvm/build/libtvm.so(+0x7007c2) [0x7fc84225f7c2]
E             [bt] (7) /home/ubuntu/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x1b7) [0x7fc84225f077]
E             [bt] (6) /home/ubuntu/tvm/build/libtvm.so(+0xfcee2f) [0x7fc842b2de2f]
E             [bt] (5) /home/ubuntu/tvm/build/libtvm.so(+0xfce085) [0x7fc842b2d085]
E             [bt] (4) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)+0x67) [0x7fc842b2c637]
E             [bt] (3) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::TypeSolver::Solve()+0x36d) [0x7fc8429b289d]
E             [bt] (2) /home/ubuntu/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<bool (tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>::AssignTypedLambda<bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>(bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x7d7) [0x7fc8422b5f97]
E             [bt] (1) /home/ubuntu/tvm/build/libtvm.so(tvm::relay::BroadcastRel(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)+0x404) [0x7fc842937014]
E             [bt] (0) /home/ubuntu/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7fc84211beb2]
E             File "/home/ubuntu/tvm/src/relay/analysis/type_solver.cc", line 621
E           TVMError: 
E           ---------------------------------------------------------------
E           An internal invariant was violated during the execution of TVM.
E           Please read TVM's error reporting guidelines.
E           More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
E           ---------------------------------------------------------------
E             Check failed: false == false: [20:17:50] /home/ubuntu/tvm/src/relay/op/type_relations.cc:107: 
E           ---------------------------------------------------------------
E           An internal invariant was violated during the execution of TVM.
E           Please read TVM's error reporting guidelines.
E           More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
E           ---------------------------------------------------------------
E           
E             Check failed: t0->dtype == t1->dtype (float32 vs. int32) :
../tvm/python/tvm/_ffi/_ctypes/packed_func.py:237: TVMError
---------------------------------------------- Captured stdout call ----------------------------------------------
Downloading /tmp/tmpgmgdq8n2/google_albert_base_v2/spm-65999e5d.model from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/google_albert_base_v2/spm-65999e5d.model...
Downloading /tmp/tmpgmgdq8n2/google_albert_base_v2/vocab-2ee53ae7.json from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/google_albert_base_v2/vocab-2ee53ae7.json...
Downloading /tmp/tmpgmgdq8n2/google_albert_base_v2/model-125be477.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/google_albert_base_v2/model-125be477.params...
---------------------------------------------- Captured stderr call ----------------------------------------------
100%|██████████| 760k/760k [00:00<00:00, 8.96MiB/s]
100%|██████████| 373k/373k [00:00<00:00, 9.29MiB/s]
100%|██████████| 46.7M/46.7M [00:01<00:00, 46.6MiB/s]
[20:17:50] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
============================================ short test summary info =============================================
FAILED tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] - tvm._ffi.base.TVMError: ...
=============================================== 1 failed in 2.99s ================================================

sxjscience commented 3 years ago

I think one potential cause is that TVM does not allow mixed data types in the where operator, e.g., https://github.com/apache/incubator-tvm/blob/7649075fbb71ecab0a41c6fe4d41a86724e42e7a/python/tvm/relay/frontend/mxnet.py#L2419-L2434. We can print the dtypes of cond, lhs, and rhs to see whether this is the root cause.
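
As a minimal sketch of that debugging step (a hypothetical helper, not code from the frontend), the infer_type utility from tvm.relay.frontend.common, which already appears in the tracebacks above, can report each operand's dtype:

```python
# Hypothetical debugging helper: print the inferred dtype of the three
# operands the where converter receives, to confirm whether they are mixed.
from tvm.relay.frontend.common import infer_type

def dump_where_dtypes(cond, lhs, rhs):
    for name, expr in (("cond", cond), ("lhs", lhs), ("rhs", rhs)):
        print(name, infer_type(expr).checked_type.dtype)
```

If the dtypes do turn out to be mixed, casting both branches to a common dtype with relay.cast before emitting the where op would be one possible direction.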

Zha0q1 commented 3 years ago

#1437 passed the CPU CI, but on GPU the remaining three models still all failed:

'google_en_cased_bert_base', 'google_electra_small', 'fairseq_bart_base'