Feature request: using BERT and ELMo embedding in TextInputter

atebbifakhr commented 5 years ago

Hi,

Do you have any plans to support contextualized embeddings such as BERT and ELMo in TextInputter?

guillaumekln commented 5 years ago

Hi,

I will probably not work on that directly but I'm interested in making sure that users can integrate it without too much pain. Right now, it seems possible to extend TextInputter and override the make_inputs method. All this could be done directly in the user model definition file without changing the OpenNMT-tf code.

atebbifakhr commented 5 years ago

Thanks for the information.

atebbifakhr commented 4 years ago

Hi,

I want to use BERT representations in my model. To do so, I overrode WordEmbedder as follows:

class MyEmbedder(onmt.inputters.WordEmbedder):
    def make_features(self, element=None, features=None, training=None):
        features = super(MyEmbedder, self).make_features(
            element=element, features=features, training=training)
        def _python_wrapper(element):
            element = tf.compat.as_text(element.numpy())
            bert_tokenized = bert_tokenizer.encode_plus(
                element,
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = 128, # max length of the text that can go to BERT
                pad_to_max_length = True, # add [PAD] tokens
                return_attention_mask = True, # add attention mask to not focus on pad tokens 
            )
            return bert_tokenized["input_ids"], bert_tokenized["attention_mask"], bert_tokenized["token_type_ids"]
        input_ids, attention_mask, token_type_ids = tf.py_function(_python_wrapper, [element], [tf.int32, tf.int32, tf.int32])
        features["bert_input_ids"] = input_ids
        features["bert_token_type_ids"] = token_type_ids
        features["bert_attention_mask"] = attention_mask
        return features

But I got this exception:

/usr/local/lib/python3.6/dist-packages/opennmt/inputters/inputter.py in make_training_dataset(self, features_file, labels_file, batch_size, batch_type, batch_multiplier, batch_size_multiple, shuffle_buffer_size, length_bucket_width, maximum_features_length, maximum_labels_length, single_pass, num_shards, shard_index, num_threads, prefetch_buffer_size, cardinality_multiple, weights)
    577 shuffle_buffer_size=shuffle_buffer_size,
    578 prefetch_buffer_size=prefetch_buffer_size,
--> 579 cardinality_multiple=cardinality_multiple)(dataset)
    580 return dataset

/usr/local/lib/python3.6/dist-packages/opennmt/data/dataset.py in _pipeline(dataset)
    554 batch_size_multiple=batch_size_multiple,
    555 length_bucket_width=length_bucket_width,
--> 556 length_fn=[features_length_fn, labels_length_fn]))
    557 dataset = dataset.apply(filter_irregular_batches(batch_multiplier))
    558 if not single_pass:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in apply(self, transformation_func)
   1741 dataset.
   1742 """
-> 1743 dataset = transformation_func(self)
   1744 if not isinstance(dataset, DatasetV2):
   1745 raise TypeError(

/usr/local/lib/python3.6/dist-packages/opennmt/data/dataset.py in <lambda>(dataset)
    324 """
    325 return lambda dataset: dataset.padded_batch(
--> 326 batch_size, padded_shapes=padded_shapes or _get_output_shapes(dataset))
    327
    328 def batch_sequence_dataset(batch_size,

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in padded_batch(self, batch_size, padded_shapes, padding_values, drop_remainder)
   1479 """
   1480 return PaddedBatchDataset(self, batch_size, padded_shapes, padding_values,
-> 1481 drop_remainder)
   1482
   1483 def map(self, map_func, num_parallel_calls=None):

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in __init__(self, input_dataset, batch_size, padded_shapes, padding_values, drop_remainder)
   3811 nest.flatten(input_shapes), flat_padded_shapes):
   3812 flat_padded_shapes_as_tensors.append(
-> 3813 _padded_shape_to_tensor(padded_shape, input_component_shape))
   3814
   3815 self._padded_shapes = nest.pack_sequence_as(input_shapes,

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py in _padded_shape_to_tensor(padded_shape, input_component_shape)
   3721 # tf.TensorShape, so fall back on the conversion to tensor
   3722 # machinery.
-> 3723 ret = ops.convert_to_tensor(padded_shape, preferred_dtype=dtypes.int64)
   3724 if ret.shape.dims is not None and len(ret.shape.dims) != 1:
   3725 six.reraise(ValueError, ValueError(

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1312
   1313 if ret is None:
-> 1314 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1315
   1316 if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref)
    332 if not s.is_fully_defined():
    333 raise ValueError(
--> 334 "Cannot convert a partially known TensorShape to a Tensor: %s" % s)
    335 s_list = s.as_list()
    336 int64_value = 0

ValueError: Cannot convert a partially known TensorShape to a Tensor:

Do you know what the problem is?

Thanks a lot!

atebbifakhr commented 4 years ago

I should mention that I'm using TensorFlow 2.2 and the latest version of OpenNMT-tf.

guillaumekln commented 4 years ago

Can you check if the shape of the tensors returned by tf.py_function is defined? You may need to set it manually, see for example:

https://github.com/OpenNMT/OpenNMT-tf/blob/v2.9.3/opennmt/tokenizers/tokenizer.py#L163-L164
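
For illustration, here is a minimal sketch of what setting the shape manually could look like. It is not taken from the OpenNMT-tf code base: `_fake_encode` is a hypothetical stand-in for `bert_tokenizer.encode_plus` so that the example only depends on TensorFlow, and `MAX_LENGTH` is assumed to match the `max_length = 128` used in the snippet above. The point is only that tensors returned by tf.py_function have an unknown static shape until set_shape is called, and that batching needs that shape to be known.

```python
import tensorflow as tf

MAX_LENGTH = 128  # assumed to match max_length in the snippet above


def _fake_encode(text):
    """Hypothetical stand-in for bert_tokenizer.encode_plus.

    Returns fixed-length "ids" (here simply token lengths) and an attention mask.
    """
    text = tf.compat.as_text(text.numpy())
    ids = [len(token) for token in text.split()][:MAX_LENGTH]
    mask = [1] * len(ids)
    padding = [0] * (MAX_LENGTH - len(ids))
    return ids + padding, mask + padding


def make_features(element):
    input_ids, attention_mask = tf.py_function(
        _fake_encode, [element], [tf.int32, tf.int32])
    # tf.py_function cannot infer output shapes, so declare them explicitly.
    input_ids.set_shape([MAX_LENGTH])
    attention_mask.set_shape([MAX_LENGTH])
    return {
        "bert_input_ids": input_ids,
        "bert_attention_mask": attention_mask,
    }


dataset = tf.data.Dataset.from_tensor_slices(
    ["hello world", "a longer example sentence"])
dataset = dataset.map(make_features)
# With fully defined element shapes, padded batching no longer fails.
dataset = dataset.padded_batch(2)
batch = next(iter(dataset))
```

In the MyEmbedder class above, the equivalent change would presumably be to call set_shape on input_ids, attention_mask and token_type_ids right after the tf.py_function call, before storing them in features.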

Feature request: using BERT and ELMo embedding in TextInputter #422