bojone / bert4keras

keras implement of transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0

How to introduce features after BERT encoding (hand-crafted domain features) #360

Open konL opened 3 years ago

konL commented 3 years ago

When asking, please provide as much of the following information as possible:

Basic information

Core code


def build_bert_model(config_path, checkpoint_path, class_nums):
    bert = build_transformer_model(....)
    cls_features = ....
    all_token_embedding = ....
    cnn_features = textcnn(all_token_embedding, bert.initializer)
    ......
    output = .......

    model = keras.models.Model(bert.model.input, output)
    return model

def create_mlp(dim, regress=False):
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))
    if regress:
        model.add(Dense(1, activation="linear"))
    return model
..........................................................
# fused model input
mlp = m.create_mlp(features.shape[1], regress=False)
cnn = m.build_bert_model(config_path,checkpoint_path,2)
combinedInput = concatenate([mlp.output, cnn.output])
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

Output

# Paste your debug output here
ValueError: Input tensors to a Model must come from `keras.layers.Input`. Received: [<tf.Tensor 'Input-Token:0' shape=(None, None) dtype=float32>, <tf.Tensor 'Input-Segment:0' shape=(None, None) dtype=float32>] (missing previous layer metadata).

What I tried

[Goal] In a BERT setting, introduce extra features after encoding (e.g. KG information, or hand-crafted domain features), fuse them with the BERT encoding vector, and then use the result for the prediction task.

[Approach]

  1. Obtain the embedding via BERT and feed it into the downstream CNN model: the train_generator produced by data_generator is fed into the model (so BERT's inputs are the token ids, segment ids, and labels), as in the code below.
    
    # model
    def build_bert_model(config_path, checkpoint_path, class_nums):
        bert = build_transformer_model(.......)
        cls_features = (.....)(bert.model.output)  # shape=[batch_size,768]
        all_token_embedding = (.......)(bert.model.output)  # shape=[batch_size,maxlen-2,768]
        cnn_features = textcnn(all_token_embedding.......)  # shape=[batch_size,cnn_output_dim]
        ...............
        model = keras.models.Model(bert.model.input, output)
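
For reference, a minimal sketch of what a textcnn with that output shape could look like; the kernel sizes and filter count are assumptions, since the issue never shows the real implementation:

    from keras.layers import Conv1D, GlobalMaxPooling1D, concatenate

    def textcnn(inputs, kernel_initializer):
        # parallel 1D convolutions over the token embeddings,
        # max-pooled over time and concatenated
        pooled = []
        for kernel_size in [3, 4, 5]:
            conv = Conv1D(
                filters=256,
                kernel_size=kernel_size,
                activation='relu',
                kernel_initializer=kernel_initializer,
            )(inputs)
            pooled.append(GlobalMaxPooling1D()(conv))  # shape=[batch_size, 256]
        return concatenate(pooled)                     # shape=[batch_size, 768]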

Model input

class data_generator(DataGenerator):
    """Data generator"""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, text1, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, text1, maxlen=maxlen)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
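
To fuse the structured features later, this generator also has to hand them to the model. A minimal sketch, assuming each sample additionally carries a pre-normalized feature vector feats (a hypothetical name, not part of the original code):

    import numpy as np

    class data_generator(DataGenerator):
        """Data generator that also yields a dense feature vector per sample."""
        def __iter__(self, random=False):
            batch_token_ids, batch_segment_ids, batch_feats, batch_labels = [], [], [], []
            for is_end, (text, text1, feats, label) in self.sample(random):
                token_ids, segment_ids = tokenizer.encode(text, text1, maxlen=maxlen)
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_feats.append(feats)  # the pre-normalized structured features
                batch_labels.append([label])
                if len(batch_token_ids) == self.batch_size or is_end:
                    batch_token_ids = sequence_padding(batch_token_ids)
                    batch_segment_ids = sequence_padding(batch_segment_ids)
                    batch_feats = np.array(batch_feats)
                    batch_labels = sequence_padding(batch_labels)
                    # three inputs; the order must match the model's inputs list
                    yield [batch_token_ids, batch_segment_ids, batch_feats], batch_labels
                    batch_token_ids, batch_segment_ids, batch_feats, batch_labels = [], [], [], []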


2. The structured data is a prepared CSV (column 20 is the label); after normalization and similar preprocessing it also becomes a feature embedding, for example:

<meta name="Generator" content="Microsoft Excel">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
x\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<style>
<!--.font0
    {color:#000000;
    font-size:11.0pt;
    font-family:宋体;
    font-weight:400;
    font-style:normal;
    text-decoration:none;}
br
    {mso-data-placement:same-cell;}
td
    {padding-top:1px;
    padding-left:1px;
    padding-right:1px;
    mso-ignore:padding;
    color:#000000;
    font-size:11.0pt;
    font-weight:400;
    font-style:normal;
    text-decoration:none;
    font-family:宋体;
    mso-generic-font-family:auto;
    mso-font-charset:134;
    mso-number-format:General;
    border:none;
    mso-background-source:auto;
    mso-pattern:auto;
    text-align:general;
    vertical-align:middle;
    white-space:nowrap;
    mso-rotate:0;
    mso-protection:locked visible;}
-->
</style>

<!--StartFragment-->

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
21 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0.75 | 0
24 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 1 | 0.666666667 | 0
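
A minimal sketch of that normalization step (assuming pandas and sklearn's MinMaxScaler; the file name is hypothetical):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv('features.csv', header=None)  # hypothetical file name
    labels = df.iloc[:, 20].values                 # column 20 is the label
    features = MinMaxScaler().fit_transform(df.iloc[:, :20].values)

features then has shape [num_samples, 20] and can be passed to create_mlp(features.shape[1]).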

3. Multi-input models are usually handled as shown below, but BERT's input does not seem to be an Input layer. How should it be fused with the existing features and fed into the model?

mlp = models.create_mlp(....)
cnn = models.create_cnn(.......)
combinedInput = concatenate([mlp.output, cnn.output])
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

bojone commented 3 years ago

A Keras model's inputs must be either a single Input layer or a list of Input layers.

Have you actually looked at what [mlp.input, cnn.input] contains?

konL commented 3 years ago

mlp.input is an Input layer, but cnn.input is a list of tensors: [<tf.Tensor 'Input-Token:0' shape=(None, None) dtype=float32>, <tf.Tensor 'Input-Segment:0' shape=(None, None) dtype=float32>]
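
That nesting is exactly what triggers the ValueError above: [mlp.input, cnn.input] is [Input, [Input, Input]], and Keras cannot unpack the inner list. A minimal sketch of one fix (an assumption pieced together from this thread, not a confirmed answer) is to flatten the list:

    combinedInput = concatenate([mlp.output, cnn.output])
    x = Dense(4, activation="relu")(combinedInput)
    x = Dense(1, activation="linear")(x)
    # cnn.input is already a list of the two BERT Input tensors (Input-Token, Input-Segment),
    # so splice the lists together instead of nesting them
    model = Model(inputs=cnn.input + [mlp.input], outputs=x)

The data generator then has to yield [batch_token_ids, batch_segment_ids, batch_feats] to match this input order, as in the generator sketch above.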

Chrisma-98 commented 2 years ago

My approach: modify zip_data(), data_generator, and the model structure (import Input and Concatenate from keras.layers).

[three screenshots of the modified code; not preserved here]

The training code needs no changes, and I got it running this way. I only just started learning Keras, so there is probably a better approach.
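
Pieced together from that description, the model-side change might look like the sketch below; since the screenshots are gone, every name and dimension here (n_features in particular) is an assumption, not Chrisma-98's actual code:

    from bert4keras.models import build_transformer_model
    from keras.layers import Input, Lambda, Dense, Concatenate
    from keras.models import Model

    bert = build_transformer_model(config_path, checkpoint_path, return_keras_model=False)

    # [CLS] vector, shape=[batch_size, 768]
    cls_features = Lambda(lambda x: x[:, 0])(bert.model.output)

    # extra Input layer for the normalized structured features
    feat_input = Input(shape=(n_features,), name='Input-Features')

    x = Concatenate()([cls_features, feat_input])
    x = Dense(4, activation='relu')(x)
    output = Dense(1, activation='linear')(x)

    # inputs: Input-Token, Input-Segment, plus the new feature Input;
    # the data generator must yield its batches in the same order
    model = Model(bert.model.input + [feat_input], output)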

WMeng1 commented 1 year ago

@Chrisma-98 I ran into the same problem and would like to ask about it:

model = build_transformer_model(config_path, checkpoint_path, model='albert')
category = Input(shape=(1,), name='feature_input')
categoryEmbedding = Embedding(1, 312)(category)
categoryEmbedding = Flatten()(categoryEmbedding)
# feature concatenation
input = model.input
input.append(category)

output = concatenate([model.output, model.input[2]])
output = GlobalPointer(len(categories), 64)(output)

model = Model(input, output)
model.summary()

When bert.output is used as the output, how should the concatenation of a (None, None, dims) tensor with a (None, dim) tensor be handled? Concatenating them directly raises a dimension-mismatch error.
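
One way to make those shapes compatible (a sketch under the assumptions above, not necessarily the best approach): broadcast the per-sample vector along the sequence axis before concatenating, e.g. by tiling it inside a Lambda layer:

    from keras import backend as K
    from keras.layers import Lambda, Concatenate

    def expand_and_tile(tensors):
        seq, vec = tensors                           # seq: (batch, seq_len, dims), vec: (batch, dim)
        vec = K.expand_dims(vec, axis=1)             # (batch, 1, dim)
        return K.tile(vec, [1, K.shape(seq)[1], 1])  # (batch, seq_len, dim)

    tiled = Lambda(expand_and_tile)([model.output, categoryEmbedding])
    output = Concatenate(axis=-1)([model.output, tiled])  # (batch, seq_len, dims+dim)
    output = GlobalPointer(len(categories), 64)(output)

This repeats the feature vector at every token position, so the concatenation along the last axis is well-defined.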