Closed zysNLP closed 5 years ago
I found that the results should have shape (25, 40, 768), not (25, 768), so I changed the code as follows to make the shapes consistent:

X_train = []
for i in df_train['text1']:
    print(df_train['text1'].tolist().index(i))
    # split the essay into sentences and encode at most SEQUENCE_LEN_D of them
    i = sent_tokenize(i)
    a = bc.encode(i[:SEQUENCE_LEN_D])
    # repeat each pooled sentence vector 40 times to get shape (40, 768)
    a = [[list(j)] * 40 for j in a]
    X_train.extend(a)
    # pad the remaining sentence slots with zeros (pad token maps to 0)
    for k in range(max(SEQUENCE_LEN_D - len(i), 0)):
        X_train.append([[0] * b_len] * SEQUENCE_LEN)
You should not need to modify the code; the matrix X_train should be of dimension (25, 40, 768), which is (max number of sentences, max number of words per sentence, BERT vector length for each word). When you start the BERT serving server, set pooling_strategy to NONE, and that should resolve this issue.
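For reference, here is a minimal sketch of that setup, assuming the standard bert-as-service server and client; the model directory and the example sentences are placeholders:

# Start the server first in a shell (model directory is a placeholder):
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 \
#       -pooling_strategy NONE -max_seq_len 40
from bert_serving.client import BertClient

bc = BertClient()

# With pooling_strategy NONE the server skips sentence-level pooling and
# returns one 768-dim vector per token position, padded/truncated to
# max_seq_len, so the result has shape (num_sentences, 40, 768).
vecs = bc.encode(['This is the first sentence.', 'And this is the second one.'])
print(vecs.shape)  # expected: (2, 40, 768)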
OK! Thank you very much
When I used the following code from BERT_text_representation.py (line 54) to construct X_train, the elements of the X_train list came out with inconsistent shapes:
SEQUENCE_LENGTH = 40
SEQUENCE_LENGTH_D = 25
SEQUENCE_LEN_D = SEQUENCE_LENGTH_D
SEQUENCE_LEN = SEQUENCE_LENGTH

X_train = []
for i in train_essays[:1]:
    print('processing train_essay', train_essays.index(i), 'in', len(train_essays), '...')
As my train_essays[:1] has 6 sentences, when I execute this code I get an X_train list consisting of 6 lists with shape (768,) and (25 - 6) = 19 lists with shape (40, 768), so if I then execute the next line, X_train = np.array(X_train), X_train ends up as an array of dtype "object". I think something must be wrong.
So I changed the line X_train.append([[0]*b_len]*SEQUENCE_LEN) to X_train.append([[0]*b_len][0]) so that every element of X_train has the same shape; then executing X_train = np.array(X_train) gives an np.array with shape (25, 768).
I am not sure whether these changes are right, so I am looking forward to your reply. Thank you very much!
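For illustration, here is a minimal sketch (using dummy zero vectors in place of real BERT output; the sizes 6, 25, 40 and 768 follow the example above) of why the mixed element shapes force an object array while consistent shapes do not:

import numpy as np

num_sents, max_sents, b_len, seq_len = 6, 25, 768, 40

# Original situation: 6 pooled sentence vectors of shape (768,) mixed with
# 19 zero pads of shape (40, 768); numpy can only hold this as dtype=object.
mixed = [np.zeros(b_len) for _ in range(num_sents)]
mixed += [np.zeros((seq_len, b_len)) for _ in range(max_sents - num_sents)]
print(np.array(mixed, dtype=object).shape)  # (25,), elements of differing shape

# Padding with flat zero vectors (the change described above) keeps every
# element the same shape, so the array comes out as (25, 768).
flat = [np.zeros(b_len) for _ in range(max_sents)]
print(np.array(flat).shape)  # (25, 768)

# With pooling_strategy NONE each encoded sentence is already (40, 768), and
# padding with (40, 768) zero blocks yields the intended (25, 40, 768).
token_level = [np.zeros((seq_len, b_len)) for _ in range(max_sents)]
print(np.array(token_level).shape)  # (25, 40, 768)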