Farahn / AES

Automatic Essay Scoring

An inconsistency occurred during the construction of X_train #2

Closed zysNLP closed 5 years ago

zysNLP commented 5 years ago

When I used the following code from BERT_text_representation.py (line 54) to construct X_train, the elements of the X_train list had inconsistent shapes:

```python
SEQUENCE_LENGTH = 40
SEQUENCE_LENGTH_D = 25
SEQUENCE_LEN_D = SEQUENCE_LENGTH_D
SEQUENCE_LEN = SEQUENCE_LENGTH

X_train = []
for i in train_essays[:1]:
    print('processing train_essay', train_essays.index(i), 'in', len(train_essays), '...')
    i = sent_tokenize(i)
    X_train.extend(bc.encode(i[:SEQUENCE_LEN_D]).tolist())
    for k in range(max(SEQUENCE_LEN_D - (len(i)), 0)):
        X_train.append([[0]*b_len]*SEQUENCE_LEN)  # pad token maps to 0
```

Since my `train_essays[:1]` contains 6 sentences, running this code gives an X_train consisting of 6 lists of shape (768,) and (25 - 6) = 19 lists of shape (40, 768). So when I then execute `X_train = np.array(X_train)`, the result is an array of dtype "object". I think something must be wrong here.
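A minimal, self-contained repro of the shape mismatch (zero vectors standing in for the real BERT encodings):

```python
import numpy as np

b_len = 768          # BERT hidden size
SEQUENCE_LEN = 40    # max words per sentence

# six sentence-level vectors of shape (768,), standing in for
# bc.encode(...).tolist() under a pooled (one-vector-per-sentence) strategy
ragged = [[0.0] * b_len for _ in range(6)]
# plus nineteen padding entries of shape (40, 768)
ragged += [[[0.0] * b_len] * SEQUENCE_LEN for _ in range(19)]

# the element shapes disagree, so NumPy cannot form a regular ndarray;
# older NumPy silently falls back to dtype=object, newer NumPy raises
# unless dtype=object is passed explicitly
X_train = np.array(ragged, dtype=object)
print(X_train.shape, X_train.dtype)  # (25,) object
```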

So I changed `X_train.append([[0]*b_len]*SEQUENCE_LEN)` to `X_train.append([[0]*b_len][0])` (which is just `[0]*b_len`), so that every element of X_train has the same shape; then `X_train = np.array(X_train)` gives an array of shape (25, 768).

I am not sure whether these changes are right, so I am looking forward to your reply. Thank you very much!

zysNLP commented 5 years ago

I found that the result should have shape (25, 40, 768), not (25, 768). So I changed the code to:

```python
X_train = []
for i in df_train['text1']:
    print(df_train['text1'].tolist().index(i))
    i = sent_tokenize(i)
    a = bc.encode(i[:SEQUENCE_LEN_D])
    a = [[list(j)]*40 for j in a]
    X_train.extend(a)
    for k in range(max(SEQUENCE_LEN_D - (len(i)), 0)):
        X_train.append([[0]*b_len]*SEQUENCE_LEN)  # pad token maps to 0
```

to make the shapes consistent.
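As a sanity check, here is a small synthetic version of that tiling step (random vectors standing in for the `bc.encode` output), confirming the resulting shape:

```python
import numpy as np

b_len, SEQUENCE_LEN, SEQUENCE_LEN_D = 768, 40, 25

# pretend bc.encode returned 6 pooled sentence vectors of length 768
a = np.random.rand(6, b_len)
# repeat each pooled vector 40 times -> one (40, 768) block per sentence
a = [[list(j)] * SEQUENCE_LEN for j in a]

X_train = list(a)
# pad with all-zero blocks up to SEQUENCE_LEN_D sentences
for _ in range(max(SEQUENCE_LEN_D - len(a), 0)):
    X_train.append([[0] * b_len] * SEQUENCE_LEN)

print(np.array(X_train).shape)  # (25, 40, 768)
```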

Farahn commented 5 years ago

You should not need to modify the code; the matrix X_train should be of dimension (25, 40, 768), which is (max number of sentences, max number of words per sentence, BERT vector length for each word). When you initialize the BERT serving client, set the pooling_strategy to NONE, and it should resolve this issue.
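For reference, a sketch of that setup with the standard bert-as-service package (note the pooling strategy is a flag passed when the server is started; the model path below is just a placeholder):

```python
from bert_serving.client import BertClient

# the server must be started with token-level output, e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 \
#                      -pooling_strategy NONE -max_seq_len 40
bc = BertClient()

vecs = bc.encode(['This is the first sentence.', 'And a second one.'])
# with pooling_strategy NONE, encode() returns one vector per token slot
print(vecs.shape)  # (2, 40, 768)
```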

zysNLP commented 5 years ago

OK! Thank you very much