codekansas / keras-language-modeling

:book: Some language modeling tools for Keras
https://codekansas.github.io/language
MIT License

Any idea to get the performance to 70% #20

Open wailoktam opened 7 years ago

wailoktam commented 7 years ago

Hi, I mean without doing anything that is not in the dos Santos 2016 paper.

I am mentioning 70% because that is what the authors of that paper reported when using LSTM + attention on the insuranceQA data. I get 40-something, like codekansas. Can I be confident in blaming dos Santos for faking the result?

codekansas commented 7 years ago

I get the sense that it has something to do with fine-tuning the hyperparameters. Or maybe they used better pre-trained embeddings... The best result I've gotten so far was around 55%, using a generative RNN model plus an embedding layer, although I was hoping it would be better. I would be really interested to see if someone can duplicate their results.

codekansas commented 7 years ago

I was looking through some of it yesterday and realized my GESD implementation was broken. The fixed one is in the repo now; try with that. It may give better results, I'm not sure.
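
For context, GESD is the Geometric mean of Euclidean and Sigmoid Dot product similarity from the insuranceQA papers. A minimal numpy sketch of that formula, using the same gamma and c parameter names that appear in the conf later in this thread (an illustration, not the repo's implementation):

import numpy as np

def gesd_similarity(x, y, gamma=1.0, c=1.0):
    # geometric mean of a Euclidean term and a sigmoid-of-dot-product term
    euclidean_part = 1.0 / (1.0 + np.linalg.norm(x - y))
    sigmoid_part = 1.0 / (1.0 + np.exp(-gamma * (np.dot(x, y) + c)))
    return euclidean_part * sigmoid_part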

eshijia commented 7 years ago

Hi @codekansas, I have trained with your latest code in the repo. The result does not seem better than before:

Epoch 49 :: 2016-08-04 00:35:19 :: Train on 16686 samples, validate on 1854 samples
Epoch 1/1
16686/16686 [==============================] - 1071s - loss: 6.2260e-04 - val_loss: 9.1324e-04
Best: Loss = 0.000833536147339, Epoch = 36
2016-08-04 00:53:19 :: ----- test1 -----
[====================]Top-1 Precision: 0.316667
MRR: 0.469865
2016-08-04 01:05:08 :: ----- test2 -----
[====================]Top-1 Precision: 0.327222
MRR: 0.478177
2016-08-04 01:16:38 :: ----- dev -----
[====================]Top-1 Precision: 0.335000
MRR: 0.491637

wailoktam commented 7 years ago

Hi @codekansas @eshijia, I am trying other similarity measures to see how things go. I have also changed the training part a bit to make sure the bad answers are really bad answers when it randomly draws an answer from the answer pool. I will keep you guys posted.

codekansas commented 7 years ago

I trained the attention model and printed out some predicted and expected answers, then dumped them in this gist. You guys can decide for yourselves. I'm more or less ready to change datasets. The top-1 precision was still much worse than the basic embedding model.

eshijia commented 7 years ago

There is a theano version for this task (and the paper). Its results match the paper's. For the ConvolutionModel in keras_models.py, I have tried almost the same hyper-parameters as the theano version uses, but it doesn't give better results.

I haven't read the theano code carefully, but I believe the implementation is different from ours. When I have enough time, I will try to hack the code to find out whether I can improve it.

wailoktam commented 7 years ago

Just to report back that I had no luck with my try using cosine similarity.

wailoktam commented 7 years ago

Hi, I suggest we try using the V2 data. There is a choice of pool size; I think they may have gotten the 70% by using the smallest pool.

codekansas commented 7 years ago

I noticed the two scripts run for 2,000,000 (CNN) and 20,000,000 (LSTM+CNN) batches, so it must have taken a really long time to train. The results I included were after training for only about 30,000 batches.

wailoktam commented 7 years ago

20,000,000! That does not look realistic for departments without access to a supercomputer. It takes me a day to run 100 epochs with a batch size of 20. I would need 10,000 days to get to 70%...

eshijia commented 7 years ago

I have asked the author of the theano version. He told me that it took about 1 day to run the 20,000,000 batches on his Tesla GPU, but I don't think it really needs that many. In addition, he used character-level embeddings.

codekansas commented 7 years ago

Wow, I did not realize the Teslas were so fast... I'll just run it for a while on my 980ti, I suppose. Character-level embeddings, though? It looks like regular word embeddings here.

I would really like to replicate their result, haha.

wailoktam commented 7 years ago

Hi, will the code run on the tensorflow backend in its current state? I am asking because I think I need to run it on multiple GPUs to improve training speed. This thread says that Keras supports multiple GPUs when running with the tensorflow backend but not the theano backend. If it cannot run on the tensorflow backend at the moment, how can I change (hopefully just a couple of lines) to get it to run on tensorflow?
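
For reference, choosing the backend is a Keras-level setting rather than something in this repo's code; a minimal sketch of forcing the tensorflow backend (the environment variable must be set before keras is imported):

import os
os.environ['KERAS_BACKEND'] = 'tensorflow'  # must be set before the first `import keras`

import keras  # Keras picks the backend at import time (also configurable in ~/.keras/keras.json)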

codekansas commented 7 years ago

I think the performance really depends on how long you run it. I ran a CNN-LSTM model for ~700 epochs and got a precision of 0.52, going to run it longer to see if it improves.

conf = {
    'question_len': 150,
    'answer_len': 150,
    'n_words': 22353, # len(vocabulary) + 1
    'margin': 0.05,

    'training_params': {
        'print_answers': False,
        'save_every': 1,
        'batch_size': 100,
        'nb_epoch': 3000,
        'validation_split': 0.1,
        'optimizer': SGD(lr=0.05), # Adam(clipnorm=1e-2),
    },

    'model_params': {
        'n_embed_dims': 100,
        'n_hidden': 200,

        # convolution
        'nb_filters': 500,  # * 4
        'conv_activation': 'tanh',

        # recurrent
        'n_lstm_dims': 141,  # * 2

        'initial_embed_weights': np.load('models/word2vec_100_dim.h5'),
        'similarity_dropout': 0.25,
    },

    'similarity_params': {
        'mode': 'gesd',
        'gamma': 1,
        'c': 1,
        'd': 2,
    }
}

evaluator = Evaluator(conf)

##### Define model ######
model = CNNLSTM(conf)
optimizer = conf.get('training_params', dict()).get('optimizer', 'rmsprop')
model.compile(optimizer=optimizer)

# train the model
best_loss = evaluator.train(model)
evaluator.load_epoch(model, best_loss['epoch'])
evaluator.get_score(model, evaluate_all=True)

class CNNLSTM(LanguageModel):
    def build(self):
        question = self.question
        answer = self.get_answer()

        # add embedding layers
        weights = self.model_params.get('initial_embed_weights', None)
        weights = weights if weights is None else [weights]
        embedding = Embedding(input_dim=self.config['n_words'],
                              output_dim=self.model_params.get('n_embed_dims', 100),
                              weights=weights,
                              # mask_zero=True)
                              mask_zero=False)
        question_embedding = embedding(question)
        answer_embedding = embedding(answer)

        f_rnn = LSTM(self.model_params.get('n_lstm_dims', 141), return_sequences=True, consume_less='mem')
        b_rnn = LSTM(self.model_params.get('n_lstm_dims', 141), return_sequences=True, consume_less='mem')

        qf_rnn = f_rnn(question_embedding)
        qb_rnn = b_rnn(question_embedding)
        question_pool = merge([qf_rnn, qb_rnn], mode='concat', concat_axis=-1)

        af_rnn = f_rnn(answer_embedding)
        ab_rnn = b_rnn(answer_embedding)
        answer_pool = merge([af_rnn, ab_rnn], mode='concat', concat_axis=-1)

        # cnn
        cnns = [Convolution1D(filter_length=filter_length,
                          nb_filter=self.model_params.get('nb_filters', 500),
                          activation=self.model_params.get('conv_activation', 'tanh'),
                          # W_regularizer=regularizers.l1(1e-4),
                          # activity_regularizer=regularizers.activity_l1(1e-4),
                          border_mode='same') for filter_length in [1, 2, 3, 5]]
        question_cnn = merge([cnn(question_pool) for cnn in cnns], mode='concat')
        answer_cnn = merge([cnn(answer_pool) for cnn in cnns], mode='concat')

        maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2]))
        question_pool = maxpool(question_cnn)
        answer_pool = maxpool(answer_cnn)

        return question_pool, answer_pool
codekansas commented 7 years ago

Ended up with

Best: Loss = 0.001460216869, Epoch = 879
2016-08-14 05:58:27 :: ----- test1 -----
[====================]Top-1 Precision: 0.564444
MRR: 0.680506
2016-08-14 06:17:06 :: ----- test2 -----
[====================]Top-1 Precision: 0.543333
MRR: 0.661070
2016-08-14 06:35:26 :: ----- dev -----
[====================]Top-1 Precision: 0.573000
MRR: 0.685989

after training for about 4-5 days on my 980ti. I can see how after enough iterations you could get up to ~60-70%, but my GPU would take way too long...

eshijia commented 7 years ago

Sounds great! I would like to follow your training progress. The duration of one epoch with the CNNLSTM model is 490s for me, so it will take about 17 days to complete 3000 epochs. My GPU device is a Tesla K20c. By the way, I think another important thing is to make the code fit the latest Keras version :)
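
For anyone later updating the CNNLSTM snippet above to the newer Keras 2-style API, here is a hypothetical sketch of the main renames (illustration only, not code from this repo; the thread itself targets Keras 1.x):

from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Conv1D, Lambda, concatenate
from keras.models import Model

question = Input(shape=(150,), dtype='int32')
embedded = Embedding(input_dim=22353, output_dim=100, mask_zero=False)(question)

# was LSTM(..., consume_less='mem'); Keras 2 uses `implementation`
f_rnn = LSTM(141, return_sequences=True, implementation=1)(embedded)
b_rnn = LSTM(141, return_sequences=True, implementation=1)(embedded)

# was merge([...], mode='concat', concat_axis=-1)
rnn_pool = concatenate([f_rnn, b_rnn], axis=-1)

# was Convolution1D(nb_filter=500, filter_length=k, border_mode='same', ...)
convs = [Conv1D(filters=500, kernel_size=k, padding='same', activation='tanh')(rnn_pool)
         for k in [1, 2, 3, 5]]
conv_pool = concatenate(convs, axis=-1)

# unchanged: max over the time axis
max_pool = Lambda(lambda x: K.max(x, axis=1), output_shape=lambda s: (s[0], s[2]))(conv_pool)

encoder = Model(inputs=question, outputs=max_pool)
# and model.fit(..., epochs=1, ...) replaces nb_epoch=1 in the training loop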

wailoktam commented 7 years ago

@eshijia Hi, are 3000 epochs enough? I think you mentioned 20,000,000 batches. I assume you did not touch the default batch size of 128? 3000 x 128 is 384,000. Or do I get the idea of batches wrong?

codekansas commented 7 years ago

17 days seems slow for that GPU? I wonder if it is slow for some reason (maybe it's running on the CPU instead of the GPU?). But 3000 epochs * 16686 samples per epoch is 50,058,000 samples, whereas for the other script it was 20,000,000 batches * 128, or 2,560,000,000 samples. On my GPU (980ti) it will take ~6.4 days to train 3000 epochs; it would take nearly a year to train on as many samples as their model used.
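
As a quick sanity check of those numbers (batch size 128 is carried over from the earlier comments as an assumption):

# back-of-the-envelope comparison of training samples seen
keras_samples = 3000 * 16686          # epochs * samples per epoch = 50,058,000
theano_samples = 20000000 * 128       # batches * batch size       = 2,560,000,000
print(theano_samples / float(keras_samples))  # roughly 51x more samples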

Also, I found a big difference in training when using different optimizers. I think the Adadelta optimizer works well; RMSprop was overfitting a lot.
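
For concreteness, swapping the optimizer only touches the conf dict from the earlier comment; a minimal sketch, assuming the same training_params layout:

from keras.optimizers import Adadelta

conf['training_params']['optimizer'] = Adadelta()  # instead of SGD/RMSprop; default settings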

eshijia commented 7 years ago

It is really running on the GPU. There are four GPU devices (K20c) in my server, and each of them is always running different tasks at the same time. I can see that the GPU-Util of the device used for this task is 96% with the command nvidia-smi. I don't understand why it is running slowly.

eshijia commented 7 years ago

I think my Tesla GPU is just really old; its specs are not up to a 980ti.

eshijia commented 7 years ago

@wailoktam Could you share how you changed the training part to make sure the bad answers are really bad answers?

wailoktam commented 7 years ago

My pleasure.

save_every = self.params.get('save_every', None)
batch_size = self.params.get('batch_size', 128)
nb_epoch = self.params.get('nb_epoch', 10)
split = self.params.get('validation_split', 0)

training_set = self.load('train')

questions = list()
good_answers = list()

for q in training_set:
    questions += [q['question']] * len(q['answers'])
    good_answers += [self.answers[i] for i in q['answers']]

questions = self.padq(questions)
good_answers = self.pada(good_answers)

val_loss = {'loss': 1., 'epoch': 0}

for i in range(1, nb_epoch):
    # draw random "bad" answers, then replace any that happen to equal the good answer
    bad_answers = self.pada(random.sample(self.answers.values(), len(good_answers)))
    fixed_bads = np.empty((0, 100), int)  # 100 = padded answer length

    for (gs, bs) in zip(good_answers, bad_answers):
        print("bad answer shape")
        print(bs.shape)
        print("fixed bad shape")
        print(fixed_bads.shape)

        if not (gs == bs).all():
            fixed_bads = np.append(fixed_bads, [bs], axis=0)
        else:
            # the sampled "bad" answer is actually the good one; resample
            fixed_bads = np.append(fixed_bads, self.pada(random.sample(self.answers.values(), 1)), axis=0)

    bad_answers = fixed_bads

    print('Epoch %d :: ' % i, end='')
    self.print_time()
    hist = model.fit([questions, good_answers, bad_answers], nb_epoch=1, batch_size=batch_size,
                     validation_split=split)

    if hist.history['val_loss'][0] < val_loss['loss']:
        val_loss = {'loss': hist.history['val_loss'][0], 'epoch': i}
    print('Best: Loss = {}, Epoch = {}'.format(val_loss['loss'], val_loss['epoch']))

    if save_every is not None and i % save_every == 0:
        self.save_epoch(model, i)

return val_loss
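
A possible vectorized variant of the same fix (a sketch only, assuming good_answers and bad_answers are 2-D padded int arrays of equal shape, as in the snippet above; not code from the repo):

# replace the per-row loop: find rows where the sampled "bad" answer equals the good one
clash = np.all(good_answers == bad_answers, axis=1)
n_clash = int(clash.sum())
if n_clash > 0:
    # resample just the clashing rows (like the loop above, this does not re-check the resampled rows)
    bad_answers[clash] = self.pada(random.sample(self.answers.values(), n_clash))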
wailoktam commented 7 years ago

I think I can also share the version 2 insurance data and the Japanese wiki data, which I have structured to be used with this great work of codekansas. However, I am running them without pretrained word2vec weights, because the program complains about the vocabulary sizes being different. As you guys can tell, without the pretrained weights it will take even longer to get the claimed 70%.
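
The vocabulary-size complaint usually means the pretrained weight matrix has a different number of rows than conf['n_words']. A minimal sketch of rebuilding the matrix for a new vocabulary (w2v, vocab and build_embedding_weights are hypothetical names, not from the repo):

import numpy as np

def build_embedding_weights(vocab, w2v, n_words, dim=100):
    # vocab: word -> index for the new dataset; w2v: dict-like word -> vector of length `dim`
    weights = np.random.uniform(-0.05, 0.05, (n_words, dim)).astype('float32')
    for word, idx in vocab.items():
        vec = w2v.get(word)
        if vec is not None:
            weights[idx] = vec  # words missing from word2vec keep their random row
    return weights

# conf['model_params']['initial_embed_weights'] = build_embedding_weights(vocab, w2v, conf['n_words'])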

eshijia commented 7 years ago

I have tried training the CNNLSTM model for about 3000 epochs, and the loss is stable at about 0.0013. The test results are just the same as @codekansas mentioned above.

2016-08-25 02:48:36 :: ----- test1 -----
[====================]Top-1 Precision: 0.571667
MRR: 0.684311
2016-08-25 03:25:51 :: ----- test2 -----
[====================]Top-1 Precision: 0.543333
MRR: 0.660048
2016-08-25 04:01:48 :: ----- dev -----
[====================]Top-1 Precision: 0.564000
MRR: 0.682626