allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

performance of AllenNLP TE model #379

Closed. JunhaoZhang1992 closed this issue 7 years ago.

JunhaoZhang1992 commented 7 years ago

Hi, I have tried to train the TE model with experiment_config/decomposable_attention.json, adjusting the number of epochs to 3000, but the best validation accuracy I got was 72.8% (with a training accuracy of 67.2%) at epoch 38. How can I get an accuracy above 80%?

matt-gardner commented 7 years ago

@DeNeutoy, is that config file up to date with what you used to train the model?

DeNeutoy commented 7 years ago

Yes, it is up to date. @JunhaoZhang1992 how are you running allennlp? I have confirmed that running the docker image replicates the results. Have you changed the data that you are running the model on?

JunhaoZhang1992 commented 7 years ago

@DeNeutoy, I installed allennlp in a Conda development environment, which gives reasonable performance on simple_tagger.

I downloaded the SNLI dataset from https://nlp.stanford.edu/projects/snli/snli_1.0.zip, with no further processing of the text. snli_1.0_train.jsonl is 465M and snli_1.0_dev.jsonl is 9.3M.

For the pretrained word embeddings, glove.6B.300d.txt.gz (377M) is used, as specified in the config file; it is the gzipped version of glove.6B.300d.txt from glove.6B.zip.

I only just noticed the update to the decomposable attention config (#272). The config file I used was the earlier version, where I only changed num_epochs and patience to very large values.

The command I ran was 'python -m allennlp.run train decomposable_attention.json --serialization_dir ./SNLI_RES'.

DeNeutoy commented 7 years ago

Ah yes, the updated version is critical - without the good initialisation, I also struggled to get it to learn anything. Let me know if you still can't get it to work with the config which is in the current master.
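
For reference, the initializer block in the current master config looks like this (it is the same block that appears in the full config pasted later in this thread):

"initializer": [
    [".*linear_layers.*weight", {"type": "xavier_normal"}],
    [".*token_embedder_tokens\._projection.*weight", {"type": "xavier_normal"}]
]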

kellywzhang commented 7 years ago

I am running into the same problem: the performance numbers I'm getting are not what AllenNLP reports. What should I do to fix that?

kellywzhang commented 7 years ago

Actually, I realized that the initialization and number of epochs are correct in my config file.

kellywzhang commented 7 years ago

I only got a validation accuracy of 0.79 after 140 epochs. I'm cloning the repo again and running the default config file just to double check that it's not due to any changes I made in the repo.

kellywzhang commented 7 years ago

After cloning the most recent version of the allennlp repo and running the default configurations in experiment_config:

For BiDAF on SQuAD I got: Validation EM: 0.674078, Validation F1: 0.768223

For Decomposable Attention on SNLI I got (technically it hasn't completed yet, but it is on epoch 113 out of a maximum of 139 in the config file): Validation accuracy: 0.786934

I'm using a Docker image for AllenNLP and the 6B GloVe vectors. The performance seems lower than I'd expect given what AllenNLP reports on the website, especially for the decomposable attention model. Do you think it could be a problem on my side, or are there potentially extra configuration parameters included in your runs?

matt-gardner commented 7 years ago

It is well understood that training neural nets, especially deep neural nets, has some variance that is due only to the random seed they were trained with. That variance can sometimes be surprisingly large. See, e.g., this paper for some recent discussion on that topic.

AllenNLP provides reference implementations of published models, where we do our best to mimic the computation performed by the reported model, and a set of trained weights that come as close as we can manage to the originally-reported performance. However, the originally-reported performance very likely just took a max over a large number of samples from the training distribution, especially if the paper came from Google*, like the decomposable attention model did. We also took the max over the training runs we did when picking a model to put on the website. I haven't characterized the variance of the decomposable attention model yet (@DeNeutoy might have better insight there), but I know that BiDAF can get anywhere from low 67 EM to high 68 EM, almost two full points difference, depending on the random seed.

Anyone doing research with neural nets needs to be familiar with these issues, and should not expect to be able to reproduce the maximum performance that we saw in our training samples on their first training run, even with identical code and configuration (and, because of CUDA optimizations, not even with the same CPU/GPU architecture and random seed!**). That's one of the reasons why we provide the max from the samples we've taken, so you don't have to spend your GPU time reproducing that result. We do need to do a better job explaining this on our website, though, and it's on our list of things to do. We are just a very small team, and we haven't gotten to it yet.

* No offense intended to Google, they just have a whole lot of compute available and do very large grid searches over hyper-parameters. If the variance due to the random seed is larger than the effect the hyper-parameters have on performance, you're effectively just increasing the number of samples from your training distribution.

** Interestingly, we have tests that ensure that we get the same predictions from a model directly after training it as we do when we save and then load the model. We have to set a surprisingly high tolerance on the floating point comparisons in that test, and, when we switched to using torch.bmm for some of our computations, which is more highly optimized by the underlying tensor library than what we were doing before, we had to mark the tests as flaky, because they fail so often. And this is the same model with identical weights!

kellywzhang commented 7 years ago

Thanks for the detailed response! I understand that there is variance in performance with different random seeds.

I think the BiDAF performance has no real issues replicating the original paper. I do feel, though, that the performance I'm getting for the decomposable attention model is significantly lower than the paper (I got a validation accuracy of 0.786934, while the test accuracy reported in the paper is around 0.86). This gap seems larger than one would expect from randomness alone. If you have any ideas as to how to improve the performance of this model, please let me know!

DeNeutoy commented 7 years ago

@kellywzhang Yes, the decomposable attention model should be more performant. Can you give me more details?

kellywzhang commented 7 years ago
2017-10-14 16:45:36,304 - INFO - allennlp.common.params - dataset_reader.type = snli
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.type = word
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_splitter.type = spacy
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_splitter.language = en
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_splitter.pos_tags = False
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_splitter.parse = False
2017-10-14 16:45:36,305 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_splitter.ner = False
2017-10-14 16:45:38,513 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_filter.type = pass_through
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.tokenizer.word_stemmer.type = pass_through
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.tokenizer.start_tokens = None
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.tokenizer.end_tokens = ['@@NULL@@']
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.type = single_id
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.namespace = tokens
2017-10-14 16:45:38,514 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.lowercase_tokens = True
2017-10-14 16:45:38,515 - INFO - allennlp.common.params - train_data_path = /scratch/k/Data/snli/snli_1.0/snli_1.0_train.jsonl
2017-10-14 16:45:38,515 - INFO - allennlp.commands.train - Reading training data from /scratch/k/Data/snli/snli_1.0/snli_1.0_train.jsonl
2017-10-14 16:45:38,517 - INFO - allennlp.data.dataset_readers.snli - Reading SNLI instances from jsonl dataset at: /scratch/k/Data/snli/snli_1.0/snli_1.0_train.jsonl
2017-10-14 16:47:03,432 - INFO - allennlp.common.params - validation_data_path = /scratch/k/Data/snli/snli_1.0/snli_1.0_dev.jsonl
2017-10-14 16:47:03,433 - INFO - allennlp.commands.train - Reading validation data from /scratch/k/Data/snli/snli_1.0/snli_1.0_dev.jsonl
2017-10-14 16:47:03,435 - INFO - allennlp.data.dataset_readers.snli - Reading SNLI instances from jsonl dataset at: /scratch/k/Data/snli/snli_1.0/snli_1.0_dev.jsonl
2017-10-14 16:47:04,752 - INFO - allennlp.common.params - test_data_path = /scratch/k/Data/snli/snli_1.0/snli_1.0_test.jsonl
2017-10-14 16:47:04,752 - INFO - allennlp.commands.train - Reading test data from /scratch/k/Data/snli/snli_1.0/snli_1.0_test.jsonl
2017-10-14 16:47:04,754 - INFO - allennlp.data.dataset_readers.snli - Reading SNLI instances from jsonl dataset at: /scratch/k/Data/snli/snli_1.0/snli_1.0_test.jsonl
2017-10-14 16:47:06,058 - INFO - allennlp.commands.train - Creating a vocabulary using train, validation, test data.
2017-10-14 16:47:06,908 - INFO - allennlp.common.params - vocabulary.directory_path = None
2017-10-14 16:47:06,908 - INFO - allennlp.common.params - vocabulary.min_count = 1
2017-10-14 16:47:06,908 - INFO - allennlp.common.params - vocabulary.max_vocab_size = None
2017-10-14 16:47:06,909 - INFO - allennlp.common.params - vocabulary.non_padded_namespaces = ('*tags', '*labels')
2017-10-14 16:47:06,909 - INFO - allennlp.data.vocabulary - Fitting token dictionary from dataset.
2017-10-14 16:47:30,023 - INFO - allennlp.common.params - model.type = decomposable_attention
2017-10-14 16:47:30,024 - INFO - allennlp.common.params - model.text_field_embedder.type = basic
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.type = embedding
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.num_embeddings = None
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.vocab_namespace = tokens
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.embedding_dim = 300
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.pretrained_file = /scratch/k/Data/glove.6B/glove.6B.300d.txt.gz
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.projection_dim = 200
2017-10-14 16:47:30,025 - INFO - allennlp.common.params - model.text_field_embedder.tokens.trainable = False
2017-10-14 16:47:30,026 - INFO - allennlp.common.params - model.text_field_embedder.tokens.padding_index = None
2017-10-14 16:47:30,026 - INFO - allennlp.common.params - model.text_field_embedder.tokens.max_norm = None
2017-10-14 16:47:30,026 - INFO - allennlp.common.params - model.text_field_embedder.tokens.norm_type = 2.0
2017-10-14 16:47:30,026 - INFO - allennlp.common.params - model.text_field_embedder.tokens.scale_grad_by_freq = False
2017-10-14 16:47:30,026 - INFO - allennlp.common.params - model.text_field_embedder.tokens.sparse = False
2017-10-14 16:47:30,028 - INFO - allennlp.modules.token_embedders.embedding - Reading embeddings from file
2017-10-14 16:47:49,396 - INFO - allennlp.modules.token_embedders.embedding - Initializing pre-trained embedding layer
2017-10-14 16:47:51,391 - INFO - allennlp.common.params - model.premise_encoder = None
2017-10-14 16:47:51,391 - INFO - allennlp.common.params - model.hypothesis_encoder = None
2017-10-14 16:47:51,392 - INFO - allennlp.common.params - model.attend_feedforward.input_dim = 200
2017-10-14 16:47:51,392 - INFO - allennlp.common.params - model.attend_feedforward.num_layers = 2
2017-10-14 16:47:51,392 - INFO - allennlp.common.params - model.attend_feedforward.hidden_dims = 200
2017-10-14 16:47:51,392 - INFO - allennlp.common.params - model.attend_feedforward.activations = relu
2017-10-14 16:47:51,392 - INFO - allennlp.common.params - model.attend_feedforward.dropout = 0.2
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.similarity_function.type = dot_product
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.similarity_function.scale_output = False
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.compare_feedforward.input_dim = 400
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.compare_feedforward.num_layers = 2
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.compare_feedforward.hidden_dims = 200
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.compare_feedforward.activations = relu
2017-10-14 16:47:51,394 - INFO - allennlp.common.params - model.compare_feedforward.dropout = 0.2
2017-10-14 16:47:51,396 - INFO - allennlp.common.params - model.aggregate_feedforward.input_dim = 400
2017-10-14 16:47:51,396 - INFO - allennlp.common.params - model.aggregate_feedforward.num_layers = 2
2017-10-14 16:47:51,396 - INFO - allennlp.common.params - model.aggregate_feedforward.hidden_dims = [200, 3]
2017-10-14 16:47:51,396 - INFO - allennlp.common.params - model.aggregate_feedforward.activations = ['relu', 'linear']
2017-10-14 16:47:51,397 - INFO - allennlp.common.params - model.aggregate_feedforward.dropout = [0.2, 0.0]
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - model.initializer = [['.*linear_layers.*weight', ConfigTree([('type', 'xavier_normal')])], ['.*token_embedder_tokens\\._projection.*weight', ConfigTree([('type', 'xavier_normal')])]]
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - model.regularizer = None
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_normal
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2017-10-14 16:47:51,398 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_normal
2017-10-14 16:47:51,399 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2017-10-14 16:47:51,399 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2017-10-14 16:47:51,399 - INFO - allennlp.nn.initializers - Initializing parameters
2017-10-14 16:47:51,400 - INFO - allennlp.nn.initializers - Initializing _text_field_embedder.token_embedder_tokens._projection.weight using .*token_embedder_tokens\._projection.*weight intitializer
2017-10-14 16:47:51,404 - INFO - allennlp.nn.initializers - Initializing _attend_feedforward._module._linear_layers.0.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,407 - INFO - allennlp.nn.initializers - Initializing _attend_feedforward._module._linear_layers.1.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,410 - INFO - allennlp.nn.initializers - Initializing _compare_feedforward._module._linear_layers.0.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,416 - INFO - allennlp.nn.initializers - Initializing _compare_feedforward._module._linear_layers.1.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,419 - INFO - allennlp.nn.initializers - Initializing _aggregate_feedforward._linear_layers.0.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,424 - INFO - allennlp.nn.initializers - Initializing _aggregate_feedforward._linear_layers.1.weight using .*linear_layers.*weight intitializer
2017-10-14 16:47:51,424 - INFO - allennlp.nn.initializers - Done initializing parameters; the following parameters are using their default initialization from their code
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _aggregate_feedforward._linear_layers.0.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _aggregate_feedforward._linear_layers.1.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _attend_feedforward._module._linear_layers.0.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _attend_feedforward._module._linear_layers.1.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _compare_feedforward._module._linear_layers.0.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _compare_feedforward._module._linear_layers.1.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens._projection.bias
2017-10-14 16:47:51,425 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.weight
2017-10-14 16:47:51,426 - INFO - allennlp.common.params - iterator.type = bucket
2017-10-14 16:47:51,426 - INFO - allennlp.common.params - iterator.sorting_keys = [['premise', 'num_tokens'], ['hypothesis', 'num_tokens']]
2017-10-14 16:47:51,426 - INFO - allennlp.common.params - iterator.padding_noise = 0.1
2017-10-14 16:47:51,426 - INFO - allennlp.common.params - iterator.biggest_batch_first = False
2017-10-14 16:47:51,426 - INFO - allennlp.common.params - iterator.batch_size = 32
2017-10-14 16:47:51,426 - INFO - allennlp.data.dataset - Indexing dataset
2017-10-14 16:48:24,849 - INFO - allennlp.data.dataset - Indexing dataset
2017-10-14 16:48:25,447 - INFO - allennlp.common.params - trainer.patience = 20
2017-10-14 16:48:25,447 - INFO - allennlp.common.params - trainer.validation_metric = +accuracy
2017-10-14 16:48:25,447 - INFO - allennlp.common.params - trainer.num_epochs = 140
2017-10-14 16:48:25,447 - INFO - allennlp.common.params - trainer.cuda_device = 0
2017-10-14 16:48:25,447 - INFO - allennlp.common.params - trainer.grad_norm = None
2017-10-14 16:48:25,448 - INFO - allennlp.common.params - trainer.grad_clipping = None
2017-10-14 16:48:25,448 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None
2017-10-14 16:48:27,952 - INFO - allennlp.common.params - trainer.optimizer.type = adagrad
2017-10-14 16:48:27,953 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2017-10-14 16:48:27,953 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2017-10-14 16:48:27,954 - INFO - allennlp.common.params - trainer.no_tqdm = True
2017-10-14 16:48:27,960 - INFO - allennlp.common.params - evaluate_on_test = False
2017-10-14 16:48:27,962 - INFO - allennlp.training.trainer - Beginning training.
2017-10-14 16:48:27,962 - INFO - allennlp.training.trainer - Epoch 0/139
2017-10-14 16:48:27,962 - INFO - allennlp.training.trainer - Training
2017-10-14 16:48:48,950 - INFO - allennlp.training.trainer - Batch 1/17168: accuracy: 0.25, loss: 6.66 ||
2017-10-14 16:48:58,960 - INFO - allennlp.training.trainer - Batch 567/17168: accuracy: 0.42, loss: 1.32 ||
2017-10-14 16:49:08,969 - INFO - allennlp.training.trainer - Batch 1128/17168: accuracy: 0.46, loss: 1.17 ||
2017-10-14 16:49:18,982 - INFO - allennlp.training.trainer - Batch 1701/17168: accuracy: 0.49, loss: 1.09 ||
2017-10-14 16:49:28,988 - INFO - allennlp.training.trainer - Batch 2271/17168: accuracy: 0.51, loss: 1.05 ||
2017-10-14 16:49:39,003 - INFO - allennlp.training.trainer - Batch 2834/17168: accuracy: 0.52, loss: 1.02 ||
2017-10-14 16:49:49,010 - INFO - allennlp.training.trainer - Batch 3407/17168: accuracy: 0.54, loss: 1.00 ||
2017-10-14 16:49:59,022 - INFO - allennlp.training.trainer - Batch 3975/17168: accuracy: 0.54, loss: 0.98 ||
2017-10-14 16:50:09,038 - INFO - allennlp.training.trainer - Batch 4551/17168: accuracy: 0.55, loss: 0.97 ||
2017-10-14 16:50:19,041 - INFO - allennlp.training.trainer - Batch 5123/17168: accuracy: 0.56, loss: 0.96 ||
2017-10-14 16:50:29,051 - INFO - allennlp.training.trainer - Batch 5698/17168: accuracy: 0.56, loss: 0.95 ||
2017-10-14 16:50:39,054 - INFO - allennlp.training.trainer - Batch 6263/17168: accuracy: 0.57, loss: 0.94 ||
2017-10-14 16:50:49,069 - INFO - allennlp.training.trainer - Batch 6835/17168: accuracy: 0.57, loss: 0.93 ||
2017-10-14 16:50:59,079 - INFO - allennlp.training.trainer - Batch 7409/17168: accuracy: 0.57, loss: 0.93 ||
2017-10-14 16:51:09,087 - INFO - allennlp.training.trainer - Batch 7988/17168: accuracy: 0.57, loss: 0.92 ||
2017-10-14 16:51:19,090 - INFO - allennlp.training.trainer - Batch 8563/17168: accuracy: 0.58, loss: 0.92 ||
2017-10-14 16:51:29,099 - INFO - allennlp.training.trainer - Batch 9138/17168: accuracy: 0.58, loss: 0.91 ||
2017-10-14 16:51:39,104 - INFO - allennlp.training.trainer - Batch 9705/17168: accuracy: 0.58, loss: 0.91 ||
2017-10-14 16:51:49,115 - INFO - allennlp.training.trainer - Batch 10287/17168: accuracy: 0.58, loss: 0.90 ||
2017-10-14 16:51:59,123 - INFO - allennlp.training.trainer - Batch 10863/17168: accuracy: 0.59, loss: 0.90 ||
2017-10-14 16:52:09,135 - INFO - allennlp.training.trainer - Batch 11431/17168: accuracy: 0.59, loss: 0.90 ||
2017-10-14 16:52:19,153 - INFO - allennlp.training.trainer - Batch 12003/17168: accuracy: 0.59, loss: 0.89 ||
2017-10-14 16:52:29,169 - INFO - allennlp.training.trainer - Batch 12579/17168: accuracy: 0.59, loss: 0.89 ||
2017-10-14 16:52:39,171 - INFO - allennlp.training.trainer - Batch 13140/17168: accuracy: 0.59, loss: 0.89 ||
2017-10-14 16:52:49,179 - INFO - allennlp.training.trainer - Batch 13709/17168: accuracy: 0.59, loss: 0.89 ||
2017-10-14 16:52:59,181 - INFO - allennlp.training.trainer - Batch 14287/17168: accuracy: 0.59, loss: 0.88 ||
2017-10-14 16:53:09,191 - INFO - allennlp.training.trainer - Batch 14862/17168: accuracy: 0.60, loss: 0.88 ||
2017-10-14 16:53:19,194 - INFO - allennlp.training.trainer - Batch 15435/17168: accuracy: 0.60, loss: 0.88 ||
2017-10-14 16:53:29,202 - INFO - allennlp.training.trainer - Batch 16006/17168: accuracy: 0.60, loss: 0.88 ||
2017-10-14 16:53:39,205 - INFO - allennlp.training.trainer - Batch 16576/17168: accuracy: 0.60, loss: 0.87 ||
2017-10-14 16:53:49,213 - INFO - allennlp.training.trainer - Batch 17150/17168: accuracy: 0.60, loss: 0.87 ||
2017-10-14 16:53:49,547 - INFO - allennlp.training.trainer - Validating
2017-10-14 16:53:52,087 - INFO - allennlp.training.trainer - Best validation performance so far. Copying weights to /scratch/k/GloVe/snli/baseline_vocab3/best.th'.
2017-10-14 16:53:52,334 - INFO - allennlp.training.trainer - Training accuracy : 0.600775    Validation accuracy : 0.658200
2017-10-14 16:53:52,335 - INFO - allennlp.training.trainer - Training loss : 0.872066    Validation loss : 0.780600
2017-10-14 16:53:52,335 - INFO - allennlp.training.trainer - Epoch 1/139
2017-10-14 16:53:52,336 - INFO - allennlp.training.trainer - Training
...
...
...
2017-10-15 05:06:09,441 - INFO - allennlp.training.trainer - Epoch 139/139
2017-10-15 05:06:09,442 - INFO - allennlp.training.trainer - Training
2017-10-15 05:06:24,082 - INFO - allennlp.training.trainer - Batch 1/17168: accuracy: 0.88, loss: 0.45 ||
2017-10-15 05:06:34,095 - INFO - allennlp.training.trainer - Batch 565/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:06:44,107 - INFO - allennlp.training.trainer - Batch 1136/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:06:54,126 - INFO - allennlp.training.trainer - Batch 1708/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:07:04,143 - INFO - allennlp.training.trainer - Batch 2281/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:07:14,165 - INFO - allennlp.training.trainer - Batch 2854/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:07:24,170 - INFO - allennlp.training.trainer - Batch 3431/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:07:34,178 - INFO - allennlp.training.trainer - Batch 4000/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:07:44,187 - INFO - allennlp.training.trainer - Batch 4567/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:07:54,189 - INFO - allennlp.training.trainer - Batch 5140/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:08:04,195 - INFO - allennlp.training.trainer - Batch 5708/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:08:14,198 - INFO - allennlp.training.trainer - Batch 6279/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:08:24,210 - INFO - allennlp.training.trainer - Batch 6850/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:08:34,219 - INFO - allennlp.training.trainer - Batch 7427/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:08:44,221 - INFO - allennlp.training.trainer - Batch 8007/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:08:54,236 - INFO - allennlp.training.trainer - Batch 8584/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:09:04,239 - INFO - allennlp.training.trainer - Batch 9153/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:09:14,248 - INFO - allennlp.training.trainer - Batch 9729/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:09:24,259 - INFO - allennlp.training.trainer - Batch 10303/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:09:34,261 - INFO - allennlp.training.trainer - Batch 10876/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:09:44,283 - INFO - allennlp.training.trainer - Batch 11448/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:09:54,299 - INFO - allennlp.training.trainer - Batch 12018/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:10:04,312 - INFO - allennlp.training.trainer - Batch 12589/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:10:14,328 - INFO - allennlp.training.trainer - Batch 13159/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:10:24,337 - INFO - allennlp.training.trainer - Batch 13733/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:10:34,347 - INFO - allennlp.training.trainer - Batch 14303/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:10:44,348 - INFO - allennlp.training.trainer - Batch 14879/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:10:54,358 - INFO - allennlp.training.trainer - Batch 15457/17168: accuracy: 0.77, loss: 0.57 ||
2017-10-15 05:11:04,368 - INFO - allennlp.training.trainer - Batch 16028/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:11:14,386 - INFO - allennlp.training.trainer - Batch 16603/17168: accuracy: 0.77, loss: 0.56 ||
2017-10-15 05:11:24,270 - INFO - allennlp.training.trainer - Validating
2017-10-15 05:11:24,528 - INFO - allennlp.training.trainer - Batch 1/308: accuracy: 0.78, loss: 0.52 ||
2017-10-15 05:11:26,935 - INFO - allennlp.training.trainer - Training accuracy : 0.770621    Validation accuracy : 0.786730
2017-10-15 05:11:26,936 - INFO - allennlp.training.trainer - Training loss : 0.564843    Validation loss : 0.532112
2017-10-15 05:11:26,937 - INFO - allennlp.models.archival - archiving weights and vocabulary to /scratch/k/GloVe/snli/baseline_vocab3/model.tar.gz
2017-10-15 05:11:29,549 - INFO - allennlp.commands.train - To evaluate on the test set after training, pass the 'evaluate_on_test' flag, or use the 'allennlp evaluate' command.
kellywzhang commented 7 years ago

For reference, here is the config file I used:

{
  "dataset_reader": {
    "type": "snli",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      }
    },
    "tokenizer": {
      "end_tokens": ["@@NULL@@"]
    }
  },
  "train_data_path": "/scratch/k/Data/snli/snli_1.0/snli_1.0_train.jsonl",
  "validation_data_path": "/scratch/k/Data/snli/snli_1.0/snli_1.0_dev.jsonl",
  "test_data_path": "/scratch/k/Data/snli/snli_1.0/snli_1.0_test.jsonl",
  "model": {
    "type": "decomposable_attention",
    "text_field_embedder": {
      "tokens": {
        "type": "embedding",
        "projection_dim": 200,
        "pretrained_file": "/scratch/k/Data/glove.6B/glove.6B.300d.txt.gz",
        "embedding_dim": 300,
        "trainable": false
      }
    },
    "attend_feedforward": {
      "input_dim": 200,
      "num_layers": 2,
      "hidden_dims": 200,
      "activations": "relu",
      "dropout": 0.2
    },
    "similarity_function": {"type": "dot_product"},
    "compare_feedforward": {
      "input_dim": 400,
      "num_layers": 2,
      "hidden_dims": 200,
      "activations": "relu",
      "dropout": 0.2
    },
"aggregate_feedforward": {
      "input_dim": 400,
      "num_layers": 2,
      "hidden_dims": [200, 3],
      "activations": ["relu", "linear"],
      "dropout": [0.2, 0.0]
    },
     "initializer": [
      [".*linear_layers.*weight", {"type": "xavier_normal"}],
      [".*token_embedder_tokens\._projection.*weight", {"type": "xavier_normal"}]
     ]
   },
  "iterator": {
    "type": "bucket",
    "sorting_keys": [["premise", "num_tokens"], ["hypothesis", "num_tokens"]],
    "batch_size": 32
  },
  "trainer": {
    "num_epochs": 140,
    "patience": 20,
    "cuda_device": 0,
    "validation_metric": "+accuracy",
    "no_tqdm": true,
    "optimizer": {
      "type": "adagrad"
    }
  }
}
DeNeutoy commented 7 years ago

Ah. When I compared the config files, I missed the following two points: "grad_clipping": 5.0 in the trainer params, and "batch_size": 64 in the iterator. Sorry, I hope that hasn't eaten up too much of your time. I posted here in case you need to train it yourself, but I'll triple-check that this solves the problems you were having by making a PR for this change after training a model tomorrow.
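
With those two changes applied to the config posted above, the iterator and trainer sections would look roughly like this (everything else unchanged):

"iterator": {
    "type": "bucket",
    "sorting_keys": [["premise", "num_tokens"], ["hypothesis", "num_tokens"]],
    "batch_size": 64
},
"trainer": {
    "num_epochs": 140,
    "patience": 20,
    "grad_clipping": 5.0,
    "cuda_device": 0,
    "validation_metric": "+accuracy",
    "no_tqdm": true,
    "optimizer": {
        "type": "adagrad"
    }
}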

kellywzhang commented 7 years ago

Ah! Thank you. Those changes seem to be fixing the performance problems!

kellywzhang commented 7 years ago

Hello again! I've been training a semantic role labeling model with your parameters, and so far I'm at epoch 23 (out of 500) after 20 hours of training on a p1080 GPU. Is it typical for training to take this long?

DeNeutoy commented 7 years ago

Hi Kelly - the model we released is very slow. We have since sped it up, but it's not yet particularly user friendly.

You can speed up training of the AllenNLP model (a 6-7x speedup) by replacing the LSTM in the config with this one: https://github.com/allenai/allennlp/blob/master/allennlp/modules/alternating_highway_lstm.py#L116, which uses a custom kernel to implement the interleaved LSTMs.

The config section you need for this looks like:

"stacked_encoder": {
    "type": "alternating_highway_lstm_cuda",
    "input_size": 200,
    "hidden_size": 300,
    "num_layers": 8,
    "recurrent_dropout_probability": 0.1
}

However, if you use this, please bear in mind:

Additionally, have you found it difficult to use our pretrained models? We released them precisely so researchers don't have to train them from scratch. If there was some sticking point, let us know.

schmmd commented 7 years ago

@kellywzhang I'm closing this issue as the original problem (performance of AllenNLP TE model) seems to be solved. Feel free to open additional issues for other problems.

Also, we recently updated the website with training commands at http://allennlp.org/models, although it sounds like you've already figured this out!