abdkiwan opened this issue 3 years ago
I haven't seen this error; can you post a full traceback?
Hello,
I could partially solve the problem by increasing the open-file limit with `ulimit -n NEW_NUMBER_OF_FILES`. After doing this, training ran for a few hours before crashing with another error. Here is the full traceback:
```
Traceback (most recent call last):
  File "/home/IAIS/akiwan/relation-extraction/kb/kb/multitask.py", line 135, in __call__
    batch = next(generators[index])
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/IAIS/akiwan/anaconda3/envs/knowbert/bin/allennlp", line 8, in <module>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/IAIS/akiwan/anaconda3/envs/knowbert/bin/allennlp", line 8, in <module>
```
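For context, the `StopIteration` at the top of the chain is simply what `next()` raises once a generator is exhausted; the multitask iterator pulls batches with `next(generators[index])`, so the exception surfaces when one task's batch generator runs dry. A minimal reproduction (the batch contents here are just placeholders):

```python
def batch_generator():
    # Stand-in for one task's batch stream: yields two batches, then ends.
    yield {"batch": 1}
    yield {"batch": 2}

gen = batch_generator()
next(gen)
next(gen)
try:
    next(gen)  # generator exhausted -> StopIteration
except StopIteration:
    print("generator exhausted")
```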
Can you post your config? Are you using the `multitask_iterator` with only a single dataset / task? If you have only a single dataset, then you can use a standard iterator without the multitask wrapper.
Hello Peter,
Here is the configuration file (DatasetReader & Iterator):
```json
"dataset_reader": {
    "type": "multitask_reader",
    "datasets_for_vocab_creation": [],
    "dataset_readers": {
        "language_modeling": {
            "type": "multiprocess",
            "base_reader": {
                "type": "bert_pre_training",
                "tokenizer_and_candidate_generator": {
                    "type": "bert_tokenizer_and_candidate_generator",
                    "entity_candidate_generators": {
                        "wiki": {"type": "wiki"}
                    },
                    "entity_indexers": {
                        "wiki": {
                            "type": "characters_tokenizer",
                            "tokenizer": {
                                "type": "word",
                                "word_splitter": {"type": "just_spaces"}
                            },
                            "namespace": "entity"
                        }
                    },
                    "bert_model_type": "bert-base-uncased",
                    "do_lower_case": true
                },
                "lazy": true,
                "mask_candidate_strategy": "full_mask"
            },
            "num_workers": 8
        }
    }
},
```
```json
"iterator": {
    "type": "multitask_iterator",
    "names_to_index": ["language_modeling"],
    "iterate_forever": true,
    "sampling_rates": [1],
    "iterators": {
        "language_modeling": {
            "type": "multiprocess",
            "base_iterator": {
                "type": "self_attn_bucket",
                "batch_size_schedule": "base-24gb-fp32",
                "iterator": {
                    "type": "bucket",
                    "batch_size": 8,
                    "sorting_keys": [["tokens", "num_tokens"]],
                    "max_instances_in_memory": 2500
                }
            },
            "num_workers": 8
        }
    }
},
```
It isn't necessary to use the `multitask_iterator` with only one task. Try replacing the `iterator` section with this:
```json
"iterator": {
    "type": "multiprocess",
    "base_iterator": {
        "type": "self_attn_bucket",
        "batch_size_schedule": "base-24gb-fp32",
        "iterator": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": [["tokens", "num_tokens"]],
            "max_instances_in_memory": 2500
        }
    },
    "num_workers": 8
},
```
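One thing worth noting: the old multitask config set `iterate_forever: true`, which this replacement drops, so each epoch now ends as soon as the underlying reader is exhausted. At the plain Python level, the difference is the one between a finite generator and an endlessly replayed one, which can be illustrated with `itertools.cycle` (an illustrative sketch of the concept, not the allennlp API; `cycle` caches one full pass in memory):

```python
import itertools

def finite_batches():
    # Stand-in for a data iterator that ends after one pass over the corpus.
    for i in range(3):
        yield {"batch": i}

# itertools.cycle replays the stream indefinitely after the first pass.
forever = itertools.cycle(finite_batches())
first_six = [b["batch"] for b in itertools.islice(forever, 6)]
print(first_six)  # -> [0, 1, 2, 0, 1, 2]
```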
I did exactly what you suggested. However, the model didn't actually train at all; the run finished almost immediately. Here are the info messages:
```
2020-10-15 01:30:33,083 - INFO - allennlp.training.trainer - Beginning training.
2020-10-15 01:30:33,083 - INFO - allennlp.training.trainer - Epoch 0/0
2020-10-15 01:30:33,083 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 8793.632
2020-10-15 01:30:33,499 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 11
2020-10-15 01:30:33,499 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1146
2020-10-15 01:30:33,500 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11841
2020-10-15 01:30:33,500 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11
2020-10-15 01:30:33,506 - INFO - allennlp.training.trainer - Training
  0%| | 0/1 [00:00<?, ?it/s]
  0%| | 0/1 [00:00<?, ?it/s]
2020-10-15 01:30:34,213 - INFO - allennlp.training.tensorboard_writer - Training | Validation
2020-10-15 01:30:34,214 - INFO - allennlp.training.tensorboard_writer - gpu_0_memory_MB | 11.000 | N/A
2020-10-15 01:30:34,215 - INFO - allennlp.training.tensorboard_writer - wiki_el_precision | 0.000 | N/A
2020-10-15 01:30:34,216 - INFO - allennlp.training.tensorboard_writer - cpu_memory_MB | 8793.632 | N/A
2020-10-15 01:30:34,217 - INFO - allennlp.training.tensorboard_writer - nsp_loss_ema | 0.000 | N/A
2020-10-15 01:30:34,217 - INFO - allennlp.training.tensorboard_writer - lm_loss_wgt | 0.000 | N/A
2020-10-15 01:30:34,218 - INFO - allennlp.training.tensorboard_writer - wiki_el_f1 | 0.000 | N/A
2020-10-15 01:30:34,219 - INFO - allennlp.training.tensorboard_writer - wiki_span_f1 | 0.000 | N/A
2020-10-15 01:30:34,219 - INFO - allennlp.training.tensorboard_writer - gpu_1_memory_MB | 1146.000 | N/A
2020-10-15 01:30:34,220 - INFO - allennlp.training.tensorboard_writer - nsp_loss | 0.000 | N/A
2020-10-15 01:30:34,221 - INFO - allennlp.training.tensorboard_writer - gpu_2_memory_MB | 11841.000 | N/A
2020-10-15 01:30:34,221 - INFO - allennlp.training.tensorboard_writer - total_loss | 0.000 | N/A
2020-10-15 01:30:34,222 - INFO - allennlp.training.tensorboard_writer - lm_loss_ema | 0.000 | N/A
2020-10-15 01:30:34,223 - INFO - allennlp.training.tensorboard_writer - wiki_span_precision | 0.000 | N/A
2020-10-15 01:30:34,223 - INFO - allennlp.training.tensorboard_writer - lm_loss | 0.000 | N/A
2020-10-15 01:30:34,223 - INFO - allennlp.training.tensorboard_writer - wiki_span_recall | 0.000 | N/A
2020-10-15 01:30:34,224 - INFO - allennlp.training.tensorboard_writer - nsp_accuracy | 0.000 | N/A
2020-10-15 01:30:34,224 - INFO - allennlp.training.tensorboard_writer - gpu_3_memory_MB | 11.000 | N/A
2020-10-15 01:30:34,225 - INFO - allennlp.training.tensorboard_writer - total_loss_ema | 0.000 | N/A
2020-10-15 01:30:34,225 - INFO - allennlp.training.tensorboard_writer - mrr | 0.000 | N/A
2020-10-15 01:30:34,226 - INFO - allennlp.training.tensorboard_writer - loss | 0.000 | N/A
2020-10-15 01:30:34,226 - INFO - allennlp.training.tensorboard_writer - wiki_el_recall | 0.000 | N/A
2020-10-15 01:30:39,877 - INFO - allennlp.training.checkpointer - Best validation performance so far. Copying weights to 'knowbert_bert_rxnorm_lm_corpus_50_epochs_1/best.th'.
2020-10-15 01:30:45,483 - INFO - allennlp.training.trainer - Epoch duration: 00:00:12
2020-10-15 01:30:45,484 - INFO - allennlp.training.checkpointer - loading best weights
2020-10-15 01:30:45,830 - INFO - allennlp.models.archival - archiving weights and vocabulary to knowbert_bert_rxnorm_lm_corpus_50_epochs_1/model.tar.gz
```
Hello,
Could you please tell me the reason for this error and how to solve it? It appears shortly after I start pre-training a language model, and then training stops.
Number of workers: 8
Number of corpus files: 8
torch version: 1.2.0
Thanks for the help.
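As a side note on the open-files angle discussed above: with `num_workers: 8` multiprocess readers, each worker process holds its own file handles, so the per-process descriptor limit that `ulimit -n` adjusts is likely what gets exhausted. That limit can also be checked and raised from inside Python with the standard `resource` module (Linux/macOS only); a minimal sketch:

```python
import resource

# Soft limit: the currently enforced cap on open file descriptors.
# Hard limit: the ceiling an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit, roughly `ulimit -n <hard>`.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```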