Yichabod closed this issue 4 years ago
Hi
Thanks for bringing this to my attention. I have updated the README with instructions on using SciBERT. Basically, you need to set an environment variable BERT_BASE_FOLDER to the path of your SciBERT download. The folder should contain two files, vocab.txt and weights.tar.gz (I have added a link to this download in the README). I have also updated some commands to use this correctly, so please pull the repo before running again.
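A quick way to confirm the setup described above before starting a long training run is to check the variable and the two files up front. This is only a minimal sketch: the variable name and file names come from the comment above, while the function name `missing_scibert_files` is made up for illustration and is not part of the SciREX repo.

```python
import os
from pathlib import Path

# File names taken from the comment above; everything else here is illustrative.
REQUIRED_FILES = ("vocab.txt", "weights.tar.gz")


def missing_scibert_files(env=None):
    """Return a list of problems with BERT_BASE_FOLDER (empty list = looks usable)."""
    env = os.environ if env is None else env
    folder = env.get("BERT_BASE_FOLDER")
    if not folder:
        return ["BERT_BASE_FOLDER is not set"]
    base = Path(folder)
    if not base.is_dir():
        return [f"{base} is not a directory"]
    # Report every expected file that is absent from the folder.
    return [f"missing {name}" for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    problems = missing_scibert_files()
    for problem in problems:
        print(problem)
    if not problems:
        print("BERT_BASE_FOLDER looks fine")
```

An empty result only means the files exist where expected; it does not verify that the archive contents are valid.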
Thank you for correcting this so quickly! Following your instructions now does lead to training, but I'm encountering another error midway through training:
2020-08-05 21:25:19,505 - WARNING - allennlp.training.util - Metrics with names beginning with "_" will not be logged to the tqdm progress bar.
validation_metric: 24.1293, loss: 744.0142 ||: 0%| | 1/315 [00:05<30:56, 5.91s
validation_metric: 24.3066, loss: 655.7399 ||: 1%| | 2/315 [00:09<26:50, 5.15s
validation_metric: 24.3811, loss: 603.2519 ||: 1%| | 3/315 [00:14<26:53, 5.17s
validation_metric: 21.8290, loss: 567.4237 ||: 1%|1 | 4/315 [00:19<26:51, 5.18s
validation_metric: 21.7070, loss: 521.6448 ||: 2%|1 | 5/315 [00:23<24:40, 4.78s
validation_metric: 21.1453, loss: 483.5662 ||: 2%|1 | 6/315 [00:28<25:08, 4.88s
validation_metric: 20.2238, loss: 439.4254 ||: 2%|2 | 7/315 [00:33<24:24, 4.76s
validation_metric: 19.7650, loss: 414.9508 ||: 3%|2 | 8/315 [00:36<22:20, 4.37s
validation_metric: 19.6044, loss: 382.2387 ||: 3%|2 | 9/315 [00:42<25:10, 4.94s
validation_metric: 19.5697, loss: 357.3543 ||: 3%|3 | 10/315 [00:44<20:50, 4.10
validation_metric: 19.2033, loss: 341.7199 ||: 3%|3 | 11/315 [00:48<19:55, 3.93
validation_metric: 18.7423, loss: 329.7398 ||: 4%|3 | 12/315 [00:52<20:37, 4.09
validation_metric: 18.4675, loss: 316.9455 ||: 4%|4 | 13/315 [00:57<20:38, 4.10
validation_metric: 19.4792, loss: 304.6804 ||: 4%|4 | 14/315 [01:01<20:44, 4.13
validation_metric: 20.3388, loss: 294.1885 ||: 5%|4 | 15/315 [01:05<20:34, 4.12s/it]
Traceback (most recent call last):
File "/opt/conda/envs/scirex/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
output_dict = self.model(**batch)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "./scirex/models/scirex_model.py", line 90, in forward
output_embedding = self.embedding_forward(text)
File "./scirex/models/scirex_model.py", line 121, in embedding_forward
text_embeddings = self._lexical_dropout(self._text_field_embedder(text))
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 118, in forward
token_vectors = embedder(*tensors, **forward_params_values)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/modules/token_embedders/bert_token_embedder.py", line 181, in forward
unpacked_embeddings = torch.cat(unpacked_embeddings, dim=2)
RuntimeError: CUDA out of memory. Tried to allocate 900.00 MiB (GPU 0; 11.17 GiB total capacity; 10.22 GiB already allocated; 261.81 MiB free; 10.55 GiB reserved in total by PyTorch)
Hi
Our model can only be trained on 48 GB GPUs, since we apply BERT to whole documents (>5000 words on average). You can try reducing the batch size here: https://github.com/allenai/SciREX/blob/eb9f6f31c94db94a2c68698c1c047e3140354da6/scirex/training_config/template_full.libsonnet#L98 but I can't say how good the results will be then.
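To see how far the reported card is from that requirement, the numbers can be pulled straight out of the out-of-memory message. The sketch below assumes the message keeps the wording shown in the traceback earlier in this thread (other PyTorch versions may phrase it differently); the function name `summarize_oom` is hypothetical.

```python
import re

# Matches the message format seen in this thread; an assumption, not a stable API.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB) "
    r"\(GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) (?P<total_unit>[MG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[MG]iB) free"
)


def to_mib(value, unit):
    """Convert a MiB or GiB figure to MiB."""
    return float(value) * (1024.0 if unit == "GiB" else 1.0)


def summarize_oom(message):
    """Return the sizes from a CUDA OOM message in MiB, or None if it does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    requested = to_mib(match["request"], match["request_unit"])
    free = to_mib(match["free"], match["free_unit"])
    return {
        "gpu": int(match["gpu"]),
        "requested_mib": requested,
        "total_mib": to_mib(match["total"], match["total_unit"]),
        "allocated_mib": to_mib(match["allocated"], match["allocated_unit"]),
        "free_mib": free,
        # How much more memory this single allocation needed than was free.
        "shortfall_mib": requested - free,
    }


if __name__ == "__main__":
    message = (
        "CUDA out of memory. Tried to allocate 900.00 MiB (GPU 0; 11.17 GiB total "
        "capacity; 10.22 GiB already allocated; 261.81 MiB free; 10.55 GiB reserved "
        "in total by PyTorch)"
    )
    print(summarize_oom(message))
```

The shortfall only describes the one allocation that failed, so it is a lower bound on the extra memory a full training step would need.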
I decreased the batch_size to 10 as you suggested, but now there is a new bug (I would fix it on my own if it were more obvious, and I really appreciate you helping out!):
validation_metric: 41.6282, loss: 85.2457 ||: 17%|#7 | 177/1036 [05:29<26:52, 1.8
validation_metric: 41.7229, loss: 85.1601 ||: 17%|#7 | 178/1036 [05:31<28:01, 1.9
validation_metric: 41.7066, loss: 84.7656 ||: 17%|#7 | 179/1036 [05:32<23:56, 1.68s/it]
Traceback (most recent call last):
File "/opt/conda/envs/scirex/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 313, in _train_epoch
for batch_group in train_generator_tqdm:
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
for obj in iterable:
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/util.py", line 104, in <lambda>
return iter(lambda: list(islice(iterator, 0, group_size)), [])
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/iterators/data_iterator.py", line 153, in __call__
tensor_dict = batch.as_tensor_dict(padding_lengths)
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/dataset.py", line 137, in as_tensor_dict
for field, tensors in instance.as_tensor_dict(lengths_to_use).items():
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/instance.py", line 97, in as_tensor_dict
tensors[field_name] = field.as_tensor(padding_lengths[field_name])
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/fields/list_field.py", line 93, in as_tensor
for field in padded_field_list]
File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/fields/list_field.py", line 93, in <listcomp>
for field in padded_field_list]
File "./scirex/data/dataset_readers/multi_label_field.py", line 125, in as_tensor
tensor.scatter_(0, torch.LongTensor(self._label_ids), 1)
RuntimeError: Expected index [4] to be smaller than self [3] apart from dimension 0 and to be smaller size than src [3]
Thank you once again!
I tried setting batch_size to 10, and then to 1, and also changed the validation_iterator batch_size to match the iterator batch size, but I am getting the same error as above. Any idea why?
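The traceback above fails inside a single field's as_tensor call, which is built per instance, so it is plausible that batch size has no bearing on it. One possible reading of the message (an assumption; the thread does not say what the actual fix was) is that an instance carried four label ids for a label vector with only three slots, for example because of a duplicate. The sketch below uses plain Python lists rather than torch tensors and shows a multi-hot encoding that tolerates duplicates and reports out-of-range ids explicitly; `multi_hot` is an illustrative name, not the repo's code.

```python
def multi_hot(label_ids, num_labels):
    """Build a 0/1 list of length num_labels with ones at the given label ids."""
    # Duplicates add nothing to a multi-hot vector, so collapse them first.
    unique_ids = sorted(set(label_ids))
    out_of_range = [i for i in unique_ids if not 0 <= i < num_labels]
    if out_of_range:
        raise ValueError(
            f"label ids {out_of_range} fall outside the range 0..{num_labels - 1}"
        )
    vector = [0] * num_labels
    for i in unique_ids:
        vector[i] = 1
    return vector


if __name__ == "__main__":
    # Four ids for three labels, the same shape mismatch as in the error message.
    print(multi_hot([0, 2, 2, 1], num_labels=3))
```

Checking the ids before building the tensor turns an opaque shape error into a message that names the offending labels.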
I have updated the code to take care of this issue. Please let me know if you still face the same problem.
This part is fixed. Thanks!
Thank you guys for releasing your code with the accompanying instructions! Unfortunately I'm having a fair bit of trouble trying to run your implementation locally. Here is the relevant machine information:
OS: "Debian GNU/Linux 9 (stretch)" (running virtually via GCP)
GPU: NVIDIA-SMI 418.87.01, Driver Version: 418.87.01, CUDA Version: 10.1 (running on a K80)
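The CUDA version that nvidia-smi prints describes the driver, while the version that matters for PyTorch is the one its installed build was compiled against. A small check like the following can show both sides at a glance. It is only a sketch: `report_cuda_setup` is a hypothetical helper, and it returns early if torch is not installed in the current environment.

```python
import importlib


def report_cuda_setup():
    """Collect what the installed PyTorch build reports about CUDA, if torch is present."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch_version": None}
    return {
        "torch_version": torch.__version__,
        # CUDA toolkit version the PyTorch build targets (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # Whether PyTorch can actually see a usable GPU with the current driver.
        "gpu_visible": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in report_cuda_setup().items():
        print(f"{key}: {value}")
```

If `cuda_build` is newer than what the driver supports, `gpu_visible` will typically be False even though the GPU itself is healthy.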
After git cloning the repo, I followed the first four instructions:
and there was no problem.
(I'm going to run through how I resolved the first two main errors I encountered, in case it is helpful to someone else.) However, when I ran
CUDA_DEVICE=0 bash scirex/commands/train_scirex_model.sh main
I got an error that there was no GPU specified. After a lot of googling, I realized that PyTorch had to be downgraded in order to work with CUDA 10.1 (the default is 10.2), so I followed the PyTorch installation command with the correct settings and the error went away. The next error I got was a cupy.cuda.runtime.PointerAttributes error, and I realised that I also needed to downgrade cupy, so I installed cupy-cuda101==7.30 to make cupy work with CUDA 10.1. That resolved the PointerAttributes error.
Now, the error that I'm stuck with seems to be that I don't have scibert_scivocab_uncased installed. I went to the AllenNLP page, downloaded scibert_scivocab_uncased using wget and uncompressed it (so that there is now a folder called 'scibert_scivocab_uncased' inside the repo, with weights.tar.gz and vocab.txt inside), but the same error still occurs. Here is the stack trace: