Support with CUDA 10.1?

Yichabod commented 4 years ago

Thank you for releasing your code with the accompanying instructions! Unfortunately I'm having a fair bit of trouble running your implementation locally. Here is the relevant machine information:

OS: "Debian GNU/Linux 9 (stretch)" (running virtually via GCP)
GPU: NVIDIA-SMI 418.87.01, Driver Version: 418.87.01, CUDA Version: 10.1 (running on a K80)

After git cloning the repo, I followed the first four instructions:

conda create -n scirex python=3.7
pip install -r requirements.txt
python -m spacy download en
tar -xvzf scirex_dataset/release_data.tar.gz --directory scirex_dataset

and there was no problem.

(I'll run through how I resolved the first two main errors I encountered, in case it helps someone else.)

However, when running CUDA_DEVICE=0 bash scirex/commands/train_scirex_model.sh main, I got an error that no GPU was specified. After a lot of googling, I realized that PyTorch had to be downgraded to a build that works with CUDA 10.1 (the default build targets 10.2), so I reran the PyTorch installation command with the correct settings and the error went away.

The next error was a cupy.cuda.runtime.PointerAttributes error. I realised that cupy also had to be downgraded to match CUDA 10.1, so I installed cupy-cuda101==7.30, which resolved the PointerAttributes error.
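As a rough illustration of the version matching described above, here is a minimal Python sketch. It relies on two naming conventions visible in this thread: the cupy wheel for CUDA 10.1 is called cupy-cuda101, and the PyTorch build in the log below reports itself as 1.6.0+cu101. The helper names (cuda_tag, cupy_wheel_name, torch_build_matches) are made up for this example, and newer cupy or PyTorch releases may name their builds differently.

```python
def cuda_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as '10.1' into the compact tag '101'."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"{int(major)}{int(minor)}"


def cupy_wheel_name(cuda_version: str) -> str:
    """Name of the cupy wheel for a CUDA version (convention assumed from this thread)."""
    return f"cupy-cuda{cuda_tag(cuda_version)}"


def torch_build_matches(torch_version: str, cuda_version: str) -> bool:
    """True if the PyTorch version string carries the '+cuXYZ' tag for this CUDA version."""
    return torch_version.endswith(f"+cu{cuda_tag(cuda_version)}")


if __name__ == "__main__":
    print(cupy_wheel_name("10.1"))
    print(torch_build_matches("1.6.0+cu101", "10.1"))
```

The CUDA version to pass in is the one nvidia-smi reports, and the PyTorch string is the one AllenNLP logs as "Pytorch version" at startup.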

Now the error I'm stuck on seems to be that I don't have scibert_scivocab_uncased installed. I went to the AllenNLP page, downloaded scibert_scivocab_uncased using wget, and uncompressed it (so there is now a folder called 'scibert_scivocab_uncased' inside the repo, containing weights.tar.gz and vocab.txt), but the same error still occurs. Here is the stack trace:


2020-08-05 15:18:57,835 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-08-05 15:18:58,459 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-08-05 15:18:58,463 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-08-05 15:18:58,667 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-08-05 15:18:58,668 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-08-05 15:18:58,669 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-08-05 15:18:58,669 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-08-05 15:18:59,535 - INFO - allennlp.common.params - random_seed = 13370
2020-08-05 15:18:59,535 - INFO - allennlp.common.params - numpy_seed = 1337
2020-08-05 15:18:59,535 - INFO - allennlp.common.params - pytorch_seed = 133
2020-08-05 15:19:01,499 - INFO - allennlp.common.checks - Pytorch version: 1.6.0+cu101
2020-08-05 15:19:01,500 - INFO - allennlp.common.params - evaluate_on_test = True
2020-08-05 15:19:01,500 - INFO - allennlp.common.params - validation_dataset_reader = None
2020-08-05 15:19:01,500 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'token_indexers': {'bert': {'do_lowercase': 'true', 'pretrained_model': '/scivocab_uncased.vocab', 'truncate_long_sequences': False, 'type': 'bert-pretrained', 'use_starting_offsets': True}}, 'type': 'scirex_full_reader'} and extras set()
2020-08-05 15:19:01,501 - INFO - allennlp.common.params - dataset_reader.type = scirex_full_reader
2020-08-05 15:19:01,501 - INFO - allennlp.common.from_params - instantiating class <class 'scirex.data.dataset_readers.scirex_full_reader.ScirexFullReader'> from params {'token_indexers': {'bert': {'do_lowercase': 'true', 'pretrained_model': '/scivocab_uncased.vocab', 'truncate_long_sequences': False, 'type': 'bert-pretrained', 'use_starting_offsets': True}}} and extras set()
2020-08-05 15:19:01,501 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.token_indexer.TokenIndexer'> from params {'do_lowercase': 'true', 'pretrained_model': '/scivocab_uncased.vocab', 'truncate_long_sequences': False, 'type': 'bert-pretrained', 'use_starting_offsets': True} and extras set()
2020-08-05 15:19:01,501 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.type = bert-pretrained
2020-08-05 15:19:01,502 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer'> from params {'do_lowercase': 'true', 'pretrained_model': '/scivocab_uncased.vocab', 'truncate_long_sequences': False, 'use_starting_offsets': True} and extras set()
2020-08-05 15:19:01,502 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.pretrained_model = /scivocab_uncased.vocab
2020-08-05 15:19:01,502 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.use_starting_offsets = True
2020-08-05 15:19:01,502 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.do_lowercase = true
2020-08-05 15:19:01,502 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.never_lowercase = None
2020-08-05 15:19:01,503 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.max_pieces = 512
2020-08-05 15:19:01,503 - INFO - allennlp.common.params - dataset_reader.token_indexers.bert.truncate_long_sequences = False
2020-08-05 15:19:01,503 - ERROR - pytorch_pretrained_bert.tokenization - Model name '/scivocab_uncased.vocab' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed '/scivocab_uncased.vocab' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "/opt/conda/envs/scirex/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 226, in train_model
    cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer_pieces.py", line 42, in from_params
    all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/util.py", line 165, in datasets_from_params
    dataset_reader = DatasetReader.from_params(dataset_reader_params)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 386, in from_params
    kwargs = create_kwargs(cls, params, **extras)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 133, in create_kwargs
    kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 257, in construct_arg
    value_dict[key] = value_cls.from_params(params=value_params, **subextras)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/from_params.py", line 388, in from_params
    return cls(**kwargs)  # type: ignore
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/token_indexers/wordpiece_indexer.py", line 345, in __init__
    super().__init__(vocab=bert_tokenizer.vocab,
AttributeError: 'NoneType' object has no attribute 'vocab'

successar commented 4 years ago

Hi

Thanks for bringing this to my attention. I have updated the README with instructions on using SciBERT. Basically, you need to set an environment variable BERT_BASE_FOLDER to the path of your SciBERT download. The folder should contain two files, vocab.txt and weights.tar.gz (I have added a link to this download in the README). I have also updated some commands to make use of this correctly, so please pull the repo before running again.
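For anyone who wants to check this setup before launching training, here is a small Python sketch. It only tests what is described above: that BERT_BASE_FOLDER is set and that the folder holds vocab.txt and weights.tar.gz. The function names are invented for this example and are not part of the SciREX code.

```python
import os
from pathlib import Path

# The two files the folder is expected to contain, per the comment above.
REQUIRED_FILES = ("vocab.txt", "weights.tar.gz")


def missing_scibert_files(folder):
    """Return the required file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def check_bert_base_folder(env=None):
    """Return a description of the problem, or None if the setup looks usable."""
    env = os.environ if env is None else env
    folder = env.get("BERT_BASE_FOLDER")
    if not folder:
        return "BERT_BASE_FOLDER is not set"
    missing = missing_scibert_files(folder)
    if missing:
        return f"{folder} is missing: {', '.join(missing)}"
    return None


if __name__ == "__main__":
    print(check_bert_base_folder() or "BERT_BASE_FOLDER looks usable")
```

Run it in the same shell that will launch the training script, so that it sees the same environment variables.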

Yichabod commented 4 years ago

Thank you for correcting this so quickly! Following your instructions now does lead to training, but I'm encountering another error midway through training:

2020-08-05 21:25:19,505 - WARNING - allennlp.training.util - Metrics with names beginning with "_" will not be logged to the tqdm progress bar.
validation_metric: 24.1293, loss: 744.0142 ||:   0%|          | 1/315 [00:05<30:56,  5.91s
validation_metric: 24.3066, loss: 655.7399 ||:   1%|          | 2/315 [00:09<26:50,  5.15s
validation_metric: 24.3811, loss: 603.2519 ||:   1%|          | 3/315 [00:14<26:53,  5.17s
validation_metric: 21.8290, loss: 567.4237 ||:   1%|1         | 4/315 [00:19<26:51,  5.18s
validation_metric: 21.7070, loss: 521.6448 ||:   2%|1         | 5/315 [00:23<24:40,  4.78s
validation_metric: 21.1453, loss: 483.5662 ||:   2%|1         | 6/315 [00:28<25:08,  4.88s
validation_metric: 20.2238, loss: 439.4254 ||:   2%|2         | 7/315 [00:33<24:24,  4.76s
validation_metric: 19.7650, loss: 414.9508 ||:   3%|2         | 8/315 [00:36<22:20,  4.37s
validation_metric: 19.6044, loss: 382.2387 ||:   3%|2         | 9/315 [00:42<25:10,  4.94s
validation_metric: 19.5697, loss: 357.3543 ||:   3%|3         | 10/315 [00:44<20:50,  4.10
validation_metric: 19.2033, loss: 341.7199 ||:   3%|3         | 11/315 [00:48<19:55,  3.93
validation_metric: 18.7423, loss: 329.7398 ||:   4%|3         | 12/315 [00:52<20:37,  4.09
validation_metric: 18.4675, loss: 316.9455 ||:   4%|4         | 13/315 [00:57<20:38,  4.10
validation_metric: 19.4792, loss: 304.6804 ||:   4%|4         | 14/315 [01:01<20:44,  4.13
validation_metric: 20.3388, loss: 294.1885 ||:   5%|4         | 15/315 [01:05<20:34,  4.12s/it]
Traceback (most recent call last):
  File "/opt/conda/envs/scirex/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./scirex/models/scirex_model.py", line 90, in forward
    output_embedding = self.embedding_forward(text)
  File "./scirex/models/scirex_model.py", line 121, in embedding_forward
    text_embeddings = self._lexical_dropout(self._text_field_embedder(text))
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 118, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/modules/token_embedders/bert_token_embedder.py", line 181, in forward
    unpacked_embeddings = torch.cat(unpacked_embeddings, dim=2)
RuntimeError: CUDA out of memory. Tried to allocate 900.00 MiB (GPU 0; 11.17 GiB total capacity; 10.22 GiB already allocated; 261.81 MiB free; 10.55 GiB reserved in total by PyTorch)

successar commented 4 years ago

Hi

Our model can only be trained on 48 GB GPUs, since we apply BERT to whole documents (>5000 words on average). You can try reducing the batch size here https://github.com/allenai/SciREX/blob/eb9f6f31c94db94a2c68698c1c047e3140354da6/scirex/training_config/template_full.libsonnet#L98 but I can't say how good the results will be then.
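To put numbers on the gap, the out-of-memory message reported above can be read mechanically. The sketch below is only an illustration: it assumes the exact message wording shown in this thread (other PyTorch versions phrase it differently), and the function name parse_oom is hypothetical.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
_UNITS_IN_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

# Matches the requested allocation, the total capacity and the free memory.
_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) (?P<requested_unit>[KMG]iB)"
    r".*?(?P<total>[\d.]+) (?P<total_unit>[KMG]iB) total capacity"
    r".*?(?P<free>[\d.]+) (?P<free_unit>[KMG]iB) free"
)


def parse_oom(message):
    """Extract sizes (in MiB) from a CUDA OOM message, or return None if it does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None

    def mib(name):
        return float(match[name]) * _UNITS_IN_MIB[match[name + "_unit"]]

    requested, total, free = mib("requested"), mib("total"), mib("free")
    return {
        "requested_mib": requested,
        "total_mib": total,
        "free_mib": free,
        "shortfall_mib": requested - free,  # how much more memory was needed
    }


if __name__ == "__main__":
    message = (
        "CUDA out of memory. Tried to allocate 900.00 MiB (GPU 0; 11.17 GiB total "
        "capacity; 10.22 GiB already allocated; 261.81 MiB free; 10.55 GiB reserved "
        "in total by PyTorch)"
    )
    print(parse_oom(message))
```

For the message in this thread, the failed request was 900 MiB with about 262 MiB free on a card with roughly 11 GiB in total, which is far below the 48 GB mentioned above.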

Yichabod commented 4 years ago

I decreased the batch_size to 10 as you suggested, but now there is a new bug. (I would fix it on my own if it were more obvious, and I really appreciate you helping out!)

validation_metric: 41.6282, loss: 85.2457 ||:  17%|#7        | 177/1036 [05:29<26:52,  1.8
validation_metric: 41.7229, loss: 85.1601 ||:  17%|#7        | 178/1036 [05:31<28:01,  1.9
validation_metric: 41.7066, loss: 84.7656 ||:  17%|#7        | 179/1036 [05:32<23:56,  1.68s/it]
Traceback (most recent call last):
  File "/opt/conda/envs/scirex/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/training/trainer.py", line 313, in _train_epoch
    for batch_group in train_generator_tqdm:
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/common/util.py", line 104, in <lambda>
    return iter(lambda: list(islice(iterator, 0, group_size)), [])
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/iterators/data_iterator.py", line 153, in __call__
    tensor_dict = batch.as_tensor_dict(padding_lengths)
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/dataset.py", line 137, in as_tensor_dict
    for field, tensors in instance.as_tensor_dict(lengths_to_use).items():
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/instance.py", line 97, in as_tensor_dict
    tensors[field_name] = field.as_tensor(padding_lengths[field_name])
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/fields/list_field.py", line 93, in as_tensor
    for field in padded_field_list]
  File "/opt/conda/envs/scirex/lib/python3.7/site-packages/allennlp/data/fields/list_field.py", line 93, in <listcomp>
    for field in padded_field_list]
  File "./scirex/data/dataset_readers/multi_label_field.py", line 125, in as_tensor
    tensor.scatter_(0, torch.LongTensor(self._label_ids), 1)
RuntimeError: Expected index [4] to be smaller than self [3] apart from dimension 0 and to be smaller size than src [3]

Thank you once again!
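For readers hitting the same message: the traceback ends in multi_label_field.py at tensor.scatter_(0, torch.LongTensor(self._label_ids), 1), and the error reports an index of size 4 against a tensor of size 3. One way this could happen (an assumption, not something confirmed in this thread) is a list of label ids that is longer than the number of labels, for example because it contains duplicates. The sketch below shows the multi-hot encoding idea in plain Python, not torch, with duplicates removed and ids validated first. The function name multi_hot is hypothetical and this is not the repository's fix.

```python
from typing import Iterable, List


def multi_hot(label_ids: Iterable[int], num_labels: int) -> List[int]:
    """Build a 0/1 vector of length num_labels with a 1 at every given label id."""
    # Dropping duplicates guarantees there are never more ids than labels.
    unique_ids = sorted(set(label_ids))
    for label_id in unique_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(
                f"label id {label_id} is out of range for {num_labels} labels"
            )
    vector = [0] * num_labels
    for label_id in unique_ids:
        vector[label_id] = 1
    return vector


if __name__ == "__main__":
    # Four ids for three labels: the duplicate 2 is collapsed before encoding.
    print(multi_hot([0, 2, 2, 1], num_labels=3))
```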

Yichabod commented 4 years ago

I tried setting batch_size to 10 and then to 1, and also set the validation_iterator batch_size to match the iterator batch size, but I am getting the same error as above. Any idea why?

successar commented 4 years ago

I have updated the code to take care of this issue. Please let me know if you still face the same problem.

Yichabod commented 4 years ago

This part is fixed. Thanks!

allenai / SciREX

Support with CUDA 10.1? #10