allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0
2.04k stars 275 forks

Kernel only works on GPU. Add support for CPU #3

Closed ibeltagy closed 4 years ago

okpatil4u commented 4 years ago

How difficult would it be to add CPU support? I can contribute if you provide some guidelines.

ibeltagy commented 4 years ago

Thanks @okpatil4u for offering to help. Getting it to work on CPU is pretty straightforward, but making it fast is more involved. Here are the steps to get the CPU code to work:

Now for the more involved part: parallelizing the computation and making it fast. TVM has a nice tutorial that explains the TVM syntax for splitting a CPU computation into multiple smaller parallel jobs. I think the schedule that TVM implements for batched_matmul here might work well for our kernel, but it will need a few modifications (it has to support a different input format). So between the tutorial and the batched_matmul schedule, you should be able to write something that is fast enough.
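For reference, here is a minimal sketch of the split/parallel scheduling pattern that tutorial walks through, written against the newer tvm.te API (which may differ from the TVM release this repo pins). The computation is a plain matmul rather than our diagonaled kernel, so treat it as illustrative only:

    import tvm
    from tvm import te

    # Toy matmul scheduled for CPU: split the row axis into chunks and run the
    # outer chunks on parallel threads. The real kernel would need its own
    # compute definition and schedule.
    n, m, k = 1024, 1024, 1024
    A = te.placeholder((n, k), name="A")
    B = te.placeholder((k, m), name="B")
    r = te.reduce_axis((0, k), name="r")
    C = te.compute((n, m), lambda i, j: te.sum(A[i, r] * B[r, j], axis=r), name="C")

    s = te.create_schedule(C.op)
    row_outer, row_inner = s[C].split(C.op.axis[0], factor=32)  # split rows into chunks of 32
    s[C].parallel(row_outer)                                    # run chunks on CPU threads
    func = tvm.build(s, [A, B, C], target="llvm")               # "llvm" = CPU target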

As I said, the second part is more involved, so let's start with the first part first and leave speeding it up to another PR.

okpatil4u commented 4 years ago

Thank you. I will give it a try.

bratao commented 4 years ago

I'll be rooting for you @okpatil4u 🙏

ibeltagy commented 4 years ago

@okpatil4u @bratao, we just added a PyTorch implementation of the sliding window attention that doesn't need the custom CUDA kernel (https://github.com/allenai/longformer/pull/27). Please give it a try and let me know if we still need this.
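For anyone who wants to try it, here is a rough sketch of enabling the new attention mode on CPU. The class and helper names (Longformer, LongformerConfig, pad_to_window_size) and the local paths are assumed from this repo and may not match the README exactly:

    import torch
    from transformers import RobertaTokenizer
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size

    config = LongformerConfig.from_pretrained('longformer-base-4096/')  # path is a placeholder
    config.attention_mode = 'sliding_chunks'   # pure PyTorch sliding window, no CUDA/TVM kernel
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    model.eval()

    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings  # older transformers: tokenizer.max_len

    text = ' '.join(['Hello world!'] * 1000)          # a long document
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    attention_mask = torch.ones_like(input_ids)       # 1 = local (sliding window) attention
    attention_mask[:, 0] = 2                          # 2 = global attention, e.g. the CLS token
    input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

    with torch.no_grad():
        output = model(input_ids, attention_mask=attention_mask)[0]  # runs on CPU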

Akshayextreme commented 4 years ago

@ibeltagy will these lines 1 2 3 from the triviaqa script cause issues when running on CPU?

  1. parser.add_argument("--gpus", type=str, default='0', help="Comma separated list of gpus")
  2. args.gpus = [int(x) for x in args.gpus.split(',')]
  3. trainer = pl.Trainer(gpus=args.gpus, distributed_backend='ddp' if len(args.gpus) > 1 else None)
ibeltagy commented 4 years ago

Yes, those won't work as-is, but it is just configuration in the script. Please try:

trainer = pl.Trainer(gpus=None, distributed_backend=None,
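Spelled out as a standalone snippet (this assumes the pre-1.0 pytorch-lightning API the script appears to use, and omits the script's other Trainer arguments):

    import pytorch_lightning as pl

    # CPU-only configuration: no GPUs and no distributed backend.
    # The script's remaining Trainer arguments are left at their existing values.
    trainer = pl.Trainer(gpus=None, distributed_backend=None)
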
Akshayextreme commented 4 years ago

I tried that already, but then I ran into the error below:

INFO:root:model and trainer restored from checkpoint: /content/longformer/triviaqa-longformer-large/checkpoints/_ckpt_epoch_4_v2.ckpt Testing: 0% 0/1 [00:00<?, ?batch/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detected

I am not sure why it is hitting THCudaCheck at all.

ibeltagy commented 4 years ago

@Akshayextreme, I updated the script to use CPU; try the command-line params --gpus "" --fp32. It seems to work fine and I didn't get the THCudaCheck error you mentioned. Can you post the full error log?

Akshayextreme commented 4 years ago

Here is the complete error log. I used the updated scripts.

Query : python -m triviaqa --save_dir /content/longformer --train_dataset /content/longformer/try-test-wikipedia.json --dev_dataset /content/longformer/try-test-wikipedia.json --gpus "" --num_workers 4 --max_seq_len 4096 --doc_stride -1 --save_prefix triviaqa-longformer-large --model_path /content/longformer/longformer-large-4096 --test --fp32

Logs :

2020-04-30 05:55:33.836144: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
INFO:transformers.configuration_utils:loading configuration file /content/longformer/longformer-large-4096/config.json
INFO:transformers.configuration_utils:Model config { "attention_dilation": [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "attention_mode": "tvm", "attention_probs_dropout_prob": 0.1, "attention_window": [ 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256 ], "autoregressive": false, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "ignore_attention_mask": false, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 4098, "num_attention_heads": 16, "num_hidden_layers": 24, "num_labels": 2, "output_attentions": false, "output_hidden_states": false, "pruned_heads": {}, "torchscript": false, "type_vocab_size": 1, "use_bfloat16": false, "vocab_size": 50265 }

INFO:transformers.modeling_utils:loading weights file /content/longformer/longformer-large-4096/pytorch_model.bin
Loaded model with config: { "attention_dilation": [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "attention_mode": "tvm", "attention_probs_dropout_prob": 0.1, "attention_window": [ 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256 ], "autoregressive": false, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "ignore_attention_mask": false, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 4098, "num_attention_heads": 16, "num_hidden_layers": 24, "num_labels": 2, "output_attentions": false, "output_hidden_states": false, "pruned_heads": {}, "torchscript": false, "type_vocab_size": 1, "use_bfloat16": false, "vocab_size": 50265 }

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/pt_callbacks.py:224: UserWarning: Checkpoint directory /content/longformer/triviaqa-longformer-large/checkpoints exists and is not empty with save_top_k != 0. All files in this directory will be deleted when a checkpoint is saved!
  f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
Namespace(attention_mode='sliding_chunks', attention_window=256, batch_size=8, dev_dataset='/content/longformer/try-test-wikipedia.json', disable_checkpointing=False, doc_stride=-1, epochs=30, fp32=True, gpus=None, ignore_seq_with_no_answers=False, lr=0.0001, max_answer_length=30, max_doc_len=4096, max_num_answers=64, max_question_len=55, max_seq_len=4096, model_path='/content/longformer/longformer-large-4096', n_best_size=20, no_progress_bar=False, num_workers=4, regular_softmax_loss=False, save_dir='/content/longformer', save_prefix='triviaqa-longformer-large', seed=1234, test=True, train_dataset='/content/longformer/try-test-wikipedia.json', val_every=0.2, val_percent_check=1.0, warmup=200)

steps: 414930.0, #epochs: 30, batch_size: 8 <<<<<<<

/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:82: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
reading file: /content/longformer/try-test-wikipedia.json
done reading file: /content/longformer/try-test-wikipedia.json
reading file: /content/longformer/try-test-wikipedia.json
done reading file: /content/longformer/try-test-wikipedia.json
reading file: /content/longformer/try-test-wikipedia.json
done reading file: /content/longformer/try-test-wikipedia.json
INFO:root:
       Name                                     ...  Params
0      model                                    ...  434 M
1      model.embeddings                         ...  55 M
2      model.embeddings.word_embeddings         ...  51 M
3      model.embeddings.position_embeddings     ...  4 M
4      model.embeddings.token_type_embeddings   ...  1 K
..     ...                                      ...  ...
464    model.encoder.layer.23.output.dropout    ...  0
465    model.pooler                             ...  1 M
466    model.pooler.dense                       ...  1 M
467    model.pooler.activation                  ...  0
468    qa_outputs                               ...  2 K

[469 rows x 3 columns]
INFO:root:model and trainer restored from checkpoint: /content/longformer/triviaqa-longformer-large/checkpoints/_ckpt_epoch_4_v2.ckpt
Testing: 0% 0/1 [00:00<?, ?batch/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/longformer/triviaqa.py", line 704, in <module>
    main(args)
  File "/content/longformer/triviaqa.py", line 697, in main
    trainer.test(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 857, in test
    self.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 707, in fit
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 790, in run_pretrain_routine
    self.run_evaluation(test=True)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 305, in run_evaluation
    test)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 234, in evaluate
    test)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in evaluation_forward
    output = model.test_step(*args)
  File "/content/longformer/triviaqa.py", line 505, in test_step
    output = self.forward(input_ids, input_mask, segment_ids, subword_starts, subword_ends)
  File "/content/longformer/triviaqa.py", line 298, in forward
    attention_mask=attention_mask)[0]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_roberta.py", line 177, in forward
    head_mask=head_mask)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 625, in forward
    head_mask=head_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 346, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 324, in forward
    attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 281, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/longformer/longformer/longformer.py", line 140, in forward
    attn_weights = sliding_chunks_matmul_qk(q, k, self.attention_window, padding_value=0)
  File "/content/longformer/longformer/sliding_chunks.py", line 84, in sliding_chunks_matmul_qk
    mask_invalid_locations(diagonal_attn, w, 1, False)
  File "/content/longformer/longformer/diagonaled_mm_tvm.py", line 316, in mask_invalid_locations
    affected_seq_len, beginning_mask, ending_mask = _get_invalid_locations_mask(w, d, autoregressive, input_tensor.device)
  File "/content/longformer/longformer/diagonaled_mm_tvm.py", line 300, in _get_invalid_locations_mask
    mask = _get_invalid_locations_mask_fixed_dilation(affected_seq_len, w, d)
  File "/content/longformer/longformer/diagonaled_mm_tvm.py", line 294, in _get_invalid_locations_mask_fixed_dilation
    return torch.stack(diagonals_list, dim=-1).cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50
Testing: 0%| | 0/1 [00:03<?, ?batch/s]

ibeltagy commented 4 years ago

Fixed. Can you check again?
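For context, the traceback above bottoms out in an unconditional .cuda() call inside diagonaled_mm_tvm.py. The general shape of this kind of fix is to put helper tensors on the input's device instead; here is a hypothetical sketch of that pattern (illustrative only, not the actual commit):

    import torch

    def build_mask_like(input_tensor: torch.Tensor, seq_len: int, w: int) -> torch.Tensor:
        # Hypothetical helper illustrating the pattern: allocate on CPU, then
        # move the mask to whatever device the input lives on, instead of
        # calling .cuda() unconditionally.
        mask = torch.ones(seq_len, w, dtype=torch.bool)
        return mask.to(input_tensor.device)

    x = torch.randn(2, 8, 16)                         # CPU tensor; no GPU required
    print(build_mask_like(x, seq_len=8, w=4).device)  # -> cpu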

Akshayextreme commented 4 years ago

It worked! Thanks! I suggest updating the cheatsheet to cover running the pretrained TriviaQA large model on CPU.

Adrian-1234 commented 4 years ago

Hi,

Regarding step "3. Run the model" in the https://github.com/allenai/longformer README:

How might I run that small example on CPUs?

Many thanks in advance.

ibeltagy commented 4 years ago

@Adrian-1234, maybe I am missing something, but the example in the README already runs on CPU.

Adrian-1234 commented 4 years ago

Hi,
I get no output from the test (apart from the warnings below):

$ python3 y.py   (y.py is exactly as per the example given)
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/adrian/.local/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
$

I was expecting it to print "Hello world!" once? I have pip-installed the requirements.

Printing output and attention_mask I get:

tensor([[[-0.0487, -0.0083,  0.0357,  ..., -0.0348, -0.0800, -0.0212],
         [-0.1541,  0.2812,  0.2079,  ...,  0.3218,  0.0356,  0.0424],
         [-0.0806,  0.0276,  0.1017,  ..., -0.3952, -0.0781,  0.3135],
         ...,
         [-0.0236,  0.0741, -0.0145,  ..., -0.0990, -0.0409, -0.0745],
         [-0.0236,  0.0741, -0.0145,  ..., -0.0990, -0.0409, -0.0745],
         [-0.0236,  0.0741, -0.0145,  ..., -0.0990, -0.0409, -0.0745]]], grad_fn=<...>)
tensor([[1, 2, 1,  ..., 0, 0, 0]])

Thanks.

ibeltagy commented 4 years ago

I haven't seen that warning before, but it looks like a known issue; it is discussed here.

The output and attention_mask are tensors of numbers, which is exactly what you got; they are not the string "Hello world!".
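As a quick illustration of what that second tensor encodes (the value convention below is assumed from this repo's attention_mask usage and worth double-checking against longformer.py):

    import torch

    # The attention_mask printed above follows this repo's convention:
    # 0 = padding (no attention), 1 = local sliding-window attention, 2 = global attention.
    attention_mask = torch.tensor([[1, 2, 1, 1, 0, 0]])
    print(attention_mask.unique())  # tensor([0, 1, 2])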

Adrian-1234 commented 4 years ago

OK, thanks. So the code is working correctly on CPUs then.

ibeltagy commented 4 years ago

Won't fix now that we have the sliding_chunks implementation working on CPU.