allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

[WIP] Running longformer on TPU using pytorch/xla #38

Open pchankh opened 4 years ago

pchankh commented 4 years ago

We tried running a wrapped Longformer model on a Colab TPU and got the following error:

```
Tvm binary not found. Compiling ...
Exception in device=TPU:0: cannot import name 'nvcc'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "", line 66, in _mp_fn
    fitter.fit(train_loader, validation_loader)
  File "", line 47, in fit
    losses, final_scores = self.train_one_epoch(para_loader.per_device_loader(self.device))
  File "", line 120, in train_one_epoch
    outputs = self.model(inputs, attention_masks)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 558, in __call__
    result = self.forward(*input, **kwargs)
  File "", line 26, in forward
    seqx, = self.backbone(input_ids=input_ids, attention_mask=attention_masks)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 558, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 790, in forward
    ....
```

Any pointers on how to work around this error would be appreciated. Thanks.

ibeltagy commented 4 years ago

Very cool that you are trying to get it to work on a TPU. I am curious to see how this will go.

About your error: it looks like you are trying to run the custom CUDA kernel on a TPU, which expectedly won't work. We added an implementation that doesn't require the custom CUDA kernel, and you need to switch to that. Install the latest version of the code with `pip install --upgrade git+https://github.com/allenai/longformer.git`, then follow the updated example in the README.
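For reference, the non-CUDA path in the README looks roughly like the sketch below. The checkpoint directory `longformer-base-4096/` and the choice of global-attention positions are placeholders; adapt them to your setup.

```python
import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/')  # path to the released weights
config.attention_mode = 'sliding_chunks'  # PyTorch sliding-window attention, no custom CUDA kernel

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

input_ids = torch.tensor(tokenizer.encode('Hello world! ' * 1000)).unsqueeze(0)  # batch of 1 long document
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)  # 1 = local (sliding window) attention
attention_mask[:, 0] = 2  # 2 = global attention, e.g. on the <s> token for classification

# 'sliding_chunks' needs the sequence length padded to a multiple of the attention window
input_ids, attention_mask = pad_to_window_size(
    input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

output = model(input_ids, attention_mask=attention_mask)[0]
```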

pchankh commented 4 years ago

Thanks. Quoting the README update ("New April 27th, 2020: A PyTorch implementation of the sliding window attention"):

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

For the above, how do we choose the PyTorch implementation? Do we still use config.attention_mode = 'sliding_chunks'?

Many thanks.

ibeltagy commented 4 years ago

yes

ibeltagy commented 4 years ago

The as_strided trick is not supported in pytorch/xla, so it has to be replaced with torch.unfold, but pytorch/xla doesn't have a lowering for torch.unfold yet either. Relevant issue: https://github.com/pytorch/xla/issues/2239
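To make the issue concrete, here is a small, self-contained illustration of the chunking idea (not the actual library code): the sliding_chunks path builds overlapping chunks with `as_strided`, and the same view can be expressed with `Tensor.unfold`, which is the op that still needs an XLA lowering.

```python
import torch

def chunk_as_strided(x, w):
    # Overlapping chunks of size 2w with overlap w, built with as_strided
    # (the trick pytorch/xla cannot lower).
    x = x.view(x.size(0), x.size(1) // (2 * w), 2 * w, x.size(2))
    size = list(x.size())
    size[1] = size[1] * 2 - 1
    stride = list(x.stride())
    stride[1] = stride[1] // 2
    return x.as_strided(size=size, stride=stride)

def chunk_unfold(x, w):
    # The same overlapping chunks via Tensor.unfold (needs an XLA lowering).
    return x.unfold(1, 2 * w, w).transpose(2, 3)

x = torch.randn(2, 8, 4)  # (batch, seq_len, hidden); seq_len must be a multiple of 2*w
w = 2
assert torch.equal(chunk_as_strided(x, w), chunk_unfold(x, w))
```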

ibeltagy commented 4 years ago

In case you are still interested, we have a working version in this branch https://github.com/allenai/longformer/tree/trainer. We are going to clean it up and merge it into master soon, but it is usable as is.
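For anyone following along, the overall shape of a pytorch/xla training loop for this looks something like the sketch below. This is not the trainer-branch code; `build_model` and `build_train_loader` are hypothetical stand-ins for your own Longformer wrapper and DataLoader.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = build_model().to(device)   # hypothetical: your Longformer classifier wrapper
    loader = build_train_loader()      # hypothetical: your torch DataLoader
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    model.train()
    para_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
    for input_ids, attention_mask, labels in para_loader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        xm.optimizer_step(optimizer)   # all-reduces gradients across TPU cores

if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)        # one process per TPU core on a v3-8
```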