Hi @ibeltagy,

I am planning to use Longformer as a backbone architecture in a domain other than NLP, training it from scratch on a different type of data. I am currently using the Hugging Face version of the model, which appears to have been created by you. I was wondering: is there any concrete benefit to using this repository's implementation instead of the HF one?
The only relevant information I could find on this is in the HF documentation:
The self-attention module :obj:`LongformerSelfAttention` implemented here supports the combination of local and
global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive and
dilated attention are more relevant for autoregressive language modeling than finetuning on downstream tasks.
Future release will add support for autoregressive attention, but the support for dilated attention requires a
custom CUDA kernel to be memory and compute efficient.
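To make the quoted behaviour concrete, here is a minimal sketch of the attention pattern that combines local (sliding-window) and global attention, as described above. The function name and parameters are purely illustrative and not part of either library; this only builds the boolean "who may attend to whom" pattern, not an efficient kernel:

```python
def longformer_attention_pattern(seq_len, window, global_positions):
    """Return a seq_len x seq_len boolean grid: entry [i][j] is True
    when query position i may attend to key position j.

    Local attention: each token attends within a sliding window.
    Global attention: the chosen positions attend to every token
    and are attended to by every token.
    (Illustrative helper, not an API from longformer/transformers.)
    """
    half = window // 2
    # sliding-window (local) pattern
    mask = [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
    # overlay global attention rows and columns
    for g in global_positions:
        for k in range(seq_len):
            mask[g][k] = True  # global token attends to all positions
            mask[k][g] = True  # all positions attend to the global token
    return mask

# e.g. 8 tokens, window of 4, position 0 marked global (like a CLS token)
mask = longformer_attention_pattern(seq_len=8, window=4, global_positions=[0])
for row in mask:
    print("".join("x" if m else "." for m in row))
```

This is the pattern the docstring refers to: a banded matrix from the local window, plus fully dense rows and columns for each global position. Autoregressive attention would additionally mask out `j > i`, and dilated attention would skip positions inside the window, which is what requires the custom CUDA kernel to stay efficient.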