@matt-peters during issue review it came up that you might propose standardizing on the openai transformer (assuming experiments work out). Is that accurate?
Ideally we could just use another fully featured pytorch transformer implementation (e.g. https://github.com/pytorch/fairseq or possibly https://github.com/huggingface/pytorch-pretrained-BERT) and not have to worry about maintaining the code in allennlp. fairseq would also give us optimized CNNs and fp16 in addition to self-attention.
FWIW, it looks like the new PyTorch 1.1 includes a torch.nn.MultiheadAttention module. Moreover, this module landed as part of a bigger Transformer PR.
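For anyone who wants to try it, here is a minimal sketch of how the new module is called (shapes and defaults are taken from the PyTorch 1.1 docs; inputs are sequence-first, since `batch_first` did not exist yet):

```python
import torch
import torch.nn as nn

# Minimal sketch of the nn.MultiheadAttention API added in PyTorch 1.1.
# Shapes are (seq_len, batch, embed_dim).
embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.1)

x = torch.randn(20, 4, embed_dim)  # (seq_len=20, batch=4, embed_dim=512)
# Self-attention: query, key, and value are all the same tensor.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)   # torch.Size([20, 4, 512])
print(attn_weights.shape)  # torch.Size([4, 20, 20]), averaged over heads
```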
Thanks @maxdel, we have put together a PR for Transformer in PyTorch. Feel free to comment.
The transformer module is built on torch.nn.MultiheadAttention. We plan to improve the efficiency of this operator in the near future (cc @cpuhrsch).
Hey all, please let us know how we can help you guys deduplicate some of your work and better support this library overall!
We're eager to provide a good set of tools or a full implementation of what you need for your Transformers.
Also, if you have more general feature requests, we'd be eager to hear about them as well.
@cpuhrsch Need some help figuring out the decoder abstraction: https://github.com/allenai/allennlp/pull/2517.
The new version of PyTorch includes an nn.Transformer module.
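A minimal sketch of what using the full encoder-decoder nn.Transformer module looks like (sequence-first shapes; the tensors and sizes here are only illustrative):

```python
import torch
import torch.nn as nn

# Minimal sketch of the nn.Transformer module (encoder-decoder).
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 32, 512)  # (src_len, batch, d_model)
tgt = torch.randn(20, 32, 512)  # (tgt_len, batch, d_model)
# Causal mask so each target position only attends to earlier positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([20, 32, 512])
```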
Hello everyone, I just want to check whether we can provide any help applying the nn.Transformer module. We want to better support AllenNLP and avoid duplicate work. We could implement more features if you feel that would better serve this repo.
Updated description to reflect that we don't intend to maintain our own transformer implementation anymore.
If you find any problems or difficulties when migrating to the PyTorch transformer/multiheadattention, please ping me (@zhangguanheng66) on the PyTorch GitHub. We are happy to provide some hands-on help.
@zhangguanheng66 fantastic and thanks for reaching back out. I'm pretty sure we'll be doing this in the next two months.
What we should do: remove all of our internal implementations of multi-headed attention / self attention, in favor of huggingface and pytorch implementations. Some of our old models depend on this code, though. As we split out those models into their own repos, our internal implementations should go with them, when it makes sense. If we have any implementations that are not used in one of the models that we are putting into their own repo, they should just be removed.
We opened an issue to keep a record of change requests for MHA and transformer. If you have implementations that are worth landing in the PyTorch repo, feel free to send requests there: https://github.com/pytorch/pytorch/issues/32590
Notes to self:

- BidirectionalLanguageModelTransformer should go with the model.
- StackedSelfAttentionEncoder can possibly be transparently replaced with a PyTorch implementation if we're careful with the names of the parameters? (See the parameter-remapping sketch below.)

I started adding the transformer implementation from PyTorch here: #3898

This is done already.
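For the "careful with the names of the parameters" point above, the idea would be to remap state-dict keys when loading old weights into the PyTorch module. This is only a hypothetical sketch: the old key names below are invented, and the real mapping would have to be read off both implementations.

```python
import torch
import torch.nn as nn

def remap_old_keys(old_state_dict, key_map):
    """Return a new state dict with keys renamed according to key_map."""
    return {key_map.get(k, k): v for k, v in old_state_dict.items()}

# Target module: a plain torch.nn.TransformerEncoder.
new_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)

# Example mapping; the left-hand keys are made up for illustration.
key_map = {
    "layers.0.attention.in_proj_weight": "layers.0.self_attn.in_proj_weight",
    "layers.0.attention.in_proj_bias": "layers.0.self_attn.in_proj_bias",
    # ... one entry per parameter ...
}

# old_state_dict = torch.load("old_encoder.th")
# new_encoder.load_state_dict(remap_old_keys(old_state_dict, key_map))
```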
@dirkgr Nice job on the transformer wrapper.
I saw there is a seq2seq encoder MultiHeadSelfAttention. Are there any plans for it?
Despite the name, that module is not strictly a transformer as designed by Vaswani et al. Also, it was only used by QaNet, so I moved it there: https://github.com/allenai/allennlp-models/blob/master/allennlp_models/rc/qanet/multi_head_self_attention.py
If you need it, feel free to grab it from there and add it to your own code.
Thanks @dirkgr. I'm wondering if you plan to wrap nn.MultiheadAttention and replace the multi_head_self_attention function. To me, they are very similar.
nn.MultiheadAttention is not a seq2seq module. It implements the attention part of the transformer architecture. The PyTorch transformer code (and thus our wrapper of it) uses nn.MultiheadAttention internally.

If you need it in your own forward() functions, you can just call it! There is no need for AllenNLP to wrap it.
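To illustrate that point, here is a minimal sketch of a module that just holds an nn.MultiheadAttention and calls it in its own forward() (the class name and the residual/LayerNorm choice are only for illustration):

```python
import torch
import torch.nn as nn

class MySelfAttentiveBlock(nn.Module):
    """Toy example: call nn.MultiheadAttention directly, no wrapper needed."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor, padding_mask: torch.Tensor = None):
        # x: (seq_len, batch, embed_dim); padding_mask: (batch, seq_len), True = pad.
        attended, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        return self.norm(x + attended)  # residual connection + layer norm
```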
We currently have a handful of transformer implementations. At least the following:
https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/multi_head_self_attention.py
https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/bidirectional_language_model_transformer.py#L139
https://github.com/allenai/allennlp/blob/master/allennlp/modules/openai_transformer.py
We should remove these and rely on an external implementation, perhaps PyTorch's nn.Transformer. This will (very likely) be a breaking change for our existing models.
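Roughly, such a wrapper could look like the sketch below. This is not the code from #3898; the registry name and constructor arguments are invented, and it assumes an AllenNLP version where masks are bool tensors with True marking real tokens.

```python
import torch
from allennlp.modules.seq2seq_encoders.seq2seq_encoder import Seq2SeqEncoder


@Seq2SeqEncoder.register("my_pytorch_transformer")  # hypothetical registry name
class PytorchTransformerWrapper(Seq2SeqEncoder):
    """Thin Seq2SeqEncoder around torch.nn.TransformerEncoder."""

    def __init__(self, input_dim: int, num_layers: int = 6, num_heads: int = 8,
                 feedforward_dim: int = 2048, dropout: float = 0.1) -> None:
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(
            d_model=input_dim, nhead=num_heads,
            dim_feedforward=feedforward_dim, dropout=dropout)
        self._encoder = torch.nn.TransformerEncoder(layer, num_layers)
        self._input_dim = input_dim

    def get_input_dim(self) -> int:
        return self._input_dim

    def get_output_dim(self) -> int:
        return self._input_dim

    def is_bidirectional(self) -> bool:
        return True

    def forward(self, inputs: torch.Tensor, mask: torch.BoolTensor) -> torch.Tensor:
        # AllenNLP is batch-first; nn.TransformerEncoder expects (seq_len, batch, dim),
        # and its key_padding_mask uses True to mean "ignore this position".
        output = self._encoder(inputs.transpose(0, 1), src_key_padding_mask=~mask)
        return output.transpose(0, 1)
```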