Closed kashif closed 2 years ago
Oh, thanks for the report and sorry for the delay! I don't think we've ever tried that. I'll have a look; it should not be too big of an issue.
Yeah, no worries. I too think it's just some variable issue where you are assuming both sequence lengths are the same… just a bit harder for me to fix than for you, since you know where to look. Thanks!
Ah, I just realized that you're mixing different attention mechanisms, that's great! `favor` needs some work on the causal side, WIP (it's correct, I believe, but the causal path is a memory hog right now). In that case the crash is in `favor`; it looks like it does not support q/k having different lengths. I'll have a look!
@blefaudeux thanks! Yes even with:
```python
"multi_head_config_masked": {
    "num_heads": 4,
    "residual_dropout": 0,
    "attention": {
        "name": "nystrom",  # whatever attention mechanism
        "dropout": 0,
        "causal": True,
        "seq_len": SEQ_DEC,
    },
},
"multi_head_config_cross": {
    "num_heads": 4,
    "residual_dropout": 0,
    "attention": {
        "name": "nystrom",  # whatever attention mechanism
        "dropout": 0,
        "causal": True,
        "seq_len": SEQ_DEC,
    },
},
```
I get:
```
RuntimeError: The size of tensor a (128) must match the size of tensor b (64) at non-singleton dimension 2
```
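For illustration, here is a minimal sketch (sizes assumed from the report, not taken from the xformers internals) of why this broadcast fails: cross-attention scores have one row per decoder query and one column per encoder key, while a causal mask is square, built from a single `seq_len`.

```python
import torch

SEQ_ENC, SEQ_DEC = 128, 64  # encoder/decoder lengths from the report above

# Cross-attention scores: one row per decoder query, one column per encoder key
scores = torch.randn(1, SEQ_DEC, SEQ_ENC)

# A causal mask is square, built from a single seq_len (here SEQ_DEC)
causal_mask = torch.tril(torch.ones(SEQ_DEC, SEQ_DEC))

err = None
try:
    scores * causal_mask  # (1, 64, 128) vs (64, 64): last dims 128 != 64
except RuntimeError as e:
    err = e
print(err)  # The size of tensor a (128) must match the size of tensor b (64) at non-singleton dimension 2
```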
OK, that's not normal; it should work for all the attentions which are allow-listed here. I didn't have a lot of time to look into that, sorry, not forgotten though.
OK @kashif, I finally got the time to have a look. It has to do with how causality is handled (if you don't ask for causality then it's all good, as it should be, at least for attentions which support different k and q dimensions). I think it makes sense: if you ask for causality and k and q have different lengths, it's not obvious how they refer to one another. Is the shorter one related to the beginning, middle or end of the other one? I do think that we need to properly output an error about that, though; putting up a PR with a unit test right now.
On the user side, what you can do is pad SEQ_DEC where it makes sense (beginning, middle, end) so that they have the same length. Pulling in @fmassa and @dianaml0 for sanity checking.
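A minimal sketch of that padding, assuming the decoder sequence should be extended at the end to match the encoder length (sizes are hypothetical):

```python
import torch
import torch.nn.functional as F

SEQ_ENC, SEQ_DEC, DIM = 128, 64, 32  # hypothetical sizes

dec_inputs = torch.randn(1, SEQ_DEC, DIM)

# F.pad takes (left, right) pairs starting from the last dimension:
# (0, 0) leaves the feature dim alone, (0, SEQ_ENC - SEQ_DEC) pads the
# sequence dim at the end so both streams share one length.
padded = F.pad(dec_inputs, (0, 0, 0, SEQ_ENC - SEQ_DEC))
print(padded.shape)  # torch.Size([1, 128, 32])
```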
Thanks, @blefaudeux, for looking into this. So I am planning to use this in the context of time series forecasting, where the encoder takes a sequence (of arbitrary length) and outputs the memory for the decoder, which then takes a sequence length described by the problem and predicts the next value (so I need a causal mask in the decoder of size SEQ_DEC x SEQ_DEC). So I am also not sure, without working through the code, which one should be used... haha, hopefully that makes sense?
> thanks, @blefaudeux for looking into this... so I am planning to use this in the context of time series forecasting... where the encoder takes a sequence (of arbitrary length)
The devil is there, basically: "takes a sequence of arbitrary length". Causal needs a 1:1 mapping, because you want to say "don't look into the future", and you have to know what the future is. If you have `[. . . . . .]` and `[. . .]`, where is the future in the second sequence, when related to the first one? If you know that the sequence beginnings are aligned, then you can pad the second one (see `torch.nn.functional.pad` as linked above) and dump the corresponding output (so it becomes `[. . . . . .]` and `[. . . x x x]`, and you dump the end of the layer output); if it's aligned in another way, you can pad correspondingly.
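A toy sketch of that pad-then-dump pattern, using `torch.nn.MultiheadAttention` as a stand-in layer (the real code would use the xformers blocks; sizes match the dot diagrams above):

```python
import torch
import torch.nn.functional as F

SEQ_LONG, SEQ_SHORT, DIM = 6, 3, 8  # toy sizes: [. . . . . .] and [. . .]

attn = torch.nn.MultiheadAttention(DIM, num_heads=2, batch_first=True)  # stand-in layer

long_seq = torch.randn(1, SEQ_LONG, DIM)
short_seq = torch.randn(1, SEQ_SHORT, DIM)

# Beginnings aligned: pad the short sequence at the end -> [. . . x x x]
short_padded = F.pad(short_seq, (0, 0, 0, SEQ_LONG - SEQ_SHORT))

# With equal lengths, a square causal mask is well defined (True = masked)
mask = torch.triu(torch.ones(SEQ_LONG, SEQ_LONG, dtype=torch.bool), diagonal=1)
out, _ = attn(short_padded, long_seq, long_seq, attn_mask=mask)

# Dump the end of the layer output: only the first SEQ_SHORT positions are real
out = out[:, :SEQ_SHORT]
print(out.shape)  # torch.Size([1, 3, 8])
```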
From the lib perspective, it cannot know how the two relate to one another in a causal case with different sequence lengths, so I think it's better to hard fail, but we should catch and explain that properly. In the encoder/decoder setup, the cross MHA will mix the encoder and decoder sequences, and that's where the causality request will fail. You can try not asking for causality there, but to my understanding this will be wrong and the model will cheat (overfit).
> and outputs the memory for the decoder which then takes a sequence length described by the problem and predicts the next value (so I need a causal mask in the decoder of size SEQ_DEC x SEQ_DEC). So I am also not sure without working through the code which one should be used... haha hopefully that makes sense?
I'm missing a lot of context, but from a distance it also looks like GPT, if the goal is to predict the next item in an autoregressive way; we have this example if that helps.
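For context, "predict the next item in an autoregressive way" boils down to a feed-back loop like the following sketch (the model here is a hypothetical stand-in, not the xformers example mentioned above):

```python
import torch

# Hypothetical stand-in model: maps (B, T, 1) history -> (B, T, 1) per-step predictions
model = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))

history = torch.randn(1, 10, 1)  # observed time series
preds = []
with torch.no_grad():
    for _ in range(5):  # forecast 5 steps, feeding each prediction back in
        next_val = model(history)[:, -1:, :]  # last position predicts the next value
        preds.append(next_val)
        history = torch.cat([history, next_val], dim=1)

forecast = torch.cat(preds, dim=1)
print(forecast.shape)  # torch.Size([1, 5, 1])
```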
Thoughts, @SeanNaren and @tchaton, my API experts? Does the above make sense?
thanks @blefaudeux I will go over your thoughts and think about it too...
Spoke to @blefaudeux offline and agree with him on this; I think the correct approach would be to instantiate the two separate attention blocks and within the forward function, pad to the appropriate sequence length for the second attention block!
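A sketch of that suggestion, again with `torch.nn.MultiheadAttention` standing in for the xformers blocks (the class and all sizes are hypothetical): two separate attention blocks, with the padding done inside `forward` before the second one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Hypothetical sketch: causal self-attention, then cross-attention with padding."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, dec: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        seq_dec = dec.shape[1]
        # A square causal mask is well defined for self-attention (SEQ_DEC x SEQ_DEC)
        mask = torch.triu(torch.ones(seq_dec, seq_dec, dtype=torch.bool), diagonal=1)
        x, _ = self.self_attn(dec, dec, dec, attn_mask=mask)

        # Pad queries to the encoder length for the second block, then drop the tail
        x = F.pad(x, (0, 0, 0, memory.shape[1] - seq_dec))
        x, _ = self.cross_attn(x, memory, memory)  # no causal mask on the cross step
        return x[:, :seq_dec]

block = DecoderBlock(dim=8, heads=2)
out = block(torch.randn(1, 3, 8), torch.randn(1, 6, 8))
print(out.shape)  # torch.Size([1, 3, 8])
```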
Is that ok to close, @kashif? Open to discussing this more.
Thanks
🐛 Bug
I get an error when the sequence lengths to the encoder and decoder are different, e.g. in the code snippet below:
Command
Expected behavior
however, I get: