dingo-actual / infini-transformer

PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" (https://arxiv.org/abs/2404.07143)
MIT License

Causal Masking. #4

Closed. vmarinowski closed this issue 4 months ago.

vmarinowski commented 4 months ago

Where exactly should causal masking be applied?

dingo-actual commented 4 months ago

I'm working on adding that now 😄

dingo-actual commented 4 months ago

Just added. Just be aware the masking isn't fully "causal", because the compressive memory updates allow tokens within a segment to attend to future tokens within that same segment. Without changing the compressive memory formulation, it's not something I can get around.
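For context, the within-segment mask being discussed is the standard upper-triangular causal mask over local dot-product attention. A minimal PyTorch sketch (illustrative names and shapes, not the repo's actual API):

```python
import torch

seg_len = 4  # hypothetical segment length
d_key = 8    # hypothetical key dimension

q = torch.randn(seg_len, d_key)
k = torch.randn(seg_len, d_key)

# Position i may only attend to positions <= i within the segment.
causal_mask = torch.triu(torch.ones(seg_len, seg_len, dtype=torch.bool), diagonal=1)

scores = (q @ k.transpose(-2, -1)) / d_key**0.5
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
```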

GaoXinJian-USTC commented 4 months ago

> Just added. Just be aware the masking isn't fully "causal", because the compressive memory updates allow tokens within a segment to attend to future tokens within that same segment. Without changing the compressive memory formulation, it's not something I can get around.

If we just used fully causal self-attention without the compressive memory, would there be a significant decrease in performance?

dingo-actual commented 4 months ago

Good news: I was going back over the math and the code while working on another issue yesterday and realized that the compressive memory doesn't break causality. Its contribution to the final attention is computed using only information from previous segments' tokens, so if you enable causal masking, the model is indeed fully causal. I meant to say something at the time, but got sidetracked.
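For anyone following along, here is a minimal sketch of the order of operations that makes this work. It follows the paper's linear-attention formulation (retrieve from the memory accumulated over earlier segments, do masked local attention, then update the memory); the function name, shapes, and simplifications are illustrative and not the repo's actual code:

```python
import torch
import torch.nn.functional as F


def infini_attention_segment(q, k, v, mem, z, beta, scale):
    """Process one segment in an Infini-attention-style layer (illustrative sketch).

    q, k: (seq_len, d_key) projections for the current segment
    v:    (seq_len, d_value)
    mem:  (d_key, d_value) compressive memory accumulated from *earlier* segments
    z:    (d_key, 1) normalization term accumulated from *earlier* segments
    beta: learned scalar gate
    """
    sigma_q = F.elu(q) + 1.0
    sigma_k = F.elu(k) + 1.0

    # 1) Memory retrieval only reads mem/z from earlier segments, so no token
    #    sees information from its own segment or from the future here.
    a_mem = (sigma_q @ mem) / (sigma_q @ z + 1e-8)

    # 2) Local dot-product attention uses a standard causal mask, keeping
    #    within-segment attention causal.
    seq_len = q.size(0)
    scores = (q @ k.transpose(-2, -1)) * scale
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    a_local = torch.softmax(scores, dim=-1) @ v

    # 3) The memory update happens only *after* retrieval, so the current
    #    segment's keys/values never feed back into its own outputs.
    mem = mem + sigma_k.transpose(-2, -1) @ v
    z = z + sigma_k.sum(dim=0, keepdim=True).transpose(-2, -1)

    # Combine memory and local attention with a learned gate.
    gate = torch.sigmoid(beta)
    out = gate * a_mem + (1.0 - gate) * a_local
    return out, mem, z
```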