OswaldHe / HMT-pytorch

Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing"

Positional encoding of memory tokens for llama-type models with rope #11

Closed · ifed-ucsd closed this issue 1 week ago

ifed-ucsd commented 1 week ago

I'm curious whether you had to do anything special with the positional encoding of the read/write memories when using RoPE. In a different RMT implementation, https://github.com/lucidrains/recurrent-memory-transformer-pytorch/blob/35cd18deeb7965491873fcba4a15d581106eae39/recurrent_memory_transformer_pytorch/recurrent_memory_transformer.py#L409, the read/write tokens are assigned position 0 and the segment tokens' starting position is pushed out to 10000.
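
If I'm reading it right, the position assignment there looks roughly like this (a rough sketch of what I understood from the code, not the actual implementation; the sizes are made up):

```python
import torch

# Rough sketch of the described position assignment (illustrative only,
# not the code from the linked repo; sizes are made up).
num_read_mem, seg_len, num_write_mem = 8, 512, 8
seg_offset = 10000  # segment tokens start here instead of right after the read memories

read_pos = torch.zeros(num_read_mem, dtype=torch.long)          # read memories all at position 0
seg_pos = seg_offset + torch.arange(seg_len, dtype=torch.long)  # 10000, 10001, ...
write_pos = torch.zeros(num_write_mem, dtype=torch.long)        # write memories also at position 0

positions = torch.cat([read_pos, seg_pos, write_pos])
# These position ids would then be fed to the rotary embedding in place of
# the default torch.arange(total_len).
```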

OswaldHe commented 1 week ago

We didn't modify the backbone model or inject any extra manipulation into it. As far as I can tell, the official implementation doesn't do that either, nor does the original RMT paper mention any RoPE-specific technique. Is there an explanation of why they push the segment tokens out to 10000 in that alternate implementation?

ifed-ucsd commented 1 week ago

Thanks. In your implementation, are the write tokens always at the same position in the sequence? Say we have [read mem][segment][write mem]. Is [segment] always the same length?

> Is there an explanation of why they push the segment tokens out to 10000 in that alternate implementation?

I didn't see any explanation. I filed an issue here https://github.com/lucidrains/recurrent-memory-transformer-pytorch/issues/24, but haven't received a response.

OswaldHe commented 1 week ago

Yes, the segment length is fixed.
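
Since we don't touch the backbone, RoPE just sees the default sequential position ids over the concatenated input, so with a fixed segment length the write memories always land on the same positions. A minimal sketch, assuming the backbone's default position ids (sizes are illustrative):

```python
import torch

# Minimal sketch, assuming the backbone's default sequential position ids
# (no modification inside the model); sizes are illustrative.
num_read_mem, seg_len, num_write_mem = 8, 512, 8
total_len = num_read_mem + seg_len + num_write_mem

positions = torch.arange(total_len)             # 0, 1, ..., total_len - 1
write_pos = positions[num_read_mem + seg_len:]  # the last num_write_mem positions
# Because seg_len is fixed, write_pos is identical for every segment, so the
# write memories always receive the same rotary positions.
```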