-
In the MultiScaleRetention class, it is mentioned that 's_n_1s' has dimensions (batch_size, heads, head_size, head_size), while in SimpleRetention, 's_n_1' is defined as 's_n_1s[i]'. However, you mentione…
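To make the shape question concrete, here is a small sketch of the two layouts under discussion (illustrative only, not the repository's code): with a stacked (batch_size, heads, head_size, head_size) tensor, s_n_1s[i] indexes the batch dimension, whereas a head-indexed list gives the (batch_size, head_size, head_size) state that a single-head retention would consume.
```python
import torch

batch_size, heads, head_size = 2, 4, 16

# Layout as documented: one stacked tensor with a heads dimension.
s_n_1s_tensor = torch.zeros(batch_size, heads, head_size, head_size)
per_head = s_n_1s_tensor[:, 1]      # head 1 -> (batch_size, head_size, head_size)

# Layout under which s_n_1s[i] is directly a per-head state.
s_n_1s_list = [torch.zeros(batch_size, head_size, head_size) for _ in range(heads)]
s_n_1 = s_n_1s_list[1]              # also (batch_size, head_size, head_size)

print(per_head.shape, s_n_1.shape)
```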
Qiu30 updated 10 months ago
-
I got two errors:
(1)
from torchscale.architecture.config import EncoderDecoderConfig
from torchscale.architecture.encoder_decoder import EncoderDecoder
config = EncoderDecoderConfig(vocab_…
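For comparison, here is the construction I believe the README intends, using the modules imported above (the vocabulary size here is a placeholder):
```python
from torchscale.architecture.config import EncoderDecoderConfig
from torchscale.architecture.encoder_decoder import EncoderDecoder

# Placeholder vocabulary size; the remaining hyperparameters fall back to the config defaults.
config = EncoderDecoderConfig(vocab_size=64000)
encdec = EncoderDecoder(config)
print(encdec)
```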
-
The latest release 0.2.0 is from March (see https://pypi.org/project/torchscale/#history), predating the introduction of RetNet in this repo. As such, the README is misleading, since it is not possible …
-
Thank you for this amazing work.
I'm trying to include your work as a drop-in replacement for other SSMs such as Mamba and RWKV. Note that I train significantly smaller models (from 20M to 60M p…
-
2511 if has_torch_function_variadic(input, weight, bias):
2512     return handle_torch_function(
2513         layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=b…
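For reference, a minimal call that satisfies layer_norm's shape expectations (the dimensions below are placeholders); a mismatch between normalized_shape, the input's trailing dimensions, and the weight/bias shapes is a common cause of errors raised from this code path:
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 5, 16)        # (batch, sequence, hidden)
w = torch.ones(16)               # weight must match normalized_shape
b = torch.zeros(16)              # bias must match normalized_shape
y = F.layer_norm(x, normalized_shape=(16,), weight=w, bias=b)
print(y.shape)                   # torch.Size([2, 5, 16])
```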
Qiu30 updated 12 months ago
-
Hey,
The parallel form of Retention returns a tuple of two values, but in your README, one of the examples shows the parallel retention's output as just one tensor. So I am confused …
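For illustration, a minimal sketch of what I mean (my own code, not the repository's): a parallel retention that returns both the output and a final state, where a README-style single-tensor example would correspond to taking only the first element of the tuple.
```python
import torch

def parallel_retention(q, k, v, gamma=0.9):
    # Parallel retention for one head: O = (Q K^T ⊙ D) V, with D the causal decay mask.
    T = q.shape[0]
    n = torch.arange(T)
    decay = (gamma ** (n[:, None] - n[None, :])) * (n[:, None] >= n[None, :])
    out = (q @ k.T * decay) @ v
    # Final state that a recurrent continuation could start from.
    state = ((gamma ** (T - 1 - n)).unsqueeze(1) * k).T @ v
    return out, state

q, k, v = torch.randn(3, 8, 4)
out, state = parallel_retention(q, k, v)    # tuple: unpack both values
out_only = parallel_retention(q, k, v)[0]   # treating the result as a single tensor
```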
-
I'm using the RetNet base config with the following TrainingArguments:
args = TrainingArguments(
    output_dir="/content/retnet-xsum",
    per_device_train_batch_size=1,
    per_device_eval_bat…
-
In the paper, the authors mention that the initialization follows DeepNet, but the code looks somewhat different. Why is there a mismatch?
```
def reset_parameters(self):
    nn.init.xavier_u…
```
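For context, DeepNet-style initialization scales selected projections by a depth-dependent gain β, while the snippet above appears to use Xavier initialization with fixed gains. A minimal sketch of the two variants (the gain values and the β formula below are my own assumptions, not the repository's exact constants):
```python
import torch.nn as nn

def deepnet_style_init(linear: nn.Linear, num_layers: int) -> None:
    # DeepNet prescribes Xavier init with gain beta for the value/output/FFN projections;
    # for a decoder-only stack, beta = (8 * num_layers) ** -0.25 (as I recall).
    beta = (8 * num_layers) ** -0.25
    nn.init.xavier_normal_(linear.weight, gain=beta)

def fixed_gain_init(linear: nn.Linear, gain: float = 2 ** -2.5) -> None:
    # Fixed-gain Xavier init, roughly what the truncated snippet above seems to do;
    # the default gain here is an illustrative assumption.
    nn.init.xavier_uniform_(linear.weight, gain=gain)

proj = nn.Linear(512, 512, bias=False)
deepnet_style_init(proj, num_layers=24)   # depth-dependent gain
fixed_gain_init(proj)                     # fixed gain, independent of depth
```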
-
Hi, in your RetNet paper (Table 4), the naive Transformer 1.3B model costs more GPU memory than the 2.7B model. Could you please explain why?
-
Hello authors,
I'm really happy to see this great work!
I have one question or request about the consistency of the outputs across the forward modes.
I have been comparing the three outputs using the below s…
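A minimal single-head sketch of such a comparison, assuming the standard RetNet formulation (illustrative code, not the repository's implementation): the parallel, recurrent, and chunkwise forms should agree up to numerical tolerance.
```python
import torch

torch.manual_seed(0)

T, d, gamma = 8, 4, 0.9           # sequence length, head dimension, decay
Q, K, V = torch.randn(3, T, d)    # one head, no batch dimension, for clarity

# Parallel form: O = (Q K^T ⊙ D) V, with D[n, m] = gamma^(n-m) for n >= m, else 0.
idx = torch.arange(T)
D = (gamma ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n,  O_n = Q_n S_n.
S = torch.zeros(d, d)
out_recurrent = torch.zeros(T, d)
for t in range(T):
    S = gamma * S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)
    out_recurrent[t] = Q[t] @ S

# Chunkwise form: parallel inside each chunk, recurrent state carried across chunks.
B = 4                             # chunk size (assumed to divide T here)
j = torch.arange(B)
D_chunk = (gamma ** (j[:, None] - j[None, :])) * (j[:, None] >= j[None, :])
S = torch.zeros(d, d)
out_chunkwise = torch.zeros(T, d)
for start in range(0, T, B):
    q, k, v = Q[start:start + B], K[start:start + B], V[start:start + B]
    inner = (q @ k.T * D_chunk) @ v                      # within-chunk contribution
    cross = (gamma ** (j + 1)).unsqueeze(1) * (q @ S)    # contribution of the carried state
    out_chunkwise[start:start + B] = inner + cross
    S = gamma ** B * S + ((gamma ** (B - 1 - j)).unsqueeze(1) * k).T @ v

print(torch.allclose(out_parallel, out_recurrent, atol=1e-5))   # expect True
print(torch.allclose(out_parallel, out_chunkwise, atol=1e-5))   # expect True
```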