-
Dear authors,
Nice work! I have a few questions regarding the normalization in the RetNet implementation and would like to hear your thoughts on them:
[Here](https://github.com/microso…
-
Thank you for your work. I did some testing with your implementation, and it is robust and works pretty well!
However, for non-auto-regressive applications, the throughput is pretty much worse tha…
-
Hi,
Thank you for your great work!
When I use your example code to compare inference latency with a Transformer-based LLM, the result does not match what the paper reports (15.6X). Could you please …
-
RetNet was recently introduced, and I think it should be implemented in diffusers. You probably already know about it, but it may be helpful for faster inference.
![Screenshot 2023-07-22 193832](https://gith…
-
Hey,
So I'm training the model using the Hugging Face Trainer. If the trainer exits for any reason and I resume from a checkpoint, the model no longer learns. I'm using train.py as is and exe…
-
Thank you for your implementation, but I have encountered a bug when using the code. There is a major problem in the function `_build_decay_mask` where the last element of `decay_gammas` is set to 1. …
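
For context, here is a minimal sketch of the decay mask as the RetNet paper defines it, with D[n][m] = γ^(n−m) for n ≥ m and 0 otherwise, and per-head rates γ_h = 1 − 2^(−5−h). It does not reproduce the repository's `_build_decay_mask`; the helper names `build_decay_mask` and `head_gammas` are hypothetical and plain Python lists stand in for tensors.

```python
def build_decay_mask(seq_len, gamma):
    """Causal decay mask: D[n][m] = gamma ** (n - m) if n >= m, else 0.0.

    Hypothetical helper for illustration; not the repository's function.
    """
    return [
        [gamma ** (n - m) if n >= m else 0.0 for m in range(seq_len)]
        for n in range(seq_len)
    ]


def head_gammas(num_heads):
    """Per-head decay rates from the paper: gamma_h = 1 - 2 ** (-5 - h)."""
    return [1.0 - 2.0 ** (-5 - h) for h in range(num_heads)]


if __name__ == "__main__":
    # Every paper-style gamma is strictly below 1, so every head decays.
    print(head_gammas(4))
    # With gamma == 1.0 the mask degenerates to a plain causal mask of ones.
    for row in build_decay_mask(3, 1.0):
        print(row)
```

Under this definition, a head whose γ equals 1 applies no decay at all: its mask is just the lower-triangular matrix of ones, so all earlier positions are weighted equally.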
-
I wonder why we need twice the dimensions for $\mathbf{W}_V$.
-
```
python train.py \
/home/sc0111/ai/torchscale/wikitext-103/wikitextdone \
--num-workers 4 \
--arch retnet_base \
--task language_modeling \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--ma…
```
-
Hi, when I use RetNet's parallel mode to train, it is very slow. I checked the GPU memory usage and it is very low. What's going on?
Thank you!
-
You may want to take a look at https://github.com/microsoft/torchscale/commit/bf65397b26469ac9c24d83a9b779b285c1ec640b