A simple but robust PyTorch implementation of RetNet from "Retentive Network: A Successor to Transformer for Large Language Models" (https://arxiv.org/pdf/2307.08621.pdf)
Hi again (:
I've found a small problem in the current implementation of the initialization of the `RetNetDecoder` class. Specifically, to build a multi-layered model, this class uses `deepcopy` to copy the single `RetNetDecoderLayer` object it receives as input. This copy leads to the following problems:
- The parameters of the layers are not i.i.d.
- Consequently, the "lottery ticket hypothesis" does not apply (at least there is no established evidence for this phenomenon in the non-i.i.d. case).
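To make the failure mode concrete, here is a minimal sketch using a toy stand-in class (not the actual `RetNetDecoderLayer`) showing how `deepcopy` replicates the prototype's initial parameters instead of sampling fresh ones:

```python
import copy
import random

class DemoLayer:
    """Hypothetical stand-in for RetNetDecoderLayer: parameters drawn at init."""
    def __init__(self, dim):
        self.weight = [random.gauss(0.0, 1.0) for _ in range(dim)]

# deepcopy-based stacking (the current approach): every copy shares
# the prototype's initial draw, so all layers start out identical.
proto = DemoLayer(4)
copied = [copy.deepcopy(proto) for _ in range(3)]
assert copied[0].weight == copied[1].weight  # identical, not i.i.d.

# independently constructed layers: each gets its own fresh draw.
fresh = [DemoLayer(4) for _ in range(3)]
assert fresh[0].weight != fresh[1].weight  # independent samples
```

The same effect occurs with real `nn.Module` layers: `deepcopy` clones the tensors byte-for-byte, so every layer in the stack begins training from the same point in parameter space.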
It's not a very serious issue, but I think it's worth fixing. I would be happy to implement a solution. I wanted to discuss which design would be preferred here:
- One possible solution could be to change `RetNetDecoder.__init__` to take a list of layer objects (initialized externally).
- Alternatively, it is also possible to store the arguments of the layer as properties and initialize the new layers based on the properties of the given layer.
- Another possible solution could be to define a configuration object with which a `RetNetDecoderLayer` object is initialized, and pass an instance of it to `RetNetDecoder.__init__` instead of an actual layer object.
There may be other solutions as well. Which one do you think would be ideal here? Do you have other solution ideas?
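For illustration, the config-object option might look roughly like the following sketch. All names here are hypothetical (they are not the repo's actual API), and a toy layer stands in for `RetNetDecoderLayer`:

```python
import random

class DecoderLayerConfig:
    """Hypothetical config object holding the layer's constructor arguments."""
    def __init__(self, dim):
        self.dim = dim

class DemoLayer:
    """Stand-in for RetNetDecoderLayer: draws fresh parameters at init."""
    def __init__(self, config):
        self.weight = [random.gauss(0.0, 1.0) for _ in range(config.dim)]

class DemoDecoder:
    """Config-based stacking: each layer is constructed independently,
    so initial parameters are i.i.d. across layers."""
    def __init__(self, config, num_layers):
        self.layers = [DemoLayer(config) for _ in range(num_layers)]

decoder = DemoDecoder(DecoderLayerConfig(dim=4), num_layers=3)
# Each layer now has its own independent initial parameters.
assert decoder.layers[0].weight != decoder.layers[1].weight
```

In a real PyTorch implementation the list would be wrapped in `nn.ModuleList` so the parameters are registered; the key point is only that the constructor is called once per layer rather than copying a prototype.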
Thanks!