Closed: JL-er closed this issue 8 months ago
Hi, we didn't include the recurrent form of the model. If you need this feature we will consider adding it soon. Our code is chunkwise, i.e., it stores the hidden state of the last element of each contiguous chunk.
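To illustrate what "chunkwise" means here, below is a minimal sketch (hypothetical function and tensor names, single head, plain ungated linear attention for simplicity): the sequence is processed in contiguous chunks, and only the state after the last element of each chunk is kept.

```python
import torch

def chunkwise_final_state(k, v, chunk_size=16):
    """Sketch: process a sequence in contiguous chunks and carry only the
    hidden state after the last element of each chunk.
    k: (T, d_k), v: (T, d_v) -- single head, no gating, illustration only."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)                    # running state S_t
    for start in range(0, T, chunk_size):
        kc = k[start:start + chunk_size]         # (C, d_k)
        vc = v[start:start + chunk_size]         # (C, d_v)
        # add the whole chunk's contribution at the chunk boundary:
        # equivalent to sum_t k_t^T v_t over the chunk
        S = S + kc.transpose(0, 1) @ vc
    return S                                     # state after the final element
```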
Thanks. Because I need to run inference, I require the final hidden state. I would also like to know what memory_cache is.
I will add an interface to obtain the final hidden state soon.
Thank you so much
What kind of inference do you mean, autoregressive decoding or evaluating the PPL of a given sentence? For now, padding to a multiple of 16 doesn't seem very costly, so we haven't added an internal padding mechanism to GLA yet (it would require modifying the CUDA kernel, which is a bit of a hassle). The RetNet implementation does add a masking mechanism inside the Triton kernel, so it doesn't need padding.
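For reference, padding on the caller side is straightforward; here is a hedged sketch (right-padding the sequence dimension and slicing the output back are assumptions about how you call the model, not the library's API):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=16):
    """Right-pad the sequence dimension of x with shape (B, T, D)
    so that T becomes a multiple of `multiple`."""
    T = x.shape[1]
    pad = (-T) % multiple
    if pad:
        x = F.pad(x, (0, 0, 0, pad))   # (0, 0) pads D, (0, pad) pads T on the right
    return x, pad

# Usage sketch: pad before the chunkwise kernel, then drop the padded tail.
# x_padded, pad = pad_to_multiple(x, 16)
# y = model(x_padded)[:, :x.shape[1]]
```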
Both. For inference I'm currently not splitting into chunks. One more question: when I tested different chunk sizes, I found that smaller chunks run faster. Is that expected?
If the head dimension isn't very large, the I/O cost won't be large. In that case, smaller chunks give higher parallelism, so it's quite possible they run faster. But to get better quality the head dimension is usually fairly large, so smaller chunks can incur more I/O, and the gain from parallelism no longer outweighs the I/O cost. There is a tradeoff here.
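A rough back-of-the-envelope for this tradeoff (purely illustrative numbers, not measurements of the actual kernels): halving the chunk size doubles the number of chunks that can run in parallel, but it also doubles the number of (d_k x d_v) chunk-boundary states that have to be read and written.

```python
def chunk_tradeoff(T=2048, d_k=256, d_v=512, bytes_per_elem=2):
    """Estimate how chunk size trades parallelism against state I/O."""
    for C in (16, 32, 64, 128):
        n_chunks = T // C                                  # parallel work units per sequence
        state_io = n_chunks * d_k * d_v * bytes_per_elem   # bytes moved for boundary states
        print(f"chunk={C:4d}  chunks={n_chunks:4d}  state I/O ~ {state_io / 2**20:.1f} MiB")

chunk_tradeoff()
```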
Thanks. With bsz 8, chunk 16 being faster is probably because the parallelism is otherwise insufficient; with a larger bsz, 128 should be a better fit and faster as well.
I wanted to get S_t, but I couldn't find where exactly it is computed. I returned memory_cache, but it doesn't seem to be equal to S_t.
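One way to check, while the interface is still missing, is to recompute the final state with the explicit sequential recurrence and compare it against whatever memory_cache holds. The sketch below assumes a standard gated-linear-attention recurrence S_t = diag(alpha_t) S_{t-1} + k_t^T v_t with per-step decay gates; the function and tensor names are hypothetical, not from the repo.

```python
import torch

def final_state_reference(k, v, alpha):
    """Recompute the final state S_T with the sequential gated recurrence,
    for comparison against memory_cache.
    k: (T, d_k), v: (T, d_v), alpha: (T, d_k) decay gates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    for t in range(T):
        # S_t = diag(alpha_t) @ S_{t-1} + k_t^T v_t
        S = alpha[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
    return S
```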