Equationliu / Kangaroo

Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
https://arxiv.org/abs/2404.18911

a question #4

Closed · cool-xiang closed this 4 months ago

cool-xiang commented 4 months ago

Hi author, during the training phase, does it require a large amount of disk space to save the intermediate hidden states in ckpt format? On average, a single training sample requires about 10 MB, so a complete training dataset may take several terabytes. If the device's storage is insufficient, do you have any suggestions? Thank you very much!
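For context, a rough back-of-the-envelope estimate of where numbers like these come from (a sketch only; the sequence length, hidden size, and dataset size below are illustrative assumptions, not values from the repo):

```python
# Hypothetical storage estimate for cached hidden states (illustrative values).
seq_len = 1024         # tokens per training sample (assumed)
hidden_dim = 5120      # hidden size of a 13B-class model (assumed)
bytes_per_value = 2    # fp16

per_sample_mb = seq_len * hidden_dim * bytes_per_value / 1024**2
print(f"{per_sample_mb:.1f} MB per sample")       # ~10 MB

num_samples = 300_000  # assumed dataset size
total_tb = per_sample_mb * num_samples / 1024**2
print(f"{total_tb:.2f} TB for the full dataset")  # ~2.86 TB
```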

Equationliu commented 4 months ago

Directly storing the high-dimensional hidden states does indeed occupy a large amount of disk space, but it makes training more efficient. If storage is really limited, you could remove the offline caching mechanism and instead obtain the hidden states from the original large model through a forward pass each time; however, this will obviously slow down training.
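A minimal sketch of this online alternative, assuming a Hugging Face-style causal LM that supports `output_hidden_states` (names like `exit_layer` and `adapter` are illustrative placeholders, not the repo's actual API):

```python
import torch

@torch.no_grad()
def get_hidden_states(large_model, input_ids, exit_layer):
    # Recompute hidden states on the fly instead of loading cached ckpt files.
    # In the Hugging Face transformers convention, outputs.hidden_states[i]
    # is the output of layer i, with index 0 being the embedding layer.
    outputs = large_model(input_ids, output_hidden_states=True)
    shallow = outputs.hidden_states[exit_layer]  # features at the early exit
    final = outputs.hidden_states[-1]            # target from the full model
    return shallow, final

# Training loop sketch: trades disk for compute -- every batch now pays for
# one extra forward pass through the frozen large model.
# shallow, final = get_hidden_states(large_model, batch["input_ids"], exit_layer=2)
# loss = train_step(adapter, shallow, final)
```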

cool-xiang commented 4 months ago

OK, thank you very much!