Closed: cool-xiang closed this issue 4 months ago
Hi author, during the training phase, does saving the intermediate hidden states in ckpt format require a large amount of disk space? On average a single training example takes about 10 MB, so a complete training dataset can reach several terabytes. If the device's storage is insufficient, do you have any suggestions? Thank you very much!

Directly storing the high-dimensional hidden states does occupy a large amount of disk space, but it makes training more efficient. For a rough sense of scale, a 1,024-token sequence with hidden size 4,096 stored in fp16 is already about 8 MB, which is in line with the ~10 MB per example you report. If storage is really limited, you could remove the offline caching mechanism and instead obtain the hidden states from the original large model through a forward pass each time; this will obviously slow down training.
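For reference, here is a minimal sketch of what the on-the-fly alternative could look like, assuming HuggingFace-style models that accept `output_hidden_states`; the function and batch field names are placeholders for illustration, not this repo's actual API:

```python
import torch

@torch.no_grad()
def get_teacher_hidden_states(teacher, input_ids, attention_mask):
    """Recompute the teacher's hidden states with a forward pass,
    instead of loading them from a cached ckpt file."""
    outputs = teacher(
        input_ids=input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,  # expose every layer's hidden states
    )
    # outputs.hidden_states is a tuple of (batch, seq_len, hidden_dim)
    # tensors, one per layer; take whichever layer the objective needs.
    return outputs.hidden_states[-1]

def train_step(student, teacher, batch, loss_fn, optimizer):
    # One teacher forward per batch replaces terabytes of cached states,
    # at the cost of extra compute on every step.
    target_h = get_teacher_hidden_states(
        teacher, batch["input_ids"], batch["attention_mask"]
    )
    student_out = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        output_hidden_states=True,
    )
    loss = loss_fn(student_out.hidden_states[-1], target_h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the teacher beforehand (`teacher.eval()` and `teacher.requires_grad_(False)`) keeps the overhead to roughly one extra forward pass per batch.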
OK, thank you very much!