heraldiclily opened 7 months ago
Do you have any solution? I'm running into the same problem.
I seem to have solved this problem by setting safe_serialization = False on line 99 of python3.10/site-packages/accelerate/checkpointing.py; saving the model then falls back to the torch.save() method by default.
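For context, a minimal sketch (not from the issue) of why the two save paths behave differently for tied weights; the tensor names and shapes here are made up:

```python
import torch
from safetensors.torch import save_file

wte = torch.randn(10, 4)
state = {"wte.weight": wte, "lm_head.weight": wte}  # tied weights: same underlying storage

torch.save(state, "model.bin")  # pickle preserves shared storage without complaint

try:
    save_file(state, "model.safetensors")
except RuntimeError as err:
    # safetensors refuses tensors that share memory
    # (exception type/message may vary by safetensors version)
    print(err)
```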
🐛 Describe the bug
I'm facing this shared-memory issue when training LLMs using TRLX:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.transformer.wte.weight', 'base_model.lm_head.weight'}]. A potential way to correctly save your model is to use save_model.
Most forums recommend the configuration below to fix the issue for non-RL applications:
"save_safetensors=false"
Unfortunately, the TRLX library doesn't expose this argument, which belongs to the Transformers module. Is there an equivalent way to set it in order to resolve the "tensors share memory" problem?
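One hedged workaround sketch, assuming the trainer returned by trlx.train wraps a regular transformers model (the trainer.model.base_model attribute path below is a guess and may differ across TRLX versions):

```python
import trlx

# Existing TRLX training call (ILQL-style example as in the TRLX README).
trainer = trlx.train(
    "gpt2",
    samples=["dolphins", "geese"],
    rewards=[1.0, 100.0],
)

# Save the wrapped Hugging Face model manually with safetensors disabled,
# bypassing the checkpoint path that trips over shared tensors.
hf_model = trainer.model.base_model  # hypothetical attribute path; inspect your trainer
hf_model.save_pretrained("checkpoint", safe_serialization=False)
```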
Which trlX version are you using?
0.7.0
Additional system and package information
Linux 20.04, Python 3.11.8, PyTorch 2.2.2