CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Issue of tensors sharing memory #591

Open heraldiclily opened 7 months ago

heraldiclily commented 7 months ago

🐛 Describe the bug

I'm facing this shared-memory warning when training LLMs with trlX:

Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.transformer.wte.weight', 'base_model.lm_head.weight'}]. A potential way to correctly save your model is to use save_model.
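For context, this warning comes from safetensors refusing to serialize tied parameters that point at the same underlying storage. The small check below illustrates the tie, using GPT-2 as a stand-in model (the actual model used in this report isn't stated, so GPT-2 is only an assumption for illustration):

```python
from transformers import AutoModelForCausalLM

# Load a small causal LM; in GPT-2 the input embedding (transformer.wte.weight)
# and the output head (lm_head.weight) are tied, i.e. they are the same tensor.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Both names refer to one underlying storage, which is exactly what the
# safetensors warning above is complaining about.
print(model.lm_head.weight is model.transformer.wte.weight)  # True
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())  # True
```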

Most forums recommend the configuration below to fix the issue in non-RL applications:

"save_safetensors=False"

Unfortunately, the trlX library doesn't offer this argument, which is part of the Transformers `TrainingArguments`. Is there an equivalent way to set it in order to resolve the "tensors share memory" problem?
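One way to work around the missing flag is to save the underlying Hugging Face model yourself, since Transformers' `save_pretrained` does accept `safe_serialization=False` (forcing the torch.save/pickle format). The sketch below shows this for a plain Hugging Face model; with trlX you would apply the same call to the unwrapped policy model after training (how to reach that model depends on the trainer class, so it is not shown here and GPT-2 is only a stand-in):

```python
import torch
from transformers import AutoModelForCausalLM

# Stand-in for the trained model; replace with your own unwrapped model.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Option 1: let Transformers write pytorch_model.bin (via torch.save) instead of
# model.safetensors; safe_serialization is a documented save_pretrained keyword.
model.save_pretrained("checkpoint_dir", safe_serialization=False)

# Option 2: bypass save_pretrained entirely and dump the state dict yourself.
torch.save(model.state_dict(), "checkpoint_dir/pytorch_model.bin")
```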

Which trlX version are you using?

0.7.0

Additional system and package information

Linux 20.04, Python 3.11.8, PyTorch 2.2.2

RekkimiARG commented 7 months ago

Do you have any solution? I'm running into the same problem.

PamKing7 commented 3 months ago

I seem to have solved this problem by setting safe_serialization=False on line 99 of python3.10/site-packages/accelerate/checkpointing.py; the model is then saved with the torch.save() method by default.
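If you would rather not edit the installed package, a runtime monkey-patch in the same spirit might look like the sketch below. It forces `safe_serialization=False` on the save helper that accelerate's checkpointing module imports; the module path, attribute name, and keyword are assumptions based on recent accelerate versions, so verify them against your install before relying on this:

```python
import accelerate.checkpointing as ckpt

# Keep a reference to the helper that accelerate.checkpointing uses internally.
_original_save = ckpt.save

def _save_without_safetensors(*args, **kwargs):
    # Force the pickle/torch.save path; assumes `safe_serialization` is passed
    # as a keyword by the callers inside checkpointing.py (true in recent
    # accelerate versions, but check yours).
    kwargs["safe_serialization"] = False
    return _original_save(*args, **kwargs)

# Apply the patch before training/saving starts.
ckpt.save = _save_without_safetensors
```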