RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

todayisYu commented 2 weeks ago

Hi, I also ran into a problem when training TWOSOME in the Tomato Salad environment. I ran `sh scripts/tomato_salad_ppo_llm.sh` and got the following error:

    pygame 2.4.0 (SDL 2.26.4, Python 3.9.20)
    Hello from the pygame community. https://www.pygame.org/contribute.html
    You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
    Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.14it/s]
    Some parameters are on the meta device because they were offloaded to the cpu.
    You shouldn't move a model that is dispatched using accelerate hooks.
    Traceback (most recent call last):
      File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/ppo_llm_pomdp.py", line 192, in <module>
        agent = LLMAgent(normalization_mode=args.normalization_mode, load_8bit=args.load_8bit, task=args.task)
      File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 73, in __init__
        self.llama = self._init_llama()
      File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 92, in _init_llama
        model.half().to(self.device)
      File "/home/wenyi/.conda/envs/huangyujie_twosome/lib/python3.9/site-packages/accelerate/big_modeling.py", line 456, in wrapper
        raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
    RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

My guess is that the GPU does not have enough memory. Could you tell me which devices are supported? Could the torch version also be the cause?
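For context, the traceback shows the error being raised by accelerate's wrapper around `.to()` after the loader reported that some parameters were offloaded to the CPU. A minimal sketch of how one could inspect that placement before calling `.to()` is below. It assumes the model was loaded through transformers with a `device_map`, in which case the placement is recorded in `model.hf_device_map`. The helper names `offloaded_modules` and `place_model` are hypothetical and not part of TWOSOME.

```python
def offloaded_modules(model):
    """Return names of modules that the device map places on cpu or disk."""
    # hf_device_map is only present when the model was loaded with a device_map.
    device_map = getattr(model, "hf_device_map", None) or {}
    return [name for name, dev in device_map.items() if str(dev) in ("cpu", "disk")]


def place_model(model, device):
    """Move the model only if it was not already dispatched with a device map."""
    if getattr(model, "hf_device_map", None) is not None:
        # Already placed by accelerate; calling .to() may raise the RuntimeError above.
        return model
    # Same call as in the traceback, taken only when no device map is attached.
    return model.half().to(device)
```

If `offloaded_modules(model)` returns a non-empty list, the weights did not fit on the GPU, which would be consistent with the memory guess above.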

WeihaoTan commented 1 week ago

Thanks for reaching out. I think you are right. The training code needs slightly less than 40 GB of VRAM, so it can run on a single A100 40G. You can try a smaller batch size. I do not think changing the torch version will solve the issue.
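As a rough back-of-the-envelope check, a smaller batch size only shrinks the activation part of the footprint, while the weights are a fixed cost. The sketch below estimates that fixed cost. The parameter count of 7e9 is an assumption for illustration, since the thread does not state the model size, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB (no activations, no optimizer state)."""
    return num_params * bytes_per_param / 1024**3


# ASSUMPTION: a 7-billion-parameter model; replace with the real parameter count.
ASSUMED_PARAMS = 7e9

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 2)  # 2 bytes per parameter in half precision
int8_gib = weight_memory_gib(ASSUMED_PARAMS, 1)  # 1 byte per parameter with 8-bit loading

print(f"fp16 weights: {fp16_gib:.2f} GiB, int8 weights: {int8_gib:.2f} GiB")
```

Under that assumption the half-precision weights alone come to roughly 13 GiB, before any activations or optimizer state are counted.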

todayisYu commented 6 days ago

> Thanks for reaching out. I think you are right. The training code needs slightly less than 40 GB of VRAM, so it can run on a single A100 40G. You can try a smaller batch size. I do not think changing the torch version will solve the issue.

Thanks for replying! I actually have four 2080 Ti GPUs with 11 GB each. Can I try training on those? If that is possible, how should I modify the code?

WeihaoTan commented 6 days ago

I am not 100% sure, but I think you can give it a try. You will need some form of model, data, or pipeline parallelism, though. DeepSpeed might also be helpful. You would need to add these to the current code yourself.
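One simple form of model parallelism is to let transformers and accelerate shard the weights across the GPUs at load time. The sketch below is untested against TWOSOME and only covers loading, not the PPO training loop. It uses the `device_map="auto"` and `max_memory` arguments of `from_pretrained`. The helpers `build_max_memory` and `load_sharded` are hypothetical, and the 9 GiB per-GPU budget is an assumed value that leaves headroom on an 11 GB card.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int) -> dict:
    """Build a per-GPU memory budget keyed by GPU index, e.g. {0: "9GiB", 1: "9GiB"}."""
    return {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}


def load_sharded(model_path: str, max_memory: dict):
    """Load a causal LM in half precision, sharded across GPUs within the given budget."""
    # Imported here so the budget helper can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    # If the model does not fit in the stated budget, some modules may still be offloaded.
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        max_memory=max_memory,
    )


# ASSUMPTION: four GPUs with 9 GiB usable each.
budget = build_max_memory(num_gpus=4, per_gpu_gib=9)
```

A model loaded this way is already placed on its devices, so the later `model.half().to(self.device)` call in `_init_llama` would have to be skipped.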

WeihaoTan / TWOSOME

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk. #12