OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

How much memory(RAM) is required to train a 70B Llama2 model with two 80G A800 nodes? #332

Open luo-li-ba-suo opened 1 week ago

luo-li-ba-suo commented 1 week ago

When running the example script (ZeRO-3 with optimizer offload), I encountered a ray.exceptions.OutOfMemoryError during training. Specifically, the error occurred at step 32, with gradient_accumulation_steps set to 32. Each node has 1600 GB of RAM.

Error message:

Traceback (most recent call last):
  File "/tmp/ray/session_2024-06-26_02-32-03_991172_281/runtime_resources/working_dir_files/_ray_pkg_c9e529e2adb6802f/examples/train_ppo_ray.py", line 289, in <module>
    train(args)
  File "/tmp/ray/session_2024-06-26_02-32-03_991172_281/runtime_resources/working_dir_files/_ray_pkg_c9e529e2adb6802f/examples/train_ppo_ray.py", line 151, in train
    ray.get(refs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 172.16.3.136, ID: 13fd40854e935ee29ac853f6cc280d570d93878db761fc875e3be956) where the task (actor ID: 54fc405672d07c4a9ee60f8912000000, name=ActorModelRayActor.__init__, pid=216047, memory used=391.25GB) was running was 1579.14GB / 1600.00GB (0.986965), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 1485052eb89ea34835a6af195c504d773e95948f4ef07a4dbe50c401) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.3.136`. To see the logs of the worker, use `ray logs worker-1485052eb89ea34835a6af195c504d773e95948f4ef07a4dbe50c401*out -ip 172.16.3.136`.
Top 10 memory users:
PID     MEM(GB)  COMMAND
215880  391.27   ray::ActorModelRayActor.fit
216047  391.25   ray::ActorModelRayActor.fit
216046  391.25   ray::ActorModelRayActor.fit
216048  391.25   ray::ActorModelRayActor.fit
299     1.70     /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
547     0.53     /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
215671  0.45     python3 examples/train_ppo_ray.py --ref_num_nodes 1 --ref_num_gpus_per_node 2 --reward_num_nodes 1 -...
463     0.36     /usr/bin/python /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 ...
462     0.24     /usr/bin/python -m ray.util.client.server --address=172.16.3.136:6379 --host=0.0.0.0 --port=10001 --...
641     0.18     /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=...
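For a back-of-the-envelope check against the numbers above, the CPU-side footprint of offloaded Adam states under ZeRO-3 can be sketched as follows. This is an illustrative estimate, not OpenRLHF code: the 12 bytes/parameter figure assumes fp32 momentum, variance, and master weights, and it ignores gradients, pinned buffers, the other colocated models, and Ray overhead, so it is only a lower bound.

```python
def zero3_adam_offload_cpu_gb(n_params: float, world_size: int,
                              ranks_per_node: int,
                              bytes_per_param: int = 12) -> float:
    """Rough CPU RAM (GB) per node for CPU-offloaded Adam states under ZeRO-3.

    bytes_per_param = 12 assumes fp32 momentum + variance + master weights,
    sharded evenly across all training ranks by ZeRO-3.
    """
    per_rank_gb = n_params * bytes_per_param / world_size / 1e9
    return per_rank_gb * ranks_per_node

# 70B parameters, 16 training GPUs total, 8 GPUs per node:
print(zero3_adam_offload_cpu_gb(70e9, 16, 8))  # 420.0 (GB/node, optimizer states only)
```

Even this lower bound is a large fraction of the 1600 GB per node once gradients, pinned host buffers, and the reference/reward/vLLM processes are added on top.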

hijkzzz commented 1 week ago

Did you use our docker image https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile?

16 GPUs are very tight for a 70B model; 32 GPUs are recommended. Here is an example of colocating models on shared nodes, for an 8B model: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_ppo_llama3_ray_colocate.sh

luo-li-ba-suo commented 1 week ago

Yes, I used this:

https://github.com/OpenLLMAI/OpenRLHF/blob/main/dockerfile/Dockerfile

luo-li-ba-suo commented 1 week ago

Oh, thanks!

mickelliu commented 6 days ago

My personal experience is that 2 TB of RAM per node is needed to be safe. You might also want to load with bf16 for both the gradient-accumulation dtype and the parameters.
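At the DeepSpeed-config level, that advice could look roughly like the fragment below. This is a hedged sketch of the assumed shape, not OpenRLHF's actual generated config; the key names (`bf16.enabled`, `data_types.grad_accum_dtype`, `zero_optimization.offload_optimizer`) follow DeepSpeed's documented config schema.

```python
# Illustrative DeepSpeed config fragment (assumed shape, not copied from OpenRLHF):
ds_config = {
    "bf16": {"enabled": True},                   # keep parameters/gradients in bf16
    "data_types": {"grad_accum_dtype": "bf16"},  # accumulate gradients in bf16, not fp32
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```

Accumulating gradients in bf16 halves the per-parameter accumulation buffer relative to fp32, at some cost in accumulation precision.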

luo-li-ba-suo commented 6 days ago

I opened a pull request that makes it more convenient to run 70B on two nodes by using LoRA (and fixes the bug where LoRA was not compatible with vLLM): https://github.com/OpenLLMAI/OpenRLHF/pull/335

Have a look, thanks!
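The usual way to make LoRA work with an inference engine that only understands dense weights is to fold the adapter back into the base matrix before broadcasting, i.e. W_merged = W + (alpha / r) * B @ A. Below is a toy pure-Python sketch of that merge, for illustration only; the PR's actual fix inside OpenRLHF may be implemented differently.

```python
def merge_lora(W, A, B, alpha):
    """Fold a LoRA adapter into a dense weight matrix (toy illustration).

    W: base weights, shape (d_out, d_in), as nested lists
    A: LoRA down-projection, shape (r, d_in)
    B: LoRA up-projection,   shape (d_out, r)
    Returns W + (alpha / r) * B @ A, leaving W unmodified.
    """
    r = len(A)                      # LoRA rank
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights are untouched
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged
```

After merging, the engine sees ordinary dense weights, so no LoRA-specific support is needed on the vLLM side.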