aliyun / SimAI


Unexpected hard disk read amount when running NS3 simulation #5


HeRaNO commented 6 days ago

Hardware environment

Reproduce

All commands are run as the root user.

# Prerequisites
$ apt update
$ apt upgrade
$ apt install cmake
$ pip install pandas

# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ git clone https://github.com/aliyun/aicb.git
$ cd ./SimAI/

# Clone submodules
$ git submodule update --init --recursive
# Make sure the newest commits are used
$ git submodule update --remote
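# Optional sanity check (plain git, nothing project-specific): confirm
# which commit each submodule is actually checked out at
$ git submodule status --recursive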

# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3

# Generate workload (use aliyun/aicb main branch)
$ cd ~/aicb
# Note: `python` inside the script should be changed to `python3`
$ sh ./scripts/megatron_workload_with_aiob.sh \
-m 7 --world_size 4096 \
--tensor_model_parallel_size 2 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 8192 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn
$ cp results/workload/gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt ~/SimAI
$ cd ~/SimAI

# Create the network topology
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 400Gbps -nvbw 2400Gbps
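# Quick check that the topology file used by the run command below
# was actually generated (filename taken from that command)
$ ls -lh ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100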

# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
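Not part of the original repro, but running the same command under GNU time would quantify the peak memory footprint directly ("Maximum resident set size" in its report):

# Hypothetical measurement run; /usr/bin/time -v reports peak RSS on exit
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 /usr/bin/time -v ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100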

Result

The simulation hung at layer num: 483. The Aliyun monitor shows CPU usage at ~25%, but the hard disk read volume is abnormally high.

[screenshot: Aliyun monitor showing the abnormal disk read volume]

Maybe the simulation needs a huge amount (>>8 GB) of memory, which causes the server to fall back on swap.
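One way to confirm the swap hypothesis while the run is stuck (standard Linux tools, nothing SimAI-specific):

$ free -h       # swap "used" climbing while "available" is near zero points to swapping
$ vmstat 1 5    # nonzero si/so columns mean pages are being swapped in/out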


UPD: On a 4-core, 16 GB server, the simulation stops at layer num: 599 (optimizer1) and never prints the final collective message.

do.sh:

AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?

Command:

nohup bash do.sh > logs 2>&1 &

(The original order `2>&1 > logs` leaves stderr attached to the terminal; redirections are processed left to right, so `> logs 2>&1` is needed to capture both streams in logs.)

Tail of logs:

***** info: fwd pass comm collective for layer: cross_entropy1 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 597 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy2, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy2 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 598 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy3, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy3 is finished************
chunk size is: 4 , size is: 4 , layer_num is: 599 , node: 0
info: all-reduce forward pass collective issued for layer: optimizer1, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0
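Side note: the trailing 0 appears to be the `echo $?` output from do.sh, so the simulator exited with status 0 rather than being killed. A process killed by the Linux OOM killer would normally report 137 (128 + SIGKILL). To rule the OOM killer out anyway:

$ dmesg | grep -i -E 'out of memory|killed process'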

do.sh:

AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./example/microAllReduce.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?

Command:

nohup bash do.sh > logs 2>&1 &

Tail of logs:

All data received by node 125 is 939524096
sim_finish on sent,  Thread id: 140257080956416
All data sent from node 126 is 939524096
sim_finish on received,  Thread id: 140257080956416
All data received by node 126 is 939524096
sim_finish on sent,  Thread id: 140257080956416
All data sent from node 127 is 939524096
sim_finish on received,  Thread id: 140257080956416
All data received by node 127 is 939524096
0
HUNMrsen commented 3 days ago

Yes, you can check the current memory usage by running the top command while the process is running; the hang is most likely due to memory limitations.
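To get a record over time rather than a live view, a small sampling loop can log the simulator's memory once per second (a sketch; it assumes the binary's process name is SimAI_simulator):

# Append "RSS VSZ" (KiB) for the simulator to mem.log every second until it exits
$ while pgrep -x SimAI_simulator > /dev/null; do ps -o rss=,vsz= -C SimAI_simulator >> mem.log; sleep 1; done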