Unexpected hard disk read amount when running NS3 simulation

Hardware environment

Server: Aliyun ECS ecs.e-c1m4.large
- CPU: Intel(R) Xeon(R) Platinum 2C
- Memory: 8G
OS: Ubuntu 22.04.5 (5.15.0-106-generic)

Reproduce

Use root user.

# Pre-request
$ apt update
$ apt upgrade
$ apt install cmake
$ pip install pandas

# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ git clone https://github.com/aliyun/aicb.git
$ cd ./SimAI/

# Clone submodules
$ git submodule update --init --recursive
# Make sure use the newest commit
$ git submodule update --remote

# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3

# Generate workload (use aliyun/aicb main branch)
$ cd ~/aicb
# Should change `python` to `python3`
$ sh ./scripts/megatron_workload_with_aiob.sh \
-m 7 --world_size 4096 \
--tensor_model_parallel_size 2 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 8192 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn
$ cp results/workload/gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt ~/SimAI
$ cd ~/SimAI

$ # Create network topo
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 400Gbps -nvbw 2400Gbps

# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100

Result

It hung at layer num: 483. The monitor on Aliyun shows the CPU usage is ~25%, but the hard disk read is abnormal.

`}`V~UCA%9P1YD%X7E(NH8L

Maybe the simulation needs a huge amount (>>8G) of memory, which causing using swap memory on the server.

UPD: On a 4C 16G server, the simulation stops at layer num: 599 (optimizer1), and it didn't output the collective message at last.

do.sh:

AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?

Command:

nohup bash do.sh 2>&1 > logs &

Tail of logs:

***** info: fwd pass comm collective for layer: cross_entropy1 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 597 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy2, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy2 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 598 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy3, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy3 is finished************
chunk size is: 4 , size is: 4 , layer_num is: 599 , node: 0
info: all-reduce forward pass collective issued for layer: optimizer1, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0