astra-sim / astra-sim

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
https://astra-sim.github.io/
MIT License

What should I do if I want to use ASTRA-sim 2.0 and ns-3 to analyze collective communication? #186

Open SMAC-Zhang opened 6 months ago

SMAC-Zhang commented 6 months ago

Hello, I want to use ASTRA-sim 2.0 with ns-3 to compare different collective algorithms. I want to simulate an all-reduce sample first. I use

python3 -m chakra.et_converter.et_converter \
    --input_type Text \
    --input_filename microAllReduce.txt \
    --output_filename ../ASTRA-sim-2.0/allReduce \
    --num_npus 8 \
    --num_dims 1 \
    --num_passes 1

to obtain the workload.
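A quick way to confirm that the conversion step produced usable traces before starting the simulator is sketched below. It assumes the converter writes one file per NPU named `<output_filename>.<npu_id>.et`. That naming pattern is an assumption here, not something confirmed in this thread, and `missing_traces` is a hypothetical helper name.

```python
from pathlib import Path


def missing_traces(prefix, num_npus):
    """Return the per-NPU trace paths that are absent or empty.

    Assumption: traces are named "<prefix>.<npu_id>.et", one per NPU.
    """
    problems = []
    for npu_id in range(num_npus):
        path = Path(f"{prefix}.{npu_id}.et")
        # A missing or zero-byte file cannot describe any work for this NPU.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(path)
    return problems


# Prefix and NPU count taken from the converter command above.
for path in missing_traces("../ASTRA-sim-2.0/allReduce", 8):
    print(f"missing or empty trace: {path}")
```

If this prints nothing, every expected file exists and is non-empty. It does not check that the contents are valid, only that there is something for the simulator to read.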

SMAC-Zhang commented 6 months ago

Then I use

./ns3-dev-AstraSimNetwork-default \
        --workload-configuration="${SCRIPT_DIR:?}"/../../inputs/workload/ASTRA-sim-2.0/allReduce \
        --system-configuration="${SCRIPT_DIR:?}"/../../inputs/system/Switch.json \
        --network-configuration="../../../ns-3/scratch/config/config.txt" \
        --remote-memory-configuration="${SCRIPT_DIR:?}"/../../inputs/remote_memory/analytical/no_memory_expansion.json \
        --logical-topology-configuration="${SCRIPT_DIR:?}"/../../inputs/network/ns3/sample_8nodes_1D.json \
        --comm-group-configuration=\"empty\"

to run it, and I get the following output:

ASTRA-sim + NS3
There are 8 npus: 8,
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
QP is enabled 
maxRtt=2040 maxBdp=102000
sys[0] finished, 0 cycles
sys[1] finished, 0 cycles
sys[2] finished, 0 cycles
sys[3] finished, 0 cycles
sys[4] finished, 0 cycles
sys[5] finished, 0 cycles
sys[6] finished, 0 cycles
sys[7] finished, 0 cycles

Why does this happen?
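For scripted runs, the symptom above can be detected automatically. The following is a minimal sketch that scans captured simulator output for the `sys[N] finished, X cycles` lines shown in the log and lists the NPUs that report 0 cycles, which suggests that no workload was actually executed on them. The function names are hypothetical, and the line format is taken only from the output in this thread.

```python
import re

# Matches lines such as "sys[3] finished, 0 cycles" from the output above.
FINISH_RE = re.compile(r"sys\[(\d+)\] finished, (\d+) cycles")


def parse_finish_cycles(log_text):
    """Map each NPU id to the cycle count it reported on finishing."""
    return {int(m.group(1)): int(m.group(2)) for m in FINISH_RE.finditer(log_text)}


def zero_cycle_npus(log_text):
    """Return the sorted ids of NPUs that finished in 0 cycles."""
    cycles = parse_finish_cycles(log_text)
    return sorted(npu for npu, count in cycles.items() if count == 0)


sample = "sys[0] finished, 0 cycles\nsys[1] finished, 0 cycles\n"
print(zero_cycle_npus(sample))  # [0, 1]
```

A non-empty result for a run that was supposed to perform an all-reduce is a hint to inspect the workload trace files rather than the network configuration.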

Cioccoo commented 3 months ago

Hi @SMAC-Zhang, I encountered the same problem. Have you found a solution?

SMAC-Zhang commented 2 months ago

It seems something is wrong in chakra.et_converter.et_converter, so you shouldn't convert a trace file from the ASTRA-sim 1.0 format. Instead, you can modify et_generator.py according to your needs and run it to generate a collective communication trace file.