Azure / msccl

Microsoft Collective Communication Library
MIT License

Cannot use msccl-tools' xml #37

Open Eevan-zq opened 1 month ago

Eevan-zq commented 1 month ago

Why wasn't the algorithm I generated as XML with msccl-tools invoked when I ran the following command?

mpirun --allow-run-as-root -np 8 -x LD_LIBRARY_PATH=/home/msccl-tool/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ALL /home/msccl-tool/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 1 -e 32MB -f 2 -g 1 -n 100 -w 20 -z 0

I checked the code here: (image). There I found that status.algoMetas.size() is 0, so I traced further to here: (image 75c9bf4de73d3e8fdfc16da7fc5e71d).

I found that none of the .xml files generated by msccl-tools contain minBytes. Is this the reason the algorithm in the XML wasn't scheduled when I ran the mpirun command? If so, what should I do?
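For anyone who wants to check their own files for this, here is a minimal sketch that reports which of the minBytes / maxBytes attributes are missing from the root element of an XML document. The attribute names come from this thread; the sample headers, the `algo` tag name, and the helper name `missing_size_attrs` are assumptions made up for illustration, not taken from the MSCCL sources.

```python
import xml.etree.ElementTree as ET


def missing_size_attrs(xml_text, required=("minBytes", "maxBytes")):
    """Return the names in `required` that are absent from the root element."""
    root = ET.fromstring(xml_text)
    return [name for name in required if name not in root.attrib]


# Hypothetical headers for illustration only; real generated files are much larger.
old_style = '<algo name="demo" proto="LL"></algo>'
new_style = '<algo name="demo" proto="LL" minBytes="0" maxBytes="0"></algo>'

print(missing_size_attrs(old_style))  # both attributes are reported as missing
print(missing_size_attrs(new_style))  # nothing is reported as missing
```

To run it against a real file, read the file contents into a string first and pass that string to the helper. This only checks for the presence of the attributes; it says nothing about whether the executor will actually select the algorithm.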

jiangxiaobin96 commented 1 month ago

The new msccl-tools fixes this error.

Eevan-zq commented 1 month ago

By the way:
1. When I run this command: (image)

the XML header is: (image). Why are minBytes and maxBytes both equal to 0? Will this have any impact?

2. The following also appears at the end of this XML file: (image). This may be caused by an error in the final Check validation in allreduce_a100_pcie_hierarchical.py: (image)

I am currently unsure whether the XML file generated by running python ./allreduce_a100_pcie_hierarchical.py --protocol=LL 8 1 > test.xml is correct.
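One quick sanity check, sketched below, is whether the redirected output is still well-formed XML. It assumes the extra output at the end of the file is plain text printed after the closing tag (the actual text is only visible in the screenshots, so the sample strings here are hypothetical), and the helper name `check_xml` is made up for illustration. It does not verify that the algorithm itself is correct.

```python
import xml.etree.ElementTree as ET


def check_xml(xml_text):
    """Return (True, root attributes) if the text parses, else (False, error message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, dict(root.attrib)


# Hypothetical contents for illustration only.
clean = '<algo name="demo" minBytes="0" maxBytes="0"><gpu id="0"/></algo>\n'
polluted = clean + "hypothetical validation message printed after the XML\n"

print(check_xml(clean))
print(check_xml(polluted))
```

For the real file, the same helper can be called on the contents of test.xml after reading it into a string. If parsing fails because of trailing text, the message was most likely written to stdout together with the XML and ended up in the redirected file.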