Closed by monajalal 1 year ago
I am following the Azure MLOps Pipeline Template Git repo to create a CV project. I am using it to fine-tune a pretrained ResNet50 network on CIFAR10.
The file below was generated for me automatically by the template. However, it does not seem to be correct for my cluster.
My GPU cluster has 4 nodes, and each node has 4 K80 GPUs. Could you please share the ndv4-top.xml file for this topology?
<!-- This topology file was copied from https://github.com/Azure/azhpc-images/blob/master/common/network-tuning.sh -->
<system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000d:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0107:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000e:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0108:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>
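One way to see how this file differs from a 4-GPU node is to count the devices it declares per NUMA node. The following is a minimal sketch, not part of the template: it assumes that `class="0x030200"` entries are GPUs (PCI 3D controller) and `class="0x020700"` entries are InfiniBand adapters, and the names `summarize_topology` and `SAMPLE` are hypothetical. `SAMPLE` holds only the first `<cpu>` element of the file above; replace it with the full file contents to check all four NUMA nodes.

```python
import xml.etree.ElementTree as ET

# PCI class codes as they appear in the topology file (assumed meanings).
GPU_CLASS = "0x030200"  # 3D controller, i.e. a GPU
NIC_CLASS = "0x020700"  # InfiniBand controller


def summarize_topology(xml_text):
    """Return {numaid: {"gpus": n, "nics": m}} for a topology XML string."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for cpu in root.iter("cpu"):
        # Collect the class attribute of every PCI device under this NUMA node.
        classes = [pci.get("class") for pci in cpu.iter("pci")]
        summary[cpu.get("numaid")] = {
            "gpus": classes.count(GPU_CLASS),
            "nics": classes.count(NIC_CLASS),
        }
    return summary


# Trimmed sample: only the first NUMA node of the file quoted in this post.
SAMPLE = """
<system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>
"""

if __name__ == "__main__":
    for numaid, counts in summarize_topology(SAMPLE).items():
        print(f"NUMA node {numaid}: {counts['gpus']} GPUs, {counts['nics']} NICs")
```

Under those assumptions, the full file declares 2 GPUs and 2 InfiniBand adapters on each of its 4 NUMA nodes, so 8 GPUs per machine, which would not match a node with 4 K80 GPUs.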