aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Collective Permute Long Tail on trn1.32xlarge #998

Open zhdllwyc opened 1 month ago

zhdllwyc commented 1 month ago

I am launching nccl.collective_permute on a trn1.32xlarge. Within the workload, each NeuronCore sends data to a neighboring worker following a pre-specified topology. However, some of the workers experience an extremely long duration (0.2 ms), whereas most of the workers have a duration of 0.014 ms.

Below is a screenshot of the profiling result of worker 1 (0.014 ms duration): Screenshot from 2024-10-04 10-46-41

Below is a screenshot of the profiling result of worker 0 (abnormal 0.2 ms duration): Screenshot from 2024-10-04 10-46-51

The source code is:

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

from neuronx_distributed.parallel_layers.parallel_state import (
    get_tensor_model_parallel_size,
    get_tensor_model_parallel_group,
    get_tensor_model_parallel_rank,
)

from neuronx_distributed.trace import parallel_model_trace, parallel_model_save

import torch.nn.functional as F

import neuronxcc.nki.language as nl
import neuronxcc.nki.nccl as nccl
import neuronxcc.nki.isa as nisa
from neuronxcc.nki.language import par_dim
import numpy as np

from torch_neuronx import nki_jit

@nki_jit
def basic_collective_permute_1(in_ref, out_ref):
    h, w = in_ref.shape
    # Allocate send/receive staging buffers in private HBM.
    send_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="send_buf")
    recv_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="recv_buf")

    ip_send_buf, if_send_buf = nl.mgrid[0:h, 0:w]

    # Stage the kernel input into the send buffer.
    nisa._tiled_offloaded_memcpy(src=in_ref[ip_send_buf, if_send_buf], dst=send_buf[ip_send_buf, if_send_buf])

    # Each pair [src, dst] in replica_groups forwards the tile one hop along the ring.
    nccl.collective_permute(src=send_buf[:, :], dst=recv_buf[:, :],
                            replica_groups=[[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18], [18, 19], [19, 26], [26, 27], [27, 28], [28, 29], [29, 4], [4, 5], [5, 12], [12, 13], [13, 20], [20, 21], [21, 22], [22, 23], [23, 30], [30, 31], [31, 6], [6, 7], [7, 14], [14, 15], [15, 8], [8, 9], [9, 16], [16, 17], [17, 24], [24, 25], [25, 0]])

    # Copy the received tile into the kernel output.
    nisa._tiled_offloaded_memcpy(src=recv_buf[ip_send_buf, if_send_buf], dst=out_ref[ip_send_buf, if_send_buf])

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, q):
        h = q.shape[0]
        w = q.shape[1]

        output = torch.zeros((h, w), dtype=q.dtype, device=q.device)
        basic_collective_permute_1(q, output)
        return output

def get_model():
    # parallel_model_trace needs a function that returns a Model and a dictionary of states.
    # See details at:
    # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#model-trace
    return TestModule(), {} # Dictionary of states

def run_test_module(q):

    input = (q,)

    traced_model = parallel_model_trace(
        get_model,
        input,
        tp_degree=32,
        compiler_workdir='./compile_cache',
        compiler_args=None,
        max_parallel_compilations=8,
    )
    parallel_model_save(traced_model, "./saved_model/")

    result = traced_model(q)
    return result

def main():

    h, w = 128, 4096 

    q = torch.ones([h, w], dtype=torch.bfloat16)
    result = run_test_module(q)

    print(result)

    golden = torch.ones((h, w), dtype=torch.bfloat16)
    assert(torch.allclose(result, golden))

if __name__ == "__main__":
    main()
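
For reference, the replica_groups passed to nccl.collective_permute above encode a single ring over all 32 ranks: each rank appears exactly once as a sender and once as a receiver. A minimal standalone sketch (plain Python, separate from the kernel) to verify this:

# Verify the replica_groups form one 32-rank ring: every rank sends exactly
# once, and walking src -> dst from rank 0 visits all 32 ranks before
# returning to 0.
replica_groups = [[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18],
                  [18, 19], [19, 26], [26, 27], [27, 28], [28, 29], [29, 4],
                  [4, 5], [5, 12], [12, 13], [13, 20], [20, 21], [21, 22],
                  [22, 23], [23, 30], [30, 31], [31, 6], [6, 7], [7, 14],
                  [14, 15], [15, 8], [8, 9], [9, 16], [16, 17], [17, 24],
                  [24, 25], [25, 0]]

next_rank = dict(replica_groups)      # src -> dst
assert len(next_rank) == 32           # each rank sends exactly once

rank, visited = 0, set()
while rank not in visited:
    visited.add(rank)
    rank = next_rank[rank]
assert rank == 0 and len(visited) == 32   # one ring covering all 32 workers

Since every hop moves the same 128x4096 bf16 tile one step along the same ring, I would expect all workers to report comparable durations, which is why the 0.2 ms outlier on worker 0 looks anomalous.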

My pip freeze is:

absl-py==2.1.0
accelerate==0.34.2
anyio==4.6.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==24.2.0
aws-neuronx-runtime-discovery==2.9
awscli==1.34.25
babel==2.16.0
beautifulsoup4==4.12.3
bleach==6.1.0
boto3==1.35.25
botocore==1.35.25
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
cloud-tpu-client==0.10
colorama==0.4.6
comm==0.2.2
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.13.0
environment-kernels==1.2.0
exceptiongroup==1.2.2
executing==2.1.0
fastjsonschema==2.20.0
filelock==3.16.1
fqdn==1.5.1
fsspec==2024.9.0
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.35.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.65.0
h11==0.14.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.2
huggingface-hub==0.25.1
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
ipywidgets==8.1.5
islpy==2023.2.5
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.13
libneuronxla==2.0.4115.0
lockfile==0.12.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mistune==3.0.2
ml-dtypes==0.2.0
mpmath==1.3.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==2.8.8
neuronx-cc==2.15.128.0+56dc5a86
neuronx-distributed==0.9.0
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
overrides==7.7.0
packaging==24.1
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pgzip==0.3.5
pillow==10.4.0
platformdirs==4.3.6
prometheus_client==0.21.0
prompt_toolkit==3.0.47
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
Pygments==2.18.0
pyparsing==3.1.4
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.9.11
requests==2.31.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.20.0
rsa==4.7.2
s3transfer==0.10.2
safetensors==0.4.5
scipy==1.11.2
Send2Trash==1.8.3
six==1.16.0
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.3
terminado==0.18.1
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.3.0
torch-xla==2.1.4
torchvision==0.16.2
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.2
transformers-neuronx==0.12.313
triton==2.1.0
types-python-dateutil==2.9.0.20240906
typing_extensions==4.12.2
uri-template==1.3.0
uritemplate==3.0.1
urllib3==2.2.3
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
wget==3.2
widgetsnbextension==4.0.13

My neuron-profile version is:

neuron-profile 2.19.0.0%kaena-tools/2.19@c48a122 built on 2024-08-02T17:21:14Z

When profiling, I capture the profile of the second execution:

neuron-profile capture -n "$file" -s profile.ntff --profile-nth-exec=2
koyongse commented 1 month ago

We are looking into this issue and will update soon. Thanks.

koyongse commented 1 month ago

Can I get the URL for the profile result? If you can attach the NEFF as well, that would be very helpful.

zhdllwyc commented 1 month ago

The profile result is hosted on my instance, but here is my NEFF file (I had to zip it because the NEFF extension is not supported here).

MODULE_SyncTensorsGraph.40_10114637376880686083.zip

Here is the script I use to profile ($1 is the Python script to execute, $2 is the number of workers to profile):

#!/bin/bash

# Check if file is provided as an argument
if [ -z "$1" ]; then
  echo "Please provide a file."
  exit 1
fi

# Check if the provided argument is a file
if [ ! -f "$1" ]; then
  echo "The provided argument is not a file."
  exit 1
fi

current_datetime=$(TZ="America/Los_Angeles" date +"%Y-%m-%d-%H:%M:%S")

filename="${1%.py}"

DIR="${filename}_${current_datetime}"

# Clear stale compiler caches so the kernel is recompiled from scratch.
rm -rf /tmp/ubuntu/neuroncc_compile_workdir/*
rm -rf /var/tmp/neuron-compile-cache/neuronxcc-*/*

rm -rf "$DIR"

mkdir "$DIR"

python $1 

mv MODULE_* "$DIR"

cd "$DIR"

# Find the first file with the .neff extension in the current directory
file=$(find . -maxdepth 1 -type f -name "*.neff" | head -n 1)

neuron-profile capture -n "$file" -s profile.ntff --collectives-workers-per-node $2 --profile-nth-exec=2

# Collect the per-rank NTFF outputs.
mkdir profile_result
mv profile_*exec* profile_result/

# Convert each rank's profile to JSON.
mkdir profile_result_json

for ntff_file in profile_result/*; do
    echo "$ntff_file"
    rank_integer=$(echo "$ntff_file" | grep -oP '(?<=_rank_)[0-9]+')
    echo "$rank_integer"
    neuron-profile view --output-format json --output-file "./profile_result_json/profile_${rank_integer}_${current_datetime}.json" -n "$file" -s "${ntff_file}"
done

neuron-profile view -n "$file" -d profile_result --db-bucket="${current_datetime}"

cd ..
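
Invocation example (profile.sh and test_collective_permute.py are just the names I use locally for the script above and the Python source, and 32 profiles all 32 workers on the trn1.32xlarge):

./profile.sh test_collective_permute.py 32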
zhdllwyc commented 1 month ago

Here is my NTFF file: profile_result.zip

koyongse commented 1 month ago
  1. On the profile, if you hover over the CC box, you can see it is an AllReduce. I could also confirm this by disassembling the binary in the NEFF. This is likely because your Neuron version does not support CollectivePermute; both the Neuron compiler and runtime need to support it.
  2. Please use the latest Neuron version.
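
For example, the latest packages can be installed from the Neuron pip repository (adjust the package list to your setup):

pip install --upgrade neuronx-cc torch-neuronx neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com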