shamanez opened this issue 4 months ago (status: Open)
Describe the bug
When I use the most recent Megatron-LM fork, I get the following error:
```
make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/megatron-lm/megatron/core/datasets'
ERROR:megatron.core.datasets.utils:Failed to compile the C++ dataset helper functions
```
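As a first diagnostic (a minimal sketch, assuming the container layout from the log above; paths may differ in your setup), the helpers can be rebuilt by hand so any g++ diagnostics print directly instead of the generic ERROR line:

```bash
# Re-run the same Makefile that megatron.core.datasets.utils invokes;
# compiler errors will be shown verbatim on stderr.
cd /workspace/megatron-lm/megatron/core/datasets
make

# Confirm the built extension imports (run from the repo root so the
# package resolves; this check is illustrative, not from the issue).
cd /workspace/megatron-lm
python -c "from megatron.core.datasets import helpers"
```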
To Reproduce
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --partition=batch                                       # Adjust this for your cluster
#SBATCH --output=/home/shamane/logs/training_scratch/log.out    # Adjust this for your cluster
#SBATCH --err=/home/shamane/logs/training_scratch/error.err     # Adjust this for your cluster

export MASTER_ADDR=$(hostname)
export GPUS_PER_NODE=8

# ---
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING
export NCCL_IB_AR_THRESHOLD=0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=2
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IGNORE_CPU_AFFINITY=1
# ---

nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo "Node IP: $head_node_ip"

# Specify the Docker image to use.
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"

# Define the path to the Megatron-LM directory on the head node.
MEGATRONE_PATH="/home/shamane/Megatron-LM-luke"  # Update with actual path. Path should be on the head node.

# Set paths for checkpoints and tokenizer data. These should be on a shared data directory.
SHARED_DIR="/data/fin_mixtral_2B/"

#MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_ADDR=$head_node_ip
MASTER_PORT=${MASTER_PORT:-"6008"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NNODES: $SLURM_NNODES"
echo "SLURM_NODEID: $SLURM_NODEID"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "NNODES: $NNODES"
echo "MASTER_PORT: $MASTER_PORT"
echo "NODE_RANK: $NODE_RANK"

#module load docker
echo "-v $SHARED_DIR:/workspace/data"
echo "-v $MEGATRONE_PATH:/workspace/megatron-lm"
echo "$PYTORCH_IMAGE"
echo "bash -c \"pip install flash-attn sentencepiece && \
  bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
  /workspace/data/megatrone_checkpoints \
  /workspace/data/tokenizers/tokenizer.model \
  /workspace/data/processed_data/finance_2b_mixtral_text_document \
  $MASTER_ADDR \
  $MASTER_PORT \
  $NNODES \
  $NODE_RANK\""

# Run the Docker container with the specified PyTorch image.
srun docker run \
  -e SLURM_JOB_ID=$SLURM_JOB_ID \
  --gpus all \
  --ipc=host \
  --network=host \
  --workdir /workspace/megatron-lm \
  -v $SHARED_DIR:/workspace/data \
  -v $MEGATRONE_PATH:/workspace/megatron-lm \
  $PYTORCH_IMAGE \
  bash -c "pip install flash-attn sentencepiece wandb 'git+https://github.com/fanshiqing/grouped_gemm@v1.0' && \
    bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
    /workspace/data/mixtral8x7-instr-tp2-emp8-ggemm \
    /workspace/data/tokenizers/tokenizer.model \
    /workspace/data/processed_data/finance_2b_mixtral_text_document \
    $MASTER_ADDR \
    $MASTER_PORT \
    $NNODES \
    $NODE_RANK"

# This Docker command mounts the specified Megatron-LM and data directories, sets the working directory,
# and runs the 'run_mixtral_distributed.sh' script inside the container.
# This script facilitates distributed training using the specified PyTorch image, leveraging NVIDIA's optimizations.
```
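Since the failure happens at container startup, it may also be worth confirming (a hedged sanity check, not part of the original report) that the image's toolchain and pybind11 headers match the include paths shown in the error log:

```bash
# Check that g++/make exist and print the pybind11 include path
# inside the same image the job uses.
docker run --rm nvcr.io/nvidia/pytorch:24.03-py3 bash -c \
  'which g++ make && python -c "import pybind11; print(pybind11.get_include())"'
```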
Environment (please complete the following information):
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
Additional context
This works well with the fork that I downloaded 4 days ago.
Marking as stale. No activity in 60 days.
Workaround: compile manually in megatron/core/datasets, then comment out the compile_helpers function in megatron/core/datasets/utils.py.
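A sketch of that workaround, reusing the exact g++ invocation from the error log (paths assume the container layout above; adjust to your environment):

```bash
# 1) Build the extension manually inside the container, using the same
#    command the Makefile ran in the error log.
cd /workspace/megatron-lm/megatron/core/datasets
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color \
    -I/usr/include/python3.10 \
    -I/usr/local/lib/python3.10/dist-packages/pybind11/include \
    helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so

# 2) Edit megatron/core/datasets/utils.py and comment out the body of
#    compile_helpers() (e.g. replace it with `pass`) so the runtime
#    build is skipped and the prebuilt .so is used instead.
```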