facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License
3.71k stars 825 forks

Issue with Activating UVM Function in torchrec_dlrm #368

Open JhengLu opened 9 months ago

JhengLu commented 9 months ago

Hi,

I ran into an issue while running torchrec_dlrm/dlrm_main.py with the parameters below: UVM does not appear to be activated properly. The full command follows. I also observed different behavior depending on the reservation rate setting.

Command:

CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 bash -c \
    'export PREPROCESSED_DATASET=./criteo-research-kaggle-output && \
    export GLOBAL_BATCH_SIZE=262144 && \
    export WORLD_SIZE=2 && \
    torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py -- \
        --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
        --pin_memory \
        --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
        --learning_rate 1.0 \
        --dataset_name criteo_kaggle \
        --embedding_dim 1024 \
        --dense_arch_layer_sizes 10240,10240,1024 \
        --over_arch_layer_sizes 4096,4096,4096,1 \
        --print_sharding_plan \
        --num_embeddings_per_feature 163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840 2>&1 | tee log/0G_500G_16384feature_numactl.log'
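For context, here is a rough estimate of the embedding footprint this command implies, since whether UVM comes into play depends on how the tables compare to available GPU memory. This is only a sketch: the table count, row count, embedding dimension, and batch sizes are copied from the command, but the 4 bytes per element (fp32 weights) is an assumption, and optimizer state and activations are not counted.

```python
# Rough embedding-table footprint for the command above.
# Values copied from the command line flags.
NUM_TABLES = 26            # entries in --num_embeddings_per_feature
ROWS_PER_TABLE = 163_840   # each entry in --num_embeddings_per_feature
EMBEDDING_DIM = 1024       # --embedding_dim
GLOBAL_BATCH_SIZE = 262_144
WORLD_SIZE = 2

# ASSUMPTION: weights are stored as fp32 (4 bytes per element).
# Optimizer state, activations, and dense layers are not included.
BYTES_PER_ELEMENT = 4


def embedding_bytes(num_tables: int, rows: int, dim: int, bytes_per_element: int) -> int:
    """Total bytes needed to hold the embedding weights only."""
    return num_tables * rows * dim * bytes_per_element


total_bytes = embedding_bytes(NUM_TABLES, ROWS_PER_TABLE, EMBEDDING_DIM, BYTES_PER_ELEMENT)
total_gib = total_bytes / 2**30

# Same arithmetic as --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)).
per_rank_batch = GLOBAL_BATCH_SIZE // WORLD_SIZE

print(f"embedding weights: {total_bytes} bytes ({total_gib:.2f} GiB)")
print(f"per-rank batch size: {per_rank_batch}")
```

Under these assumptions the weights alone come to 16.25 GiB, with a per-rank batch size of 131072. Whether that exceeds device memory depends on the GPU in use, which is not stated here.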

Thank you for your assistance.