Issue with Activating UVM Function in torchrec_dlrm

Hi,

I ran into an issue while running torchrec_dlrm (dlrm_main.py) with the parameters below: the UVM function does not appear to be activated. The command is shown next, and the run fails in different ways depending on the reservation rate I set.

Command:

CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 bash -c \
    'export PREPROCESSED_DATASET=./criteo-research-kaggle-output && \
    export GLOBAL_BATCH_SIZE=262144 && \
    export WORLD_SIZE=2 && \
    torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py -- \
        --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
        --pin_memory \
        --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
        --learning_rate 1.0 \
        --dataset_name criteo_kaggle \
        --embedding_dim 1024 \
        --dense_arch_layer_sizes 10240,10240,1024 \
        --over_arch_layer_sizes 4096,4096,4096,1 \
        --print_sharding_plan \
        --num_embeddings_per_feature 163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840 2>&1 | tee log/0G_500G_16384feature_numactl.log'

Issue Details:
- The UVM function does not seem to be activated.
- With the reservation rate set to 0.49, the run fails with an error.
- With the reservation rate set to 0.45, the run fails with a different error.

Request for Guidance:
- How do I properly activate the UVM function?
- How should I address the two errors mentioned above?
- My understanding is that when GPU memory is insufficient, the UVM mechanism should move part of the embeddings to CPU memory, which should prevent such errors. Is that correct?
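
For context, here is a rough estimate of the embedding footprint implied by the command above. It is only a sketch: it assumes fp32 weights (4 bytes each) and ignores optimizer state, the dense and over-arch layers, and activations, so the real requirement is likely higher. The constant and function names are my own; only the numeric values come from the command.

```python
# Back-of-the-envelope embedding memory estimate for the command above.
# Assumption (not taken from any log): weights are stored as fp32, 4 bytes each.
# Optimizer state, MLP layers, and activations are deliberately ignored.

NUM_TABLES = 26            # number of entries in --num_embeddings_per_feature
ROWS_PER_TABLE = 163_840   # each entry in --num_embeddings_per_feature
EMBEDDING_DIM = 1024       # --embedding_dim
BYTES_PER_WEIGHT = 4       # assumed fp32

GLOBAL_BATCH_SIZE = 262_144
WORLD_SIZE = 2

MIB = 1024 ** 2
GIB = 1024 ** 3


def embedding_bytes(num_tables: int, rows: int, dim: int, bytes_per_weight: int) -> int:
    """Total bytes needed to hold the raw embedding weights."""
    return num_tables * rows * dim * bytes_per_weight


per_table = embedding_bytes(1, ROWS_PER_TABLE, EMBEDDING_DIM, BYTES_PER_WEIGHT)
total = embedding_bytes(NUM_TABLES, ROWS_PER_TABLE, EMBEDDING_DIM, BYTES_PER_WEIGHT)
local_batch = GLOBAL_BATCH_SIZE // WORLD_SIZE  # value passed to --batch_size

print(f"per table: {per_table / MIB:.0f} MiB")
print(f"all tables: {total / GIB:.2f} GiB")
print(f"per-process batch size: {local_batch}")
```

Under these assumptions the embedding weights alone come to 16.25 GiB (640 MiB per table), which is why I expected part of the tables to spill over to host memory through UVM.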

Thank you for your assistance.

facebookresearch/dlrm, issue #368