facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

TorchRec-DLRM with Criteo Kaggle dataset problems #342

Closed. BradZhone closed this issue 1 year ago

BradZhone commented 1 year ago

Hello @samiwilf, I tried to run torchrec-dlrm with the Criteo Kaggle dataset, following the "Criteo Kaggle Display Advertising Challenge dataset usage" instructions:

export PREPROCESSED_DATASET=$insert_your_path_here
export GLOBAL_BATCH_SIZE=16384 ;
export WORLD_SIZE=8 ;
torchx run -s local_cwd dist.ddp -j 1x8 --script dlrm_main.py -- \
    --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
    --pin_memory \
    --mmap_mode \
    --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
    --learning_rate 1.0 \
    --dataset_name criteo_kaggle

I got this error:

File "./dlrm/torchrec_dlrm/data/dlrm_dataloader.py", line 87, in _get_in_memory_dataloader
dlrm_main/0 [2]:    (root_name, stage) = ("train", "test") if stage == "val" else stage
dlrm_main/0 [3]:ValueError: too many values to unpack (expected 2)

which points to: https://github.com/facebookresearch/dlrm/blob/69d22b99ec02ff868dbc1170e39686935f9d1274/torchrec_dlrm/data/dlrm_dataloader.py#L87

That seems to be caused by the case where stage == "train" or stage == "test": the else branch assigns the bare string, so Python tries to unpack its individual characters into the two variables. I modified this line to:

(root_name, stage) = ("train", "test") if stage == "val" else (stage, stage)

I'm not sure whether that's right. After this change, it can run successfully, but the AUC always equals 0.5002527236938477 and never improves. Is there any solution?
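
To double-check the fix outside the repo, here is a minimal standalone sketch of the unpacking behavior; the helper names (resolve_stage_buggy, resolve_stage_fixed) are hypothetical and only mirror line 87 of dlrm_dataloader.py:

def resolve_stage_buggy(stage: str):
    # Original line 87: when stage is "train" or "test", the else branch yields the
    # bare string, so Python tries to unpack its characters into two names and
    # raises "ValueError: too many values to unpack (expected 2)".
    (root_name, stage) = ("train", "test") if stage == "val" else stage
    return root_name, stage

def resolve_stage_fixed(stage: str):
    # Proposed fix: both branches produce a 2-tuple, so the unpacking always succeeds.
    (root_name, stage) = ("train", "test") if stage == "val" else (stage, stage)
    return root_name, stage

print(resolve_stage_fixed("val"))    # ('train', 'test')
print(resolve_stage_fixed("train"))  # ('train', 'train')
try:
    resolve_stage_buggy("train")
except ValueError as e:
    print(e)                          # too many values to unpack (expected 2)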

samiwilf commented 1 year ago

Thanks @BradZhone for spotting this. I made a PR for the one-line fix (https://github.com/facebookresearch/dlrm/pull/345). I'm also seeing the 0.50 AUC issue. @s4ayub or @colin2328 wrote the Criteo Kaggle preprocessing code in the torchrec repo, so they should have an idea.