NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for training Click-Through-Rate (CTR) estimation models
Apache License 2.0

huge_ctr --train fails with Runtime error: Illegal Memory Access #41

Closed ketanchothani13 closed 4 years ago

ketanchothani13 commented 4 years ago

I am using the HugeCTR Docker image: https://ngc.nvidia.com/catalog/containers/nvidia:hugectr. When I train on a dataset preprocessed with NVTabular into Parquet format, as described in the Criteo example, huge_ctr fails with an "Illegal Memory Access" runtime error.

These are the machine configurations:

  1. Azure VM with Ubuntu 16.04 image
  2. Size: Standard NC6s_v3, RAM: 112 GiB, SSD: 256 GiB, Single GPU.

Error snapshot:

[Screenshot: criteo_example]

zehuanw commented 4 years ago

Thank you for your feedback!

We will try to reproduce this week and reply to you.

shijieliu commented 4 years ago

Hi @ketanchothani13, we have fixed the bug on master. The log from a successful run is included below. Thanks.

[0, init_start, ]
HugeCTR Version: 2.2.1
Config file: criteo_parquet.json
[14d12h38m52s][HUGECTR][INFO]: batchsize_eval is not specified using default: 512
[14d12h38m52s][HUGECTR][INFO]: Default evaluation metric is AUC without threshold value
[14d12h38m52s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[14d12h38m52s][HUGECTR][INFO]: Algorithm search: ON
[14d12h38m52s][HUGECTR][INFO]: cuda_graph is not specified using default: 1
[14d12h38m52s][HUGECTR][INFO]: CUDA Graph: ON
[14d12h38m52s][HUGECTR][INFO]: Initial seed is 2031098203
[14d12h38m53s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-SXM2-16GB
[14d12h38m54s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[14d12h38m54s][HUGECTR][INFO]: max_nnz is not specified using default: 100
[14d12h38m54s][HUGECTR][INFO]: num_internal_buffers 1
[14d12h38m54s][HUGECTR][INFO]: num_internal_buffers 1
[14d12h38m54s][HUGECTR][INFO]: Vocabulary size: 1737715
[14d12h38m54s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=1737715
[14d12h38m54s][HUGECTR][INFO]: gpu0 start to init embedding
[14d12h38m54s][HUGECTR][INFO]: gpu0 init embedding done
[14d12h38m54s][HUGECTR][INFO]: warmup_steps is not specified using default: 1
[14d12h38m54s][HUGECTR][INFO]: decay_start is not specified using default: 0
[14d12h38m54s][HUGECTR][INFO]: decay_steps is not specified using default: 1
[14d12h38m54s][HUGECTR][INFO]: decay_power is not specified using default: 2.000000
[14d12h38m54s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[2288.58, init_end, ]
[2288.59, run_start, ]
HugeCTR training start:
[2288.59, train_epoch_start, 0, ]
[14d12h38m54s][HUGECTR][INFO]: Iter: 200 Time(200 iters): 0.176173s Loss: 0.488863 lr:0.001000
[14d12h38m54s][HUGECTR][INFO]: Iter: 400 Time(200 iters): 0.176549s Loss: 0.511035 lr:0.001000
[14d12h38m54s][HUGECTR][INFO]: Iter: 600 Time(200 iters): 0.175732s Loss: 0.442332 lr:0.001000
[14d12h38m55s][HUGECTR][INFO]: Iter: 800 Time(200 iters): 0.175551s Loss: 0.500150 lr:0.001000
[14d12h38m55s][HUGECTR][INFO]: Iter: 1000 Time(200 iters): 0.177358s Loss: 0.463984 lr:0.001000
[3170.39, eval_start, 0.1, ]
[14d12h38m55s][HUGECTR][INFO]: Evaluation, AUC: 0.610862
[3211.09, eval_accuracy, 0.610862, 0.1, 1000, ]
[14d12h38m55s][HUGECTR][INFO]: Eval Time for 100 iters: 0.040700s
[3211.11, eval_stop, 0.1, ]
[14d12h38m55s][HUGECTR][INFO]: Iter: 1200 Time(200 iters): 0.222449s Loss: 0.488890 lr:0.001000
[14d12h38m55s][HUGECTR][INFO]: Iter: 1400 Time(200 iters): 0.235260s Loss: 0.475817 lr:0.001000
[14d12h38m55s][HUGECTR][INFO]: Iter: 1600 Time(200 iters): 0.180873s Loss: 0.542423 lr:0.001000
[14d12h38m56s][HUGECTR][INFO]: Iter: 1800 Time(200 iters): 0.180941s Loss: 0.500136 lr:0.001000
[14d12h38m56s][HUGECTR][INFO]: Iter: 2000 Time(200 iters): 0.181433s Loss: 0.496677 lr:0.001000
[4171.58, eval_start, 0.2, ]
[14d12h38m56s][HUGECTR][INFO]: Evaluation, AUC: 0.596949
[4209.16, eval_accuracy, 0.596949, 0.2, 2000, ]
[14d12h38m56s][HUGECTR][INFO]: Eval Time for 100 iters: 0.037571s
[4209.18, eval_stop, 0.2, ]
[14d12h38m56s][HUGECTR][INFO]: Iter: 2200 Time(200 iters): 0.219737s Loss: 0.477993 lr:0.001000
[14d12h38m56s][HUGECTR][INFO]: Iter: 2400 Time(200 iters): 0.182439s Loss: 0.463718 lr:0.001000
[14d12h38m56s][HUGECTR][INFO]: Iter: 2600 Time(200 iters): 0.221539s Loss: 0.498915 lr:0.001000
[14d12h38m57s][HUGECTR][INFO]: Iter: 2800 Time(200 iters): 0.179349s Loss: 0.482698 lr:0.001000
[14d12h38m57s][HUGECTR][INFO]: Iter: 3000 Time(200 iters): 0.180160s Loss: 0.527954 lr:0.001000
[5155.06, eval_start, 0.3, ]
[14d12h38m57s][HUGECTR][INFO]: Evaluation, AUC: 0.602651
[5193.13, eval_accuracy, 0.602651, 0.3, 3000, ]
[14d12h38m57s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038066s
[5193.15, eval_stop, 0.3, ]
[14d12h38m57s][HUGECTR][INFO]: Iter: 3200 Time(200 iters): 0.221074s Loss: 0.524575 lr:0.001000
[14d12h38m57s][HUGECTR][INFO]: Iter: 3400 Time(200 iters): 0.181986s Loss: 0.463912 lr:0.001000
[14d12h38m57s][HUGECTR][INFO]: Iter: 3600 Time(200 iters): 0.180396s Loss: 0.468527 lr:0.001000
[14d12h38m58s][HUGECTR][INFO]: Iter: 3800 Time(200 iters): 0.215145s Loss: 0.500164 lr:0.001000
[14d12h38m58s][HUGECTR][INFO]: Iter: 4000 Time(200 iters): 0.179426s Loss: 0.472975 lr:0.001000
[6134.85, eval_start, 0.4, ]
[14d12h38m58s][HUGECTR][INFO]: Evaluation, AUC: 0.597333
[6172.8, eval_accuracy, 0.597333, 0.4, 4000, ]
[14d12h38m58s][HUGECTR][INFO]: Eval Time for 100 iters: 0.037958s
[6172.83, eval_stop, 0.4, ]
[14d12h38m58s][HUGECTR][INFO]: Iter: 4200 Time(200 iters): 0.223053s Loss: 0.483898 lr:0.001000
[14d12h38m58s][HUGECTR][INFO]: Iter: 4400 Time(200 iters): 0.187175s Loss: 0.483433 lr:0.001000
[14d12h38m58s][HUGECTR][INFO]: Iter: 4600 Time(200 iters): 0.225835s Loss: 0.470408 lr:0.001000
[14d12h38m59s][HUGECTR][INFO]: Iter: 4800 Time(200 iters): 0.186253s Loss: 0.539371 lr:0.001000
[14d12h38m59s][HUGECTR][INFO]: Iter: 5000 Time(200 iters): 0.173810s Loss: 0.468036 lr:0.001000
[7131.3, eval_start, 0.5, ]
[14d12h38m59s][HUGECTR][INFO]: Evaluation, AUC: 0.593502
[7169.4, eval_accuracy, 0.593502, 0.5, 5000, ]
[14d12h38m59s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038095s
[7169.42, eval_stop, 0.5, ]
[14d12h38m59s][HUGECTR][INFO]: Iter: 5200 Time(200 iters): 0.211736s Loss: 0.511649 lr:0.001000
[14d12h38m59s][HUGECTR][INFO]: Iter: 5400 Time(200 iters): 0.174016s Loss: 0.521476 lr:0.001000
[14d12h38m59s][HUGECTR][INFO]: Iter: 5600 Time(200 iters): 0.174270s Loss: 0.519438 lr:0.001000
[14d12h39m00s][HUGECTR][INFO]: Iter: 5800 Time(200 iters): 0.212744s Loss: 0.475515 lr:0.001000
[14d12h39m00s][HUGECTR][INFO]: Iter: 6000 Time(200 iters): 0.174367s Loss: 0.464705 lr:0.001000
[8078.68, eval_start, 0.6, ]
[14d12h39m00s][HUGECTR][INFO]: Evaluation, AUC: 0.585992
[8117.41, eval_accuracy, 0.585992, 0.6, 6000, ]
[14d12h39m00s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038737s
[8117.44, eval_stop, 0.6, ]
[14d12h39m00s][HUGECTR][INFO]: Iter: 6200 Time(200 iters): 0.216389s Loss: 0.513463 lr:0.001000
[14d12h39m00s][HUGECTR][INFO]: Iter: 6400 Time(200 iters): 0.176376s Loss: 0.480522 lr:0.001000
[14d12h39m00s][HUGECTR][INFO]: Iter: 6600 Time(200 iters): 0.176311s Loss: 0.481125 lr:0.001000
[14d12h39m01s][HUGECTR][INFO]: Iter: 6800 Time(200 iters): 0.176517s Loss: 0.477823 lr:0.001000
[14d12h39m01s][HUGECTR][INFO]: Iter: 7000 Time(200 iters): 0.215569s Loss: 0.486222 lr:0.001000
[9040.22, eval_start, 0.7, ]
[14d12h39m01s][HUGECTR][INFO]: Evaluation, AUC: 0.594629
[9078.33, eval_accuracy, 0.594629, 0.7, 7000, ]
[14d12h39m01s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038084s
[9078.34, eval_stop, 0.7, ]
[14d12h39m01s][HUGECTR][INFO]: Iter: 7200 Time(200 iters): 0.224848s Loss: 0.461744 lr:0.001000
[14d12h39m01s][HUGECTR][INFO]: Iter: 7400 Time(200 iters): 0.180002s Loss: 0.510917 lr:0.001000
[14d12h39m01s][HUGECTR][INFO]: Iter: 7600 Time(200 iters): 0.179092s Loss: 0.485318 lr:0.001000
[14d12h39m01s][HUGECTR][INFO]: Iter: 7800 Time(200 iters): 0.178849s Loss: 0.513971 lr:0.001000
[14d12h39m02s][HUGECTR][INFO]: Iter: 8000 Time(200 iters): 0.179450s Loss: 0.551650 lr:0.001000
[9982.73, eval_start, 0.8, ]
[14d12h39m02s][HUGECTR][INFO]: Evaluation, AUC: 0.592483
[10021.4, eval_accuracy, 0.592483, 0.8, 8000, ]
[14d12h39m02s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038653s
[10021.4, eval_stop, 0.8, ]
[14d12h39m02s][HUGECTR][INFO]: Iter: 8200 Time(200 iters): 0.250040s Loss: 0.486155 lr:0.001000
[14d12h39m02s][HUGECTR][INFO]: Iter: 8400 Time(200 iters): 0.179807s Loss: 0.460268 lr:0.001000
[14d12h39m02s][HUGECTR][INFO]: Iter: 8600 Time(200 iters): 0.179834s Loss: 0.441795 lr:0.001000
[14d12h39m02s][HUGECTR][INFO]: Iter: 8800 Time(200 iters): 0.180014s Loss: 0.480294 lr:0.001000
[14d12h39m03s][HUGECTR][INFO]: Iter: 9000 Time(200 iters): 0.218753s Loss: 0.498632 lr:0.001000
[10991.5, eval_start, 0.9, ]
[14d12h39m03s][HUGECTR][INFO]: Evaluation, AUC: 0.594186
[11030.1, eval_accuracy, 0.594186, 0.9, 9000, ]
[14d12h39m03s][HUGECTR][INFO]: Eval Time for 100 iters: 0.038543s
[11030.1, eval_stop, 0.9, ]
[14d12h39m03s][HUGECTR][INFO]: Iter: 9200 Time(200 iters): 0.217161s Loss: 0.479245 lr:0.001000
[14d12h39m03s][HUGECTR][INFO]: Iter: 9400 Time(200 iters): 0.178106s Loss: 0.512554 lr:0.001000
[14d12h39m03s][HUGECTR][INFO]: Iter: 9600 Time(200 iters): 0.179836s Loss: 0.494459 lr:0.001000
[14d12h39m03s][HUGECTR][INFO]: Iter: 9800 Time(200 iters): 0.180167s Loss: 0.489115 lr:0.001000
[11926.4, train_epoch_end, 1, ]
[11926.5, run_stop, ]
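
For anyone who wants to compare their own run against the log above, here is a minimal Python sketch that extracts the per-iteration loss and the evaluation AUC values from text in this format. It assumes the log lines look exactly like the ones pasted in this thread. The names `parse_log`, `AUC_RE`, `ITER_RE`, and `sample` are made up for illustration and are not part of HugeCTR.

```python
import re

# Matches lines such as:
# [14d12h38m55s][HUGECTR][INFO]: Evaluation, AUC: 0.610862
AUC_RE = re.compile(r"Evaluation, AUC: ([0-9.]+)")

# Matches lines such as:
# [14d12h38m54s][HUGECTR][INFO]: Iter: 200 Time(200 iters): 0.176173s Loss: 0.488863 lr:0.001000
ITER_RE = re.compile(r"Iter: (\d+) Time\(\d+ iters\): [0-9.]+s Loss: ([0-9.]+)")


def parse_log(text):
    """Return (aucs, losses) found in a training log.

    aucs   -- list of float AUC values, in order of appearance
    losses -- list of (iteration, loss) tuples, in order of appearance
    """
    aucs = []
    losses = []
    for line in text.splitlines():
        match = AUC_RE.search(line)
        if match:
            aucs.append(float(match.group(1)))
            continue
        match = ITER_RE.search(line)
        if match:
            losses.append((int(match.group(1)), float(match.group(2))))
    return aucs, losses


# A few lines copied from the log in this thread.
sample = """\
[14d12h38m54s][HUGECTR][INFO]: Iter: 200 Time(200 iters): 0.176173s Loss: 0.488863 lr:0.001000
[14d12h38m55s][HUGECTR][INFO]: Iter: 1000 Time(200 iters): 0.177358s Loss: 0.463984 lr:0.001000
[3170.39, eval_start, 0.1, ]
[14d12h38m55s][HUGECTR][INFO]: Evaluation, AUC: 0.610862
"""

aucs, losses = parse_log(sample)
print("AUC values:", aucs)
print("Losses:", losses)
```

If a run reaches a `run_stop` line and produces AUC values like those above, training completed without the crash described in this issue.
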
ketanchothani13 commented 4 years ago

I see the issue's status is still Open. Is it still a bug?

shijieliu commented 4 years ago

We have fixed it. If you don't have any further questions, we will close this issue.