Error reproducing competition results

ndkulkarni commented 5 years ago

I am trying to reproduce the competition results based on the instructions in the README.

I download and unzip the files from the kaggle competition into the data/ folder
I run the command python make_features.py data/vars --add_days=63 which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder
I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance with the Deep Learning AMI with Python 3.6.5 and Tensorflow-gpu==1.12.0

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this? Full output from train command

limu1928 commented 5 years ago

I have the same problem. Did you figure it out?

limu1928 commented 5 years ago

I am trying to reproduce the competition results based on the instructions in the README.

I download and unzip the files from the kaggle competition into the data/ folder

I run the command python make_features.py data/vars --add_days=63 which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder

I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance with the Deep Learning AMI with Python 3.6.5 and Tensorflow-gpu==1.12.0

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this? Full output from train command SImply restart a new instance will work...

Arturus / kaggle-web-traffic

Error reproducing competition results #32