Arturus / kaggle-web-traffic

1st place solution
MIT License

Run submission-final for only one model #2

Closed rgualan closed 6 years ago

rgualan commented 6 years ago

First of all, thanks for the excellent code. Now the problem: since I only have one GPU (Nvidia Quadro), I was able to train only one model, using:

python trainer.py --name s32 --hparam_set=s32 --n_models=1 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

To run the submission-final file, I changed the corresponding cell as follows:

for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                           n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

to account for only one model's checkpoints. However, I am getting an error that I cannot solve:

tf.reset_default_graph()
preds = predict(paths, default_hparams(), back_offset=0,
                n_models=3, target_model=0, seed=2, batch_size=2048, asgd=True)

t_preds = []
for tm in range(1):
    tf.reset_default_graph()
    t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
                           n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))


ValueError                                Traceback (most recent call last)
in ()
      6     tf.reset_default_graph()
      7     t_preds.append(predict(paths, build_hparams(hparams.params_s32), back_offset=0, predict_window=63,
----> 8                 n_models=1, target_model=tm, seed=2, batch_size=2048, asgd=True))

~/projects/kaggle-web-traffic/trainer.py in predict(checkpoints, hparams, return_x, verbose, predict_window, back_offset, n_models, target_model, asgd, seed, batch_size)
    691     else:
    692         var_list = None
--> 693     saver = tf.train.Saver(name='eval_saver', var_list=var_list)
    694     x_buffer = []
    695     predictions = None

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number, save_relative_paths, filename)
   1216     self._filename = filename
   1217     if not defer_build and context.in_graph_mode():
-> 1218       self.build()
   1219     if self.saver_def:
   1220       self._check_saver_def()

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in build(self)
   1225     if context.in_eager_mode():
   1226       raise ValueError("Use save/restore instead of build in eager mode.")
-> 1227     self._build(self._filename, build_save=True, build_restore=True)
   1228
   1229   def _build_eager(self, checkpoint_path, build_save, build_restore):

~/bin/anaconda3/envs/kwt/lib/python3.6/site-packages/tensorflow/python/training/saver.py in _build(self, checkpoint_path, build_save, build_restore)
   1249         return
   1250       else:
-> 1251         raise ValueError("No variables to save")
   1252     self._is_empty = False
   1253

ValueError: No variables to save

Sorry for the question and thanks in advance for your comment.

Arturus commented 6 years ago

You can train 3 models on a single GPU: just don't pass the --multi_gpu flag to trainer.py. You can also reduce memory requirements for training with the --eval_memsize parameter (default is 5; try 2 or 1), and for prediction with the batch_size parameter of predict().
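
As an illustration of the advice above, here is a minimal sketch that assembles a trainer.py command line for three models on one GPU. The helper `build_train_command` is hypothetical (it is not part of the repository); it only combines flags quoted in this thread. The `--eval_memsize=2` value is just one of the suggested settings, and whether it has any effect together with `--no_eval` is not confirmed here.

```python
import shlex


def build_train_command(name="s32", n_models=3, eval_memsize=2, multi_gpu=False):
    """Hypothetical helper: build a trainer.py argument list from flags seen in this thread."""
    args = [
        "python", "trainer.py",
        "--name", name,
        f"--hparam_set={name}",
        f"--n_models={n_models}",
        "--no_eval",
        "--no_forward_split",
        "--asgd_decay=0.99",
        "--max_steps=11500",
        "--save_from_step=10500",
        f"--eval_memsize={eval_memsize}",  # assumption: lower value to reduce memory use
    ]
    if multi_gpu:
        # Left out by default, so all models are trained on a single GPU.
        args.append("--multi_gpu")
    return args


cmd = build_train_command()
# Print a shell-safe command line that can be copied into a terminal.
print(" ".join(shlex.quote(a) for a in cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)` from the repository directory.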

Single-model and multi-model graphs have slightly different structures. Your error looks like you trained a 3-model graph and are now trying to use it in single-model mode.
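
One way to keep the prediction cell consistent with the number of models trained is to derive every predict() call from a single `n_models` value. The sketch below is an assumption about how the notebook loop could be organized, not the author's code: `predict_kwargs` is a hypothetical helper, and the keyword values are copied from the cell quoted in the question.

```python
def predict_kwargs(n_models, seed=2, batch_size=2048):
    """Hypothetical helper: one keyword-argument dict per target model.

    n_models should match the --n_models value used for training.
    """
    common = dict(back_offset=0, predict_window=63, n_models=n_models,
                  seed=seed, batch_size=batch_size, asgd=True)
    # One call per model in the trained graph: target_model = 0 .. n_models - 1.
    return [dict(common, target_model=tm) for tm in range(n_models)]


calls = predict_kwargs(3)
for kw in calls:
    print(kw)
    # In the notebook, each dict would be used roughly as:
    #   tf.reset_default_graph()
    #   t_preds.append(predict(paths, build_hparams(hparams.params_s32), **kw))
```

Lowering `batch_size` here is the prediction-side memory control mentioned above.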