Error to train SimAug - Githubissues

ziyan0302 commented 2 years ago

Hello! Thank you for your work. I see you have provided a detailed tutorial for training. However, even with all the preprocessing steps finished, I cannot get code/train.py from the tutorial to run. From tracing the code, I suspect something goes wrong while the gradients are computed. I also found that if I remove the optimizer op (Trainer.train_op in your repo) from the inputs of sess.run(inputs) at line 2045, the training process starts running smoothly.

To be clear, the inputs in the original repo are:

inputs = [self.loss, self.train_op, self.wd_loss]

When I run train.py, I get the error shown below:

multiview data stats:
    min 1, max 4
    {1: 748, 2: 474, 3: 275, 4: 11121}
loaded 47005 data points for train
loaded 7839 data points for val
 batch_size:12, epoch:30, 3918 step every epoch, total step:117540, eval/save every 3000 steps

  0%|          | 0/117540 [00:00<?, ?it/s]
  0%|          | 0/117540 [00:11<?, ?it/s]
Traceback (most recent call last):
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Resource __per_step_3/gradients/AddN_6/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
     [[{{node gradients/AddN_6/tmp_var}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code/train.py", line 335, in <module>
    main(arguments)
  File "code/train.py", line 308, in main
    trainer.step(sess, batch)
  File "/home/ziyan/simaug/Multiverse/SimAug/code/pred_models.py", line 2073, in step
    outputs = sess.run(inputs, feed_dict=feed_dict)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Resource __per_step_3/gradients/AddN_6/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
     [[{{node gradients/AddN_6/tmp_var}}]]

Then, if I remove self.train_op from inputs, train.py runs smoothly:

inputs = [self.loss, self.wd_loss]

Screen logs:

multiview data stats:
    min 1, max 4
    {1: 748, 2: 474, 3: 275, 4: 11121}
loaded 47005 data points for train
loaded 7839 data points for val
 batch_size:12, epoch:30, 3918 step every epoch, total step:117540, eval/save every 3000 steps

  0%|          | 0/117540 [00:00<?, ?it/s]
  0%|          | 1/117540 [00:20<671:10:59, 20.56s/it]
  0%|          | 2/117540 [00:38<615:44:18, 18.86s/it]
  0%|          | 3/117540 [00:56<601:51:16, 18.43s/it]
  0%|          | 4/117540 [01:14<594:36:26, 18.21s/it]
  0%|          | 5/117540 [01:31<588:21:06, 18.02s/it]
  0%|          | 6/117540 [01:49<587:02:14, 17.98s/it]
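
The figures in this log can be cross-checked with a little arithmetic. The sketch below assumes that steps per epoch are the training-set size divided by the batch size, rounded up (an assumption that happens to match the logged 3918, not something confirmed from train.py), and it uses a round 18 seconds per step read off the progress bar.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Batches needed to cover the data once, counting a final partial batch.

    Rounding up is an assumption made for this sketch; it is not taken
    from train.py.
    """
    return math.ceil(num_examples / batch_size)


def estimated_hours(total_steps: int, seconds_per_step: float) -> float:
    """Wall-clock estimate in hours at a constant per-step cost."""
    return total_steps * seconds_per_step / 3600.0


# Values copied from the log above: 47005 training points, batch size 12, 30 epochs.
per_epoch = steps_per_epoch(47005, 12)
total = per_epoch * 30
# 18.0 s/step is a rounded reading of the progress bar, not a measured constant.
hours = estimated_hours(total, 18.0)

print(per_epoch, total, round(hours))
```

At about 18 seconds per step the full run would take several hundred hours, which agrees with the estimate shown in the progress bar.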

Could you check that code/train.py runs correctly on your side and share the packages installed in your environment (for example, the output of pip list)?

The conda environment I used to run your code includes:

python3
tensorflow1.15

Both are mentioned in the README.

All preprocessing steps have been completed, so I have no clue how to solve the problem. Your help would be much appreciated; I'm close to making this work! Thanks for your time.

JunweiLiang commented 2 years ago

Removing self.train_op means nothing gets trained, since that op is the optimization step. Based on some Google searches (like this), the error may be caused by some other package in the environment. Could you try again in a clean environment with just Python 3.6 (I believe that was my Python version at the time) and tf 1.15.4?
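
To illustrate why a run that only evaluates the loss does not train anything, here is a minimal sketch in plain Python. It is a toy gradient-descent loop, not the SimAug trainer: the quadratic loss, the learning rate, and the function names are all made up for the example. The apply_update flag plays the role that fetching the optimizer op plays in sess.run.

```python
def loss_and_grad(w: float) -> tuple:
    """Toy quadratic loss (w - 3)^2 and its gradient with respect to w."""
    return (w - 3.0) ** 2, 2.0 * (w - 3.0)


def run(steps: int, apply_update: bool, lr: float = 0.1) -> float:
    """Run `steps` iterations and return the final parameter value.

    With apply_update=False the loss is still computed on every step,
    but the parameter is never changed, so nothing is learned.
    """
    w = 0.0
    for _ in range(steps):
        _, grad = loss_and_grad(w)
        if apply_update:
            w -= lr * grad  # the optimization step
    return w


w_eval_only = run(50, apply_update=False)
w_trained = run(50, apply_update=True)

print(w_eval_only)  # parameter is still at its initial value
print(w_trained)    # parameter has moved close to the minimum at 3.0
```

Both loops iterate without errors, which is why a smoothly advancing progress bar alone does not show that the model is learning.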

108618026 commented 2 years ago

Hello! @JunweiLiang Thank you for your great work.

I tried to train the SimAug model with Argoverse validation data (following the preprocessing rules in Multiverse/SimAug/PREPRO.md), but I get the following error:

Traceback (most recent call last):
  File "code/train.py", line 323, in <module>
    main(arguments)
  File "code/train.py", line 296, in main
    trainer.step(sess, batch)
  File "C:\Users\asd1565\Desktop\tempt\Multiverse\SimAug\code\pred_models.py", line 2056, in step
    outputs = sess.run(inputs, feed_dict=feed_dict)
  File "C:\Users\asd1565\miniconda3\envs\SimAug\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\Users\asd1565\miniconda3\envs\SimAug\lib\site-packages\tensorflow_core\python\client\session.py", line 1156, in _run
    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (12, 8) for Tensor 'obs_scene_extra:0', which has shape '(12, 1, ?)'

The miniconda environment I used to run your training code includes python=3.6, tensorflow-gpu=1.15.4, cuda-toolkit=10.1, cudnn=7.6.5, and the requirements that you mention in the README.

Could you give me a hint on how to handle this error? Thanks in advance!

JunweiLiang commented 2 years ago

What is the command that you ran? The error means that you are feeding a value with the wrong shape to an input tensor.
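
The shape mismatch in the error message can be reproduced with a small stand-alone check. The sketch below is only an imitation of the kind of comparison that rejects the feed, written in plain Python; it is not TensorFlow's implementation. An unknown dimension (printed as ? in the error) is represented by None, and the (12, 1, 8) shape is a hypothetical value chosen to show a match, not necessarily the shape SimAug expects.

```python
from typing import Optional, Sequence


def shapes_compatible(
    value_shape: Sequence[int],
    expected_shape: Sequence[Optional[int]],
) -> bool:
    """Return True if a value of `value_shape` fits `expected_shape`.

    The ranks must be equal, and every known expected dimension must
    equal the value's dimension. None stands for an unknown dimension.
    """
    if len(value_shape) != len(expected_shape):
        return False
    return all(e is None or e == v for v, e in zip(value_shape, expected_shape))


# Shapes from the error: a rank-2 value fed to a rank-3 tensor.
print(shapes_compatible((12, 8), (12, 1, None)))     # False
# Hypothetical rank-3 value with a matching middle dimension.
print(shapes_compatible((12, 1, 8), (12, 1, None)))  # True
```

Here the fed value has two dimensions while the tensor expects three, so the data or the flags that build the graph disagree about whether this input carries the extra middle dimension.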

108618026 commented 2 years ago

The command I ran is the following.

python code/train.py argoverse_prepro packed_models/ jason_simaug_model --wd 0.001 --runId 0 --obs_len 8 --pred_len 12 --emb_size 32 --enc_hidden_size 256 --dec_hidden_size 256 --activation_func tanh --keep_prob 1.0 --num_epochs 30 --batch_size 12 --init_lr 0.3 --use_gnn --learning_rate_decay 0.95 --num_epoch_per_decay 8.0 --grid_loss_weight 1.0 --grid_reg_loss_weight 0.5 --save_period 3000 --scene_h 36 --scene_w 64 --scene_conv_kernel 3 --scene_conv_dim 64 --scene_grid_strides 2,4 --use_grids 1,0 --val_grid_num 0 --train_w_onehot --adv_epsilon 0.1 --mixup_alpha 0.2 --multiview_train --multiview_exp 3 --gpuid 0

Thanks for your reply!

JunweiLiang commented 2 years ago

What about the preprocessing logs? Did you see any errors during that process? Especially when getting scene features.

108618026 commented 2 years ago

I use the Argoverse validation data to train the model. The data come from the commands you provide in SimAug/TESTING.md:

$ wget https://next.cs.cmu.edu/data/packed_prepro_eccv2020.tgz
$ tar -zxvf packed_prepro_eccv2020.tgz

Should I comment out --multiview_train --multiview_exp 3, since I only use one viewpoint (ring_front_center)? I appreciate your reply.

108618026 commented 2 years ago

Excuse me, @JunweiLiang! If I only have a single-view dataset, can I still train the SimAug model?

Thanks in advance!

JunweiLiang commented 2 years ago

No, the idea of SimAug is to train with multi-view samples.

HRHLALALA commented 2 years ago

Hi, I am facing the same error.

I am using tensorflow-gpu==1.15.4. I tried different machines (RTX3090, V100 32G), but it works on none of them. I can't upgrade TensorFlow because that would move me to tf2.

I am running this command:

python code/train.py actev_preprocess multiverse-models new_train/ --wd 0.001 --runId 0 --obs_len 8 --pred_len 12 --emb_size 32 --enc_hidden_size 256 --dec_hidden_size 256 --activation_func tanh --keep_prob 1.0 --num_epochs 80 --batch_size 20 --init_lr 0.3 --use_gnn --use_scene --learning_rate_decay 0.95 --num_epoch_per_decay 2.0 --grid_loss_weight 1.0 --grid_reg_loss_weight 0.2 --save_period 2000 --scene_h 36 --scene_w 64 --scene_conv_kernel 3 --scene_conv_dim 64 --scene_grid_strides 2,4 --use_grids 1,1 --val_grid_num 0 --train_w_onehot --gpuid 0

Can you help?

JunweiLiang commented 2 years ago

@ziyan0302 @HRHLALALA I'll install an environment to debug this weekend

HRHLALALA commented 2 years ago

> @ziyan0302 @HRHLALALA I'll install an environment to debug this weekend

Hi, I have figured out why it happened. The code does not run on the GPU if we simply install the library with 'pip install tensorflow-gpu==1.15.4'. We still need to install cudnn and cudatoolkit using conda. For the RTX 30 series, which uses cu11, we need to install nvidia-tensorflow. Sorry, we are still not very familiar with TensorFlow.

JunweiLiang commented 2 years ago

@ziyan0302 could you confirm that? BTW, you can run [tf.test.is_gpu_available](https://www.tensorflow.org/api_docs/python/tf/test/is_gpu_available) to check
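
A minimal sketch of that check, wrapped so it also reports a missing TensorFlow install instead of crashing. The gpu_status helper and its return strings are made up for this example; only tf.test.is_gpu_available comes from the TensorFlow API linked above.

```python
def gpu_status() -> str:
    """Report whether TensorFlow can see a GPU.

    gpu_status is a hypothetical helper written for this example.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    # Look the function up defensively in case a given TensorFlow
    # version does not expose it.
    check = getattr(tf.test, "is_gpu_available", None)
    if check is None:
        return "is_gpu_available not found"
    return "gpu available" if check() else "cpu only"


print(gpu_status())
```

If this reports "cpu only" on a machine that has a GPU, the CUDA and cuDNN libraries TensorFlow needs are probably not visible in the environment.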

JunweiLiang / Multiverse

Error to train SimAug #31