ziyan0302 closed this issue 2 years ago
Removing self.train_op would result in not training anything (that is the optimization step). Based on some Google searches (like this), the error may be caused by other packages in the environment. Could you try again in a clean environment with just python3.6 (I believe that was my Python version at the time) + tf 1.15.4?
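For anyone reproducing this, a minimal sketch of checking that the active environment matches the versions suggested above. The helper names `describe_environment` and `matches_expected` are made up for illustration and are not part of the repo; the expected versions are the ones mentioned in this thread.

```python
import sys


def describe_environment():
    """Return the Python version and, if importable, the TensorFlow version."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        import tensorflow as tf
        info["tensorflow"] = tf.__version__
    except Exception:
        # TensorFlow is missing, or the install is broken in some other way.
        info["tensorflow"] = None
    return info


def matches_expected(info, python_prefix="3.6.", tf_version="1.15.4"):
    """Compare against the versions suggested in this thread (assumed defaults)."""
    return (info["python"].startswith(python_prefix)
            and info["tensorflow"] == tf_version)


if __name__ == "__main__":
    env = describe_environment()
    print(env)
    print("matches python3.6 + tf 1.15.4:", matches_expected(env))
```

If this reports a mismatch, the output of pip list in the same environment should show which other packages were pulled in.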
Hello! @JunweiLiang Thank you for your great work.
I tried to train the SimAug model on the Argoverse validation data (following the preprocessing rules in Multiverse/SimAug/PREPRO.md), but I ran into the following error:
Traceback (most recent call last):
File "code/train.py", line 323, in
The miniconda environment I used to run your training code includes: python=3.6, tensorflow-gpu=1.15.4, cuda-toolkit=10.1, cudnn=7.6.5, and the requirements that you mention in the README.
Could you give me a hint on how to handle this error? Thanks in advance!
What command did you run? This error means that you are feeding the wrong inputs to a variable.
The command I ran is the following.
python code/train.py argoverse_prepro packed_models/ jason_simaug_model --wd 0.001 --runId 0 --obs_len 8 --pred_len 12 --emb_size 32 --enc_hidden_size 256 --dec_hidden_size 256 --activation_func tanh --keep_prob 1.0 --num_epochs 30 --batch_size 12 --init_lr 0.3 --use_gnn --learning_rate_decay 0.95 --num_epoch_per_decay 8.0 --grid_loss_weight 1.0 --grid_reg_loss_weight 0.5 --save_period 3000 --scene_h 36 --scene_w 64 --scene_conv_kernel 3 --scene_conv_dim 64 --scene_grid_strides 2,4 --use_grids 1,0 --val_grid_num 0 --train_w_onehot --adv_epsilon 0.1 --mixup_alpha 0.2 --multiview_train --multiview_exp 3 --gpuid 0
Thanks for your reply!
What about the preprocessing logs? Did you see any errors during that process? Especially when getting scene features.
I used the Argoverse validation data to train the model. I obtained the data with the commands you provide in SimAug/TESTING.md:
$ wget https://next.cs.cmu.edu/data/packed_prepro_eccv2020.tgz
$ tar -zxvf packed_prepro_eccv2020.tgz
Should I comment out these flags: --multiview_train --multiview_exp 3? I ask because I only use one viewpoint, ring_front_center. I appreciate your reply.
Excuse me, @JunweiLiang: if I only have a single-view dataset, can I still train the SimAug model?
Thanks in advance!
No, the idea of SimAug is to train with multi-view samples.
Hi, I am facing the same error.
I am using tensorflow-gpu==1.15.4. I tried different machines (RTX3090, V100 32G), but none of them work. I can't upgrade TensorFlow because that would move me to tf2.
I am running this command:
python code/train.py actev_preprocess multiverse-models new_train/ --wd 0.001 --runId 0 --obs_len 8 --pred_len 12 --emb_size 32 --enc_hidden_size 256 --dec_hidden_size 256 --activation_func tanh --keep_prob 1.0 --num_epochs 80 --batch_size 20 --init_lr 0.3 --use_gnn --use_scene --learning_rate_decay 0.95 --num_epoch_per_decay 2.0 --grid_loss_weight 1.0 --grid_reg_loss_weight 0.2 --save_period 2000 --scene_h 36 --scene_w 64 --scene_conv_kernel 3 --scene_conv_dim 64 --scene_grid_strides 2,4 --use_grids 1,1 --val_grid_num 0 --train_w_onehot --gpuid 0
Can you help?
@ziyan0302 @HRHLALALA I'll install an environment to debug this weekend
Hi, I figured out why it happened. The code does not run on the GPU if we simply install the library using ‘pip install tensorflow-gpu==1.15.4’. We still need to install cudnn and cudatoolkit using conda. For RTX 30 series cards, which use cu11, we need to install nvidia-tensorflow. Sorry, we are still not very familiar with tensorflow.
@ziyan0302 could you confirm that? BTW, you can run [tf.test.is_gpu_available](https://www.tensorflow.org/api_docs/python/tf/test/is_gpu_available) to check.
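A minimal sketch of that check, assuming it is run inside the same conda environment as train.py. The wrapper name `gpu_status` is made up for illustration; `tf.test.is_built_with_cuda` and `tf.test.is_gpu_available` are the TensorFlow calls (the latter is deprecated in TF2 but is the one available in 1.15).

```python
def gpu_status():
    """Report whether TensorFlow was built with CUDA and can see a GPU.

    Returns None if TensorFlow cannot be imported at all.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        # False here usually means a CPU-only build was installed.
        "built_with_cuda": bool(tf.test.is_built_with_cuda()),
        # False here with a CUDA build usually points to missing or
        # mismatched cudatoolkit/cudnn libraries or drivers.
        "gpu_available": bool(tf.test.is_gpu_available()),
    }


if __name__ == "__main__":
    print(gpu_status())
```

If `gpu_available` is False, the training code is running on the CPU, which matches the situation described above.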
Hello! Thank you for your work. I see you have provided a detailed tutorial for training. However, even with all the preprocessing steps finished, I keep failing to run code/train.py from the tutorial. From tracing the code, I suspect something goes wrong when the gradients are computed. Furthermore, I find that if I remove the optimizer (Trainer.train_op in your repo) from the inputs of sess.run(inputs) in line 2045, the training process starts running smoothly.
To be clear, the inputs in the original repo are:
When I run train.py, I get the error shown below:
Then, if I remove self.train_op from inputs, train.py runs smoothly. Screen logs:
Could you check that code/train.py runs correctly on your side and share the packages installed in your environment (e.g. the output of pip list)?
The conda environment I used to execute your code includes:
All preprocessing steps have been done, so I have no clue how to solve the problem. Your help would be much appreciated; I'm close to making this work! Thanks for your time.