BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.
https://berkeleyautomation.github.io/gqcnn

Error while training from scratch : DEXNET 2.0 #58

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hi @visatish, I am trying to train a GQ-CNN model on Dex-Net 2.0 from scratch. Training starts, but after a few moments it throws the following error:

Process Process-1:
Process Process-3:
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)
Process Process-2:
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)

Thanks in advance! Vaibhav

visatish commented 5 years ago

Hi @Vaibhavjolly,

I just ran the Dex-Net 2.0 training a few times again and was not able to reproduce this error. Could you provide:

1) The exact command you are running.
2) The epoch/step at which you are seeing this. (I'm interested in how early on it is happening.)

Thanks, Vishal

ghost commented 5 years ago

Hi @visatish, I followed your documentation (Training from scratch, Dex-Net 2.0). I downloaded the dataset by running:

./scripts/downloads/datasets/download_dex-net_2.0.sh

and then executed:

./scripts/training/train_dex-net_2.0.sh

It has not run for a single epoch. This is the exact log:

(py27) dl@dl-machine:~/gqcnn$ ./scripts/training/train_dex-net_2.0.sh
WARNING:root:Failed to import geometry msgs in rigid_transformations.py.
WARNING:root:Failed to import ros dependencies in rigid_transforms.py
WARNING:root:autolab_core not installed as catkin package, RigidTransform ros methods will be unavailable
root WARNING autolab_perception is not installed as a catkin package - ROS msg conversions will not be available for image wrappers
root WARNING autolab_perception is not installed as a catkin package - ROS msg conversions will not be available for image wrappers
root WARNING Unable to import pylibfreenect2. Python-only Kinect driver may not work properly.
root WARNING Unable to import openni2 driver. Python-only Primesense driver may not work properly
root WARNING Failed to import ROS in primesense_sensor.py. ROS functionality not available
root WARNING primesense_sensor.py not installed as catkin package. ROS functionality not available.
root WARNING Failed to import ROS in ensenso_sensor.py. ROS functionality not available
trimesh WARNING No FCL -- collision checking will not work
OpenGL.acceleratesupport INFO OpenGL_accelerate module loaded
OpenGL.arrays.arraydatatype INFO Using accelerated ArrayDatatype
GQCNNModelFactory INFO Initializing GQ-CNN with Tensorflow as backend...
root INFO Root logger now logging to /home/dl/gqcnn/tools/../models/GQCNN-2.0/training.log
GQCNNTrainerTF INFO Saving model to: /home/dl/gqcnn/tools/../models/GQCNN-2.0
GQCNNTrainerTF INFO Training split: image_wise found in dataset.
GQCNNTrainerTF INFO Percent positive in train: 0.1920775838367626
GQCNNTrainerTF INFO Percent positive in val: 0.19251506572445515
GQCNNTF INFO Initializing TF Session...
2019-04-13 17:26:58.957482: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-04-13 17:26:59.598018: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5639b82689b0 executing computations on platform CUDA. Devices:
2019-04-13 17:26:59.598106: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-04-13 17:26:59.598133: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-04-13 17:26:59.598161: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-04-13 17:26:59.598186: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-04-13 17:26:59.607981: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200005000 Hz
2019-04-13 17:26:59.610983: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5639b8611bd0 executing computations on platform Host. Devices:
2019-04-13 17:26:59.611056: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-04-13 17:26:59.612200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:18:00.0
totalMemory: 10.92GiB freeMemory: 7.93GiB
2019-04-13 17:26:59.613281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3b:00.0
totalMemory: 10.92GiB freeMemory: 10.37GiB
2019-04-13 17:26:59.614381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:86:00.0
totalMemory: 10.92GiB freeMemory: 10.37GiB
2019-04-13 17:26:59.615194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:af:00.0
totalMemory: 10.92GiB freeMemory: 10.37GiB
2019-04-13 17:26:59.616182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3
2019-04-13 17:26:59.623348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-13 17:26:59.623383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3
2019-04-13 17:26:59.623401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y Y Y
2019-04-13 17:26:59.623416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N Y Y
2019-04-13 17:26:59.623433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: Y Y N Y
2019-04-13 17:26:59.623450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: Y Y Y N
2019-04-13 17:26:59.626254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7718 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:18:00.0, compute capability: 6.1)
2019-04-13 17:26:59.627001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10089 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3b:00.0, compute capability: 6.1)
2019-04-13 17:26:59.627673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10089 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:86:00.0, compute capability: 6.1)
2019-04-13 17:26:59.628338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10089 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:af:00.0, compute capability: 6.1)
GQCNNTF INFO Building Network...
GQCNNTF INFO Building Image Stream...
GQCNNTF INFO Building convolutional layer: conv1_1...
GQCNNTF INFO Reinitializing layer conv1_1.
tensorflow WARNING From /home/dl/miniconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
GQCNNTF INFO Building convolutional layer: conv1_2...
GQCNNTF INFO Reinitializing layer conv1_2.
GQCNNTF INFO Building convolutional layer: conv2_1...
GQCNNTF INFO Reinitializing layer conv2_1.
GQCNNTF INFO Building convolutional layer: conv2_2...
GQCNNTF INFO Reinitializing layer conv2_2.
GQCNNTF INFO Building fully connected layer: fc3...
GQCNNTF INFO Reinitializing layer fc3.
tensorflow WARNING From build/bdist.linux-x86_64/egg/gqcnn/model/tf/network_tf.py:970: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
GQCNNTF INFO Building Pose Stream...
GQCNNTF INFO Building Fully Connected Pose Layer: pc1...
GQCNNTF INFO Reinitializing layer pc1
GQCNNTF INFO Building Merge Stream...
GQCNNTF INFO Building Merge Layer: fc4...
GQCNNTF INFO Reinitializing layer fc4.
GQCNNTF INFO Building fully connected layer: fc5...
GQCNNTF INFO Reinitializing layer fc5.
GQCNNTF INFO Building Softmax Layer...
GQCNNTrainerTF INFO Beginning Optimization...
Process Process-1:
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)
Process Process-2:
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)
Process Process-3:
Traceback (most recent call last):
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/dl/miniconda3/envs/py27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "build/bdist.linux-x86_64/egg/gqcnn/training/tf/trainer_tf.py", line 1216, in _load_and_enqueue
    train_poses[start_i:end_i,:] = train_poses_arr.copy()
ValueError: could not broadcast input array from shape (64,7) into shape (64,1)

visatish commented 5 years ago

Hmm... this is weird. Sorry, but could you wrap that line in a try/except and print the following:

1) self.gripper_mode
2) self.pose_mean.shape and self.pose_std.shape (in case there is some weird broadcasting going on afterwards)

Essentially, this line is supposed to slice out the corresponding part of the saved pose tensor (dim 7) for training (should be dim 1 for the 'legacy_parallel_jaw' gripper mode).
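For readers following along, here is a minimal sketch of the kind of try/except diagnostic requested above. It does not use the gqcnn trainer itself: `assign_with_diagnostics` is a hypothetical helper, the arrays are stand-ins with the shapes from the traceback, and the one-column slice `[:, 2:3]` is only an assumed example of a reduced pose.

```python
import numpy as np


def assign_with_diagnostics(dst, src, start_i, end_i, context):
    """Try dst[start_i:end_i, :] = src; on a shape mismatch, print context.

    Hypothetical helper for illustration; not part of gqcnn.
    Returns True if the assignment succeeded, False otherwise.
    """
    try:
        dst[start_i:end_i, :] = src.copy()
        return True
    except ValueError as err:
        print("Assignment failed: %s" % err)
        for name in sorted(context):
            print("  %s = %r" % (name, context[name]))
        return False


batch_size = 64
train_poses = np.zeros((batch_size, 1))     # buffer expects 1 pose dim
train_poses_arr = np.ones((batch_size, 7))  # saved pose tensor has 7 dims

# Values the maintainer asked to print (stand-ins for the trainer's attributes).
context = {
    "gripper_mode": "legacy_parallel_jaw",
    "pose_mean.shape": (7,),
    "pose_std.shape": (7,),
}

# Unsliced (64, 7) data cannot be written into a (64, 1) buffer.
ok_full = assign_with_diagnostics(train_poses, train_poses_arr, 0, batch_size, context)

# A single-column slice (assumed column, for illustration) fits.
ok_sliced = assign_with_diagnostics(train_poses, train_poses_arr[:, 2:3], 0, batch_size, context)
```

The first call fails with the same kind of broadcast error as in the traceback and prints the context; the second succeeds because the source has already been reduced to one column.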

ghost commented 5 years ago

Hi @visatish, printing those gives the following output:

self.gripper_mode: legacy_parallel_jaw
self.pose_mean.shape: (7,)
self.pose_std.shape: (7,)

I want to train the model from scratch, but I am stuck here! If I comment out line 1216, train_poses[start_i:end_i,:] = train_poses_arr.copy(), training starts. But I shouldn't train with that line commented out, right? Also, may I know which version of TensorFlow you are using?

visatish commented 5 years ago

Hi @Vaibhavjolly,

When I run things on my end, both shapes are (1,). Do you have the latest version of the master branch? I am using Tensorflow 1.13.1, but I don't think that has anything to do with this.

To give you some visibility on what's going on, at this point, the pose should be converted from (7,) to (1,). This is done through read_pose_data, which is located here. It would be a good first step to make sure that this is actually called.
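As a rough illustration of that (7,) to (1,) conversion, the sketch below reduces a raw pose array to one column before computing the mean and standard deviation. This is not the actual read_pose_data implementation: `select_pose_columns` is a hypothetical stand-in, and the choice of column index 2 for the 'legacy_parallel_jaw' mode is an assumption made for the example.

```python
import numpy as np


def select_pose_columns(pose_arr, gripper_mode):
    """Hypothetical stand-in for a read_pose_data-style reduction.

    Assumption for illustration: 'legacy_parallel_jaw' keeps one column
    (index 2 here) of the 7-dim saved pose.
    """
    if gripper_mode == "legacy_parallel_jaw":
        return pose_arr[:, 2:3]
    raise ValueError("Unsupported gripper mode: %s" % gripper_mode)


rng = np.random.RandomState(0)
raw_poses = rng.rand(64, 7)  # saved pose tensor, dim 7

reduced = select_pose_columns(raw_poses, "legacy_parallel_jaw")

# Statistics computed on the reduced array have shape (1,), not (7,).
pose_mean = reduced.mean(axis=0)
pose_std = reduced.std(axis=0)

# Normalization then broadcasts cleanly against the (64, 1) batch.
normalized = (reduced - pose_mean) / pose_std
```

If the mean and std were instead computed on the raw array, they would have shape (7,), which matches the shapes reported earlier in the thread.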

One thing I would try is deleting the model dir before you train again, which should be models/GQCNN-2.0 if you are using the provided shell script. The reason for this is that if the training script finds a cached pose mean/std already in there, it will try to use it. Now this shouldn't be a problem, but I just want to make sure we start with a clean slate.
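A small sketch of that clean-slate step, under stated assumptions: the cache file name used in the usage note (`pose_mean.npy`) is a guess for illustration, and both helpers are hypothetical rather than part of gqcnn. The idea is to check whether a cached statistic has the expected shape and, if not, remove the model directory before retraining.

```python
import os
import shutil

import numpy as np


def cached_stat_matches(path, expected_shape):
    """Return True if no cache file exists at path, or its array has expected_shape."""
    if not os.path.exists(path):
        return True
    return np.load(path).shape == expected_shape


def reset_model_dir(model_dir):
    """Delete model_dir if it exists. Returns True if something was removed."""
    if os.path.isdir(model_dir):
        shutil.rmtree(model_dir)
        return True
    return False
```

For example, with `model_dir = "models/GQCNN-2.0"`, calling `cached_stat_matches(os.path.join(model_dir, "pose_mean.npy"), (1,))` and then `reset_model_dir(model_dir)` when it returns False would mirror the manual deletion suggested above. Note that `shutil.rmtree` deletes everything in the directory, including any pretrained weights stored there.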

In the meantime, I will continue to try to replicate the problem on my end.

Thanks, Vishal

ghost commented 5 years ago

Hi @visatish, yes, it started training. That was exactly the problem: I deleted the pretrained model directory (GQCNN-2.0), ran the training script again, and training started. Thanks a lot, Vaibhav

visatish commented 5 years ago

Hi @Vaibhavjolly,

That's great to hear! Just to confirm, you uncommented this line: train_poses[start_i:end_i,:] = train_poses_arr.copy(), right?

Thanks, Vishal

ghost commented 5 years ago

Hi @visatish, yes, I uncommented that! I am actually trying to understand the code. If you have any docs for it, could you please share them, or do you have any suggestions? Thanks! Vaibhav

visatish commented 5 years ago

Hi @Vaibhavjolly,

You can refer to the API docs here. This will give you a high-level overview of how to use the various classes.

Thanks, Vishal

ghost commented 5 years ago

Thanks! @visatish

kevincheng3 commented 5 years ago

I also hit the same error when using Dex-Net 2.0. I am only evaluating the pre-trained GQ-CNN model: $ ./scripts/policies/run_all_dex-net_2.0_examples.sh [Screenshot from 2019-09-24 16-57-52]