BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.
https://berkeleyautomation.github.io/gqcnn

Issue: Bug/Performance Issue #104

Closed by JohnsonQi 4 years ago

JohnsonQi commented 4 years ago

I am trying to replicate the training result from the documentation (https://berkeleyautomation.github.io/gqcnn/index.html). I used train_dex-net_2.0.yaml to train the GQ-CNN, but I did not get the expected results. It is really strange that training took only 30 minutes for 5 epochs on the full Dex-Net 2.0 dataset you provided (https://berkeley.app.box.com/s/6mnb2bzi5zfa7qpwyn7uq5atb7vbztng/folder/25803680060).

How can I fix this problem?

visatish commented 4 years ago

Hi @JohnsonQi,

Glad to hear that you're using the GQ-CNN package, and apologies for the trouble.

The training should definitely take longer than that. Can you share models/GQCNN-2.0/training.log?

Thanks, Vishal

JohnsonQi commented 4 years ago

Hi @visatish,

Thanks for your reply! Here is my training log; I can't figure out what is wrong. I set "train_pct" = 0.8 and "total_pct" = 1. training.log
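For clarity, here is a small sketch of how these two fractions appear to interact. The semantics (`total_pct` selecting how much of the dataset is used at all, `train_pct` splitting that subset into train/validation) are assumed from this thread, not verified against the config docs:

```python
# Illustrative sketch of the two split fractions discussed above.
# Assumption: `total_pct` picks the fraction of the full dataset to use,
# and `train_pct` splits that subset into training vs. validation data.
dataset_size = 2102640   # sample count taken from visatish's calculation below
total_pct = 1.0          # use the entire dataset
train_pct = 0.8          # 80/20 train/validation split

used_samples = int(dataset_size * total_pct)
train_samples = int(used_samples * train_pct)
val_samples = used_samples - train_samples
print(train_samples, val_samples)  # 1682112 420528
```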

Kind regards, Johnson

visatish commented 4 years ago

Hi @JohnsonQi,

I noticed that you're having the same issue as https://github.com/BerkeleyAutomation/gqcnn/issues/99, which was resolved over email. It turned out that the benchmark we provided was actually trained for 50 epochs instead of the default 25. I will push a fix for that shortly.
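As a quick way to confirm which epoch count a config actually carries, something like the following sketch could help. It assumes the key is named `num_epochs` and uses autolab_core's `YamlConfig`, which the gqcnn training tools rely on; both details are assumptions, so verify them against your install:

```python
# Minimal sketch: print the epoch count from the training config.
# Assumes the key `num_epochs` and autolab_core's YamlConfig loader,
# as used by the gqcnn training tools; check both against your version.
from autolab_core import YamlConfig

cfg = YamlConfig("cfg/train_dex-net_2.0.yaml")
print(cfg["num_epochs"])  # raise this to 50 in the YAML to match the benchmark
```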

It does seem like you are training on the entire dataset (26283 steps × 64 samples/step (batch size) × 1.25 (to account for the 0.8 training split) = 2,102,640 samples). Can you try training for 50 epochs? I'm not sure where the 5 epochs came from, unless you manually lowered the setting.
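The arithmetic above can be checked in a few lines; all numbers come straight from the thread (26283 steps from training.log, batch size 64, and the 0.8 training split that yields the 1.25 factor):

```python
# Verify the sample-count estimate from the comment above.
steps = 26283              # optimization steps reported in training.log
batch_size = 64            # samples per step
train_pct = 0.8            # training split; 1 / 0.8 = 1.25

total_samples = steps * batch_size / train_pct
print(int(total_samples))  # 2102640 -> consistent with the full dataset
```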

In the meantime, I will try to replicate the result again on my end, although I did replicate it earlier this year for the other issue.

Thanks, Vishal