BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.
https://berkeleyautomation.github.io/gqcnn

Issue: Bug/Performance Issue [Replication] #114

Closed · anmakon closed this issue 4 years ago

anmakon commented 4 years ago

Describe the result you are trying to replicate

I'm trying to replicate the results from the Dex-Net 2.0 paper (https://arxiv.org/pdf/1703.09312v3.pdf) by training a GQ-CNN from scratch with the provided scripts (https://berkeleyautomation.github.io/gqcnn/replication/replication.html). I've had a look at both the benchmark results (https://berkeleyautomation.github.io/gqcnn/benchmarks/benchmarks.html) and the results in the Dex-Net 2.0 paper:

The GQ-CNN trained on all of Dex-Net 2.0 had an accuracy of 85.7% on a held-out validation set of approximately 1.3 million datapoints.

Now I'm not sure how these numbers relate to the replication training results. In particular, it seems to me that in the Dex-Net 2.0 paper the GQ-CNN was trained for only 5 epochs on 1/5th of Dex-Net (vs. 50 epochs in cfg/train_dex-net_2.0.yaml) and with a batch size of 128 (vs. 64 in cfg/train_dex-net_2.0.yaml).
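(For reference, the relevant entries of the shipped config can be dumped with a short script like the one below. This is only a sketch: it assumes PyYAML and the config path quoted above, and deliberately does not assume any particular key names; if the file uses custom YAML tags, the project's own config loader would need to be used instead.)

```python
import yaml  # PyYAML

def dump_training_keys(cfg, prefix=""):
    """Recursively print config entries that look like epoch/batch/split settings."""
    for key, val in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            dump_training_keys(val, path + ".")
        elif any(s in key.lower() for s in ("epoch", "batch", "pct", "split")):
            print(f"{path} = {val}")

with open("cfg/train_dex-net_2.0.yaml") as f:
    dump_training_keys(yaml.safe_load(f))
```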

My exact question

Could you clarify how the difference between the results in the paper and the benchmark came about? Furthermore, do the variables in cfg/train_dex-net_2.0.yaml correspond to the Dex-Net 2.0 benchmark or to the results in the Dex-Net 2.0 paper? I'm afraid my questions are quite similar to #99 and #104, but I couldn't find the answer I was looking for in those replies.

The training and analysing of the network worked like a charm for me. Thank you so much for providing and maintaining it!

visatish commented 4 years ago

Hi @anmakon,

Apologies for the confusion. To obtain the benchmark, we actually generated a new dataset with (I need to double-check this) ~2.5 million datapoints and trained on it with the parameters in cfg/train_dex-net_2.0.yaml. The object set and generation parameters (other than the dataset size) were exactly the same. This is the dataset we have released to the public.

Now the question is why parameters such as dataset size, number of epochs, and batch size differ. To be completely honest, there's no good reason for this. I came onto the project after Dex-Net 2.0 was published and refactored our learning codebase, which eventually became this GQ-CNN repo. At that time I wanted to test the whole dataset generation and training process end-to-end from scratch, and I guess I was just curious as to whether or not we actually needed 6.7 million datapoints. I also obviously tweaked the number of epochs: our data throughput during training improved significantly in the refactor, which is why we were able to effectively train on 0.8 * 2.5 * 50 = 100 million samples vs. the original 0.8 * 6.7 * 5 = 26.8 million samples in about 1/4th of the time (I believe we average ~12 hrs. on a single V100 GPU). The change in batch size is essentially arbitrary, as we've found that it doesn't really affect our learning. If anything, I guess 128 would be more efficient.
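(To make the arithmetic above explicit: the effective number of training samples is just train-split fraction × dataset size × epochs. The tiny sketch below only restates the figures already quoted in this comment.)

```python
def effective_samples_millions(dataset_millions, train_split, epochs):
    """Samples seen during training = train split * dataset size * epochs."""
    return dataset_millions * train_split * epochs

# Released dataset with the cfg/train_dex-net_2.0.yaml settings:
print(round(effective_samples_millions(2.5, 0.8, 50), 1))  # 100.0 million samples

# Original Dex-Net 2.0 paper settings:
print(round(effective_samples_millions(6.7, 0.8, 5), 1))   # 26.8 million samples
```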

But if you follow the instructions we released with the publicly available dataset, you should be able to achieve the replication results. Let me know if there's an issue there. I just recently ran it again for those earlier GitHub issues you mentioned haha.

Thanks, Vishal

anmakon commented 4 years ago

Hi Vishal,

thanks for your fast reply and detailed explanation. I didn't realise that the dataset from the Dex-Net 2.0 paper was different in size from the one you used for the Dex-Net 2.0 benchmark. Just for clarification: where did you release the dataset with fewer datapoints? I've double-checked the one I used to train from scratch, and it seems to have 6.7 million datapoints (at least the .npz files go up to 06728.npz, and I think they each contain 1000 datapoints). I downloaded the dataset from this box folder, following download_dex-net_2.0.sh.

And am I right in assuming that the classification accuracy results for the benchmark are on the validation set and refer to 100% minus the validation error rate? The error rate in this case would be the output of the GQ-CNN performance analyser.
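(For reference, the datapoint count can be checked roughly as below. This is a quick sketch only: the directory name and field prefix are placeholders for whichever per-datapoint field was downloaded, and each *_XXXXX.npz file is assumed to hold up to 1000 entries as described above.)

```python
import glob
import numpy as np

# Hypothetical path/prefix -- point this at one field of the downloaded dataset.
files = sorted(glob.glob("dexnet_2.0/robust_ferrari_canny_*.npz"))

total = 0
for path in files:
    with np.load(path) as data:
        # Count the entries in the (single) array stored in each .npz file.
        total += len(data[data.files[0]])

print(f"{len(files)} files, {total} datapoints")
```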

Thanks & kind regards, Anna

visatish commented 4 years ago

Hi Anna,

Okay, I think I was confusing this with the Dex-Net 4.0 results then; I just double-checked, and that was the ~2.5 million datapoint dataset. Sorry, it was just such a long time ago haha. Anyway, the TL;DR is that I would use the replication results rather than the original paper, since the replication was done with the latest version of the codebase.

And yes, the classification accuracy is just 100 - <final validation error during training or analysis error>%.

Happy to help, and let me know if there are any further questions!

Thanks, Vishal

anmakon commented 4 years ago

Hi Vishal,

thank you for clarifying. You answered all my questions, so I'm closing the issue.

Thanks, Anna