apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

How to identify whether the content of the rec files for training are correctly generated? #6041

Closed dushoufu closed 7 years ago

dushoufu commented 7 years ago

I am training VGG on MS-Celeb-1M and it doesn't converge at all. I'm convinced the problem is related to the dataset itself, because the same setup performs well on other datasets. I also tried subsets of MS-Celeb-1M with different numbers of classes and samples, but the result is the same.

I wonder whether anyone has had a similar experience. In particular, I suspect the input data was generated incorrectly by the im2rec program, but I don't know how to check whether the rec files are correct.
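One way to approach this check, as a minimal sketch rather than a definitive procedure: read the labels back out of the .rec file and look at their distribution. `mx.recordio.MXRecordIO` and `mx.recordio.unpack` are MXNet's Python record-reading calls. The helper names `read_rec_labels` and `summarize_labels`, the `limit` parameter, and the particular checks are illustrative assumptions, and the reader assumes one scalar label per image.

```python
from collections import Counter


def read_rec_labels(rec_path, limit=10000):
    """Read up to `limit` labels back from a .rec file (requires mxnet).

    Assumes each record carries a single scalar label.
    """
    import mxnet as mx  # imported lazily so summarize_labels works without mxnet

    reader = mx.recordio.MXRecordIO(rec_path, "r")
    labels = []
    while len(labels) < limit:
        item = reader.read()
        if item is None:  # end of file
            break
        header, _ = mx.recordio.unpack(item)  # header holds the label, rest is image bytes
        labels.append(float(header.label))
    reader.close()
    return labels


def summarize_labels(labels, num_classes):
    """Basic sanity checks on a sequence of labels, in record order."""
    counts = Counter(int(l) for l in labels)
    # Labels outside [0, num_classes) cannot match the network's output layer.
    out_of_range = sorted(c for c in counts if c < 0 or c >= num_classes)
    # Labels should be whole numbers for a classification task.
    non_integer = sum(1 for l in labels if float(l) != int(l))
    # Count how often the label changes between consecutive records.
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return {
        "num_records": len(labels),
        "distinct_labels": len(counts),
        "out_of_range": out_of_range,
        "non_integer": non_integer,
        "label_changes": changes,
        # True when every class occupies one contiguous run, i.e. the
        # records were written class by class without shuffling.
        "grouped_by_class": len(labels) > 0 and changes + 1 == len(counts),
    }
```

A call might look like `summarize_labels(read_rec_labels("train.rec"), num_classes=10000)`, where the file name is hypothetical. Out-of-range labels, or far fewer distinct labels than expected, would point at the list file passed to im2rec. Records grouped by class mean each batch may contain a single class unless the iterator shuffles, which can also stall training. This only checks labels; decoding a few records with `mx.recordio.unpack_img` (which needs OpenCV) and viewing them would check the images as well.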

Looking carefully at the training output, it is curious that every epoch reports exactly the same Train-accuracy and Validation-accuracy.

INFO:root:Epoch[1] Batch [100] Speed: 1249.03 samples/sec Train-accuracy=0.000031
INFO:root:Epoch[1] Batch [200] Speed: 1225.23 samples/sec Train-accuracy=0.000078
INFO:root:Epoch[1] Batch [300] Speed: 1221.93 samples/sec Train-accuracy=0.000039
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=366.774
INFO:root:Saved checkpoint to "./model/msimg-0002.params"
INFO:root:Epoch[1] Validation-accuracy=0.000059
INFO:root:Epoch[2] Batch [100] Speed: 1248.00 samples/sec Train-accuracy=0.000031
INFO:root:Epoch[2] Batch [200] Speed: 1250.36 samples/sec Train-accuracy=0.000078
INFO:root:Epoch[2] Batch [300] Speed: 1257.64 samples/sec Train-accuracy=0.000039
INFO:root:Epoch[2] Resetting Data Iterator
INFO:root:Epoch[2] Time cost=361.297
INFO:root:Saved checkpoint to "./model/msimg-0003.params"
INFO:root:Epoch[2] Validation-accuracy=0.000059
INFO:root:Epoch[3] Batch [100] Speed: 1240.24 samples/sec Train-accuracy=0.000031
INFO:root:Epoch[3] Batch [200] Speed: 1232.99 samples/sec Train-accuracy=0.000078
INFO:root:Epoch[3] Batch [300] Speed: 1230.92 samples/sec Train-accuracy=0.000039
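The repetition in this log can be confirmed mechanically instead of by eye. The sketch below parses the `Epoch[...] Batch [...] ... Train-accuracy=...` lines and reports whether every epoch logged identical values. The function names and the regular expression are illustrative assumptions, written against the log format shown here.

```python
import re
from collections import defaultdict

# Matches per-batch training lines such as:
# INFO:root:Epoch[1] Batch [100] Speed: 1249.03 samples/sec Train-accuracy=0.000031
LOG_LINE = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)\].*?Train-accuracy=([0-9.]+)"
)


def accuracy_by_epoch(log_text):
    """Return {epoch: {batch: train_accuracy}} parsed from the log text."""
    per_epoch = defaultdict(dict)
    for epoch, batch, acc in LOG_LINE.findall(log_text):
        per_epoch[int(epoch)][int(batch)] = float(acc)
    return dict(per_epoch)


def epochs_repeat(per_epoch):
    """True if at least two epochs were logged and all of them are identical."""
    runs = list(per_epoch.values())
    return len(runs) > 1 and all(run == runs[0] for run in runs[1:])
```

If `epochs_repeat` returns True, as it would for the output in this issue, the model is producing the same predictions on the same batches in every epoch. That is consistent with weights that are effectively not being updated (for example a learning rate that is far too small, or outputs that have saturated or become NaN), though the log alone cannot tell these causes apart.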

dushoufu commented 7 years ago

The clean list of MS-Celeb-1M has around 500,000,000 images and 10,000 classes. With data at this scale, I am having trouble finding a learning rate that lets training converge.
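For reference when tuning the learning rate, it may help to know what an untrained classifier should score. The sketch below assumes balanced classes and the 10,000 classes quoted above; `chance_baseline` is a hypothetical helper name. It returns the accuracy of uniform random guessing and the matching cross-entropy loss.

```python
import math


def chance_baseline(num_classes):
    """Accuracy and cross-entropy loss of uniform random guessing."""
    accuracy = 1.0 / num_classes
    loss = math.log(num_classes)  # -log(1 / num_classes)
    return accuracy, loss


chance_acc, chance_loss = chance_baseline(10000)

# Train-accuracy values logged earlier in this issue.
logged = [0.000031, 0.000078, 0.000039]
below_chance = all(acc < chance_acc for acc in logged)
```

With 10,000 classes the chance accuracy is 1/10,000 = 0.0001, so the logged values of 0.000031 to 0.000078 are no better than guessing. Under these assumptions the network has not started learning at all, which is a different symptom from a learning rate that is merely slow to converge.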

szha commented 7 years ago

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!