apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars, 6.79k forks

Newbie Q: Slow Training #2310

Closed bdlacree closed 7 years ago

bdlacree commented 8 years ago

Hey folks, sorry if this isn't the best place to ask, since this is a question rather than a problem with {mxnet}. I'm using {mxnet} to fit a feed-forward network for text classification, where I'm categorizing documents into one of 10 classes.

When I fit a one- or two-layer network, learning seems to happen pretty quickly. Accuracy might stay at 10% for a few iterations, but then it picks up. When I try fitting a deeper network, the accuracy might stay at about 10% for many iterations (sometimes nearly 200) before it starts to increase steadily. I've tried varying the learning rate from 0.5 down to 1e-4, but that doesn't seem to make learning start any sooner.

This seems abnormal, but I don't know what I'm missing. Is there a particular parameter that I'm not thinking to adjust? Is this a common 'noob' problem when fitting a feed-forward network?

As an aside: when I fit the network using 10 nodes on my last layer (because I have a 10-category target), my results converge to predicting a single category for every document, and learning never goes past that. When I specify 11 nodes, learning happens. Is this normal?

Thanks again!
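
One possibility for the 10-versus-11 output node behavior in the aside above, not confirmed anywhere in this thread, is label coding: mxnet's softmax output treats labels as zero-based class indices, so a 10-unit output layer expects labels 0 through 9. If the labels were coded 1 through 10, the label 10 would fall outside a 10-unit layer but inside an 11-unit one. The sketch below is only a sanity check under that assumption; `to_zero_based` and the sample labels are hypothetical and not part of mxnet.

```python
def to_zero_based(labels, n_classes):
    """Return labels as class indices in 0..n_classes-1.

    Assumption: if the largest label equals n_classes and the smallest is
    at least 1, the labels are 1-based and are shifted down by one.
    """
    if max(labels) == n_classes and min(labels) >= 1:
        labels = [y - 1 for y in labels]
    # After any shift, every label must index a valid output unit.
    if min(labels) < 0 or max(labels) >= n_classes:
        raise ValueError("labels must lie in 0..n_classes-1")
    return labels


labels = [1, 3, 10, 7, 10, 2]          # hypothetical labels coded 1..10
shifted = to_zero_based(labels, 10)    # safe to use with a 10-unit output layer
print(shifted)
```

If the labels already run from 0 to 9, the function returns them unchanged, and the cause lies elsewhere.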

bdlacree commented 8 years ago

Also, just to add: the collapse to a single predicted class happens any time I pass data with imbalanced classes to the network, even if the imbalance is small (say the largest class has 1000 training cases while the smallest has 980).

But the 'aside' issue above happens whether the input data are perfectly balanced or not. Again, I'm not sure if this is normal or if it means I'm doing something foolish. Thanks.

bdlacree commented 8 years ago

In case it's helpful to anyone: it seems that the default weight initialization was just too small for my application. Initializing with the sqrt(3 / number of hidden nodes) rule got things moving quite a bit faster.
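
To illustrate why the initialization scale can matter more as depth grows, here is a minimal pure-Python sketch, not mxnet code. It assumes the rule means drawing each weight from a uniform distribution on [-a, a] with a = sqrt(3 / n_hidden), which gives each weight variance a^2 / 3 = 1 / n_hidden. The layer width, the depth, the purely linear layers, and the contrasting scale of 0.01 are all illustrative assumptions rather than values taken from this thread.

```python
import math
import random


def uniform_matrix(n_in, n_out, bound, rng):
    # Each weight is drawn independently from U(-bound, bound).
    return [[rng.uniform(-bound, bound) for _ in range(n_out)] for _ in range(n_in)]


def rms(vec):
    # Root-mean-square magnitude of a vector.
    return math.sqrt(sum(v * v for v in vec) / len(vec))


def forward_rms(n_hidden, n_layers, bound, rng):
    # Push a random input through n_layers linear layers (no activation,
    # to keep the sketch simple) and return the RMS of the final output.
    x = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    for _ in range(n_layers):
        w = uniform_matrix(n_hidden, n_hidden, bound, rng)
        x = [sum(x[i] * w[i][j] for i in range(n_hidden)) for j in range(n_hidden)]
    return rms(x)


n_hidden, n_layers = 100, 5               # illustrative sizes
rule_bound = math.sqrt(3.0 / n_hidden)    # the sqrt(3 / n_hidden) rule
small_bound = 0.01                        # hypothetical "too small" scale, for contrast

rms_rule = forward_rms(n_hidden, n_layers, rule_bound, random.Random(0))
rms_small = forward_rms(n_hidden, n_layers, small_bound, random.Random(0))
print("bound", rule_bound, "-> final RMS", rms_rule)
print("bound", small_bound, "-> final RMS", rms_small)
```

With the rule, the signal keeps roughly its input magnitude from layer to layer. With the much smaller bound it shrinks by a large factor at every layer, which is consistent with deeper networks sitting at chance accuracy for many iterations before learning takes off.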

phunterlau commented 7 years ago

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!