TimDettmers / sparse_learning

Sparse learning library and sparse momentum resources.
MIT License

Real-life usage / practical applicability #11

Open snakers4 opened 4 years ago

snakers4 commented 4 years ago

Hi Tim!

Many thanks for this awesome repo and your paper. It is always cool when someone tries to make DL actually useful, accessible, and more efficient!

We are building an open dataset and a set of STT / TTS models for the Russian language. You can see some of our published work here.

A quick recap of our findings in this field, to provide some context for why I am asking my question (bear with me for a moment):

Obviously, your paper is very different in technical approach, but very similar in spirit to what we have done.

You also report these results (obviously, we are more interested in ImageNet results):

[images: results tables from the paper]

Now, a couple of questions (maybe I missed it in the paper):

Many thanks for your feedback!

wuzhiyang2016 commented 4 years ago

The paper has been updated and the table data has changed.

snakers4 commented 4 years ago

Do you mean these?

[images: updated results tables from the paper]

Since you do not report speed-ups on ImageNet, does that mean it actually takes much longer to train such a sparse network on ImageNet?

yuanyuanli85 commented 4 years ago

In my understanding, this training speedup is a projection based on network sparsity, not a measurement from an actual run with existing frameworks and hardware. To achieve those speedups, you need special hardware or software that can take advantage of sparsity. There should not be a big difference between training the dense and sparse networks in this repo.
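
For context, a rough sketch of how such a speedup can be projected from per-layer sparsity alone, without timing anything (this is a hypothetical helper for illustration, not code from this repo or the paper):

```python
# Hypothetical sketch: project a theoretical speedup by scaling each
# layer's dense FLOPs by its fraction of non-zero weights, then comparing
# total dense FLOPs to total sparse FLOPs.

def projected_speedup(layer_flops, layer_density):
    """layer_flops: dense FLOPs per layer.
    layer_density: fraction of non-zero weights per layer (1.0 = dense)."""
    dense_total = sum(layer_flops)
    sparse_total = sum(f * d for f, d in zip(layer_flops, layer_density))
    return dense_total / sparse_total

# Made-up three-layer network kept at 10% density everywhere -> ~10x
print(projected_speedup([1e9, 2e9, 0.5e9], [0.1, 0.1, 0.1]))
```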

TimDettmers commented 4 years ago

The convergence rate is approximately the same for sparse and dense networks. What I saw is that the networks react a bit differently to certain learning rates: you can run sparse networks with slightly higher learning rates, and this is something not explored in the paper. I kept the learning rates the same so as not to give the sparse network an unfair advantage.

The speedups on ImageNet should be a bit larger; in general, for larger datasets and networks I see an increase in speedups. What @yuanyuanli85 says is correct. To realize these speedups you need specialized software and probably also specialized hardware (like Graphcore or Cerebras processors).
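
As a rough illustration of why the projected speedups do not show up in wall-clock time with standard kernels (plain PyTorch, not code from this repo), you can compare a dense matmul against torch.sparse.mm at ~90% weight sparsity; on commodity hardware the sparse version is usually not faster at these sparsity levels:

```python
import time
import torch

n = 4096
dense_w = torch.rand(n, n)
mask = (torch.rand(n, n) > 0.9).float()   # keep roughly 10% of the weights
sparse_w = (dense_w * mask).to_sparse()   # COO sparse tensor
x = torch.rand(n, 256)

t0 = time.time()
_ = dense_w @ x                           # dense matmul
t1 = time.time()
_ = torch.sparse.mm(sparse_w, x)          # sparse-dense matmul
t2 = time.time()

print(f"dense: {t1 - t0:.4f}s  sparse: {t2 - t1:.4f}s")
```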