TimDettmers / sparse_learning

Sparse learning library and sparse momentum resources.
MIT License

Real-life usage / practical applicability #11

Open snakers4 opened 4 years ago

snakers4 commented 4 years ago

Hi Tim!

Many thanks for this awesome repo and your paper. It is always cool when someone tries to make DL actually useful, accessible, and more efficient!

We are building an open dataset and a set of STT / TTS models for the Russian language. You can see some of our published work here.

A quick recap of our findings in this field, to provide some context for why I am asking my question (bear with me for a moment):

Obviously, your paper is very different in technical approach, but very similar in spirit to what we have done.

You also report these results (obviously, we are more interested in ImageNet results):

[images: results tables from the paper]

Now, a couple of questions (maybe I missed it in the paper):

Many thanks for your feedback!

wuzhiyang2016 commented 4 years ago

The paper has been updated and the table data has changed.

snakers4 commented 4 years ago

Do you mean these?

[images: updated results tables from the paper]

Since you do not report speed-ups on ImageNet, does that mean it actually takes much longer to train such a sparse network on ImageNet?

yuanyuanli85 commented 4 years ago

In my understanding, this training speedup is a projection based on network sparsity, not a measurement from an actual run with existing frameworks and hardware. To achieve those speedups, you need special hardware or software that can take advantage of sparsity. There should not be a big difference between training the dense and sparse networks in this repo.
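
For context, a rough sketch of how such a speedup can be projected from per-layer sparsity alone, without timing anything (this is a hypothetical helper for illustration, not code from this repo or the paper):

```python
# Hypothetical sketch: project a theoretical speedup by scaling each
# layer's dense FLOPs by its fraction of non-zero weights, then comparing
# total dense FLOPs to total sparse FLOPs.

def projected_speedup(layer_flops, layer_density):
    """layer_flops: dense FLOPs per layer.
    layer_density: fraction of non-zero weights per layer (1.0 = dense)."""
    dense_total = sum(layer_flops)
    sparse_total = sum(f * d for f, d in zip(layer_flops, layer_density))
    return dense_total / sparse_total

# Made-up three-layer network kept at 10% density everywhere -> ~10x
print(projected_speedup([1e9, 2e9, 0.5e9], [0.1, 0.1, 0.1]))
```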

TimDettmers commented 4 years ago

The convergence rate is approximately the same for sparse and dense networks. What I saw is that the networks react a bit differently to certain learning rates: you can run sparse networks with slightly higher learning rates, and this is something not explored in the paper. I kept the learning rates the same so as not to give the sparse network an unfair advantage.

The speedups on ImageNet should be a bit larger; in general, for larger datasets and networks I see an increase in speedups. What @yuanyuanli85 says is correct. To realize these speedups you need specialized software and probably also specialized hardware (like Graphcore or Cerebras processors).
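
As a rough illustration of why the projected speedups do not show up in wall-clock time with standard kernels (plain PyTorch, not code from this repo), you can compare a dense matmul against torch.sparse.mm at ~90% weight sparsity; on commodity hardware the sparse version is usually not faster at these sparsity levels:

```python
import time
import torch

n = 4096
dense_w = torch.rand(n, n)
mask = (torch.rand(n, n) > 0.9).float()   # keep roughly 10% of the weights
sparse_w = (dense_w * mask).to_sparse()   # COO sparse tensor
x = torch.rand(n, 256)

t0 = time.time()
_ = dense_w @ x                           # dense matmul
t1 = time.time()
_ = torch.sparse.mm(sparse_w, x)          # sparse-dense matmul
t2 = time.time()

print(f"dense: {t1 - t0:.4f}s  sparse: {t2 - t1:.4f}s")
```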