deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Does Deepgram support multi-GPUs? Thanks! #24

Closed YinJerry closed 7 years ago

YinJerry commented 7 years ago

It seems to take a long time to train a model such as the deepspeech example with one GPU. I wonder whether I can add more GPUs to accelerate it; I could not find any description of this anywhere. Thank you!

scottstephenson commented 7 years ago

You've found the frontier of Deep Learning! Training on multiple GPUs is not as easy or simple as it seems it should be. Solving that problem generally is still an active area of research for the entire Deep Learning crowd (Google, Facebook, Baidu, Microsoft, etc. and Deepgram too).

You can use Kur to train really sophisticated models even without multiple GPUs though. In many cases using a GPU with Kur is 10-50x faster than using a CPU.

YinJerry commented 7 years ago

Thank you very much for your fast reply! :) I am reading the paper "Deep Speech: Scaling up end-to-end speech recognition", which says: "Our approach is enabled particularly by multi-GPU training and by data collection and synthesis strategies to build large training sets exhibiting the distortions our system must handle (such as background noise and Lombard effect)." I am not sure whether this deepgram project has any relation to those Deep Speech experiments. Anyway, I am very excited that I can see good training progress after I switched to a 100-hour training set instead of the default sample. :)

scottstephenson commented 7 years ago

The Baidu SVAIL team is very good at making multi-GPU training work for speech!

Deepgram is an A.I. company in Silicon Valley that works on speech. We have a lot in common, but we're not part of Baidu. We did replicate their Deep Speech model based on the paper (https://arxiv.org/abs/1412.5567), making it easy for people to use DS themselves.

ajsyp commented 7 years ago

I want to point out that Kur now supports multi-GPU training with the TensorFlow backend. To enable it, set the parallel option in your Kurfile:

settings:
  backend:
    name: keras
    backend: tensorflow
    parallel: N

Replace N with the number of GPUs you want to use. The batch_size will be interpreted as a global batch size.
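Concretely, a global batch size means each GPU processes batch_size / N samples per step. The even split sketched below is an assumption about how the batch is divided, not taken from Kur's source:

```python
def per_gpu_batch(global_batch_size, num_gpus):
    """Hypothetical helper: split a global batch size evenly across GPUs."""
    if global_batch_size % num_gpus:
        raise ValueError("global batch size should be divisible by the GPU count")
    return global_batch_size // num_gpus

# With parallel: 2 and batch_size: 32, each GPU would see 16 samples per step.
print(per_gpu_batch(32, 2))  # → 16
```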

YinJerry commented 7 years ago

Great news! Thanks! I'll try that when I get another GPU.

YinJerry commented 7 years ago

When I added "parallel: 2" to my own yml file based on your suggestion (I've installed the second GPU successfully), I got the following error. Please help! Thank you!

Traceback (most recent call last):
  File "/usr/local/bin/kur", line 11, in <module>
    load_entry_point('kur==0.3.0', 'console_scripts', 'kur')()
  File "/usr/local/lib/python3.4/dist-packages/kur/main.py", line 269, in main
    sys.exit(args.func(args) or 0)
  File "/usr/local/lib/python3.4/dist-packages/kur/main.py", line 48, in train
    func = spec.get_training_function()
  File "/usr/local/lib/python3.4/dist-packages/kur/kurfile.py", line 282, in get_training_function
    model = self.get_model(provider)
  File "/usr/local/lib/python3.4/dist-packages/kur/kurfile.py", line 143, in get_model
    backend=self.get_backend(),
  File "/usr/local/lib/python3.4/dist-packages/kur/kurfile.py", line 170, in get_backend
    (self.data.get('settings') or {}).get('backend')
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/backend.py", line 188, in from_specification
    result = target(*params)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 77, in __init__
    super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parallel'

YinJerry commented 7 years ago

If I do not add "parallel: 2", I can see the second GPU with 95% memory used but 0% GPU utilization. At the same time, the first GPU's utilization is about 30%-70%, with a similar 95% memory use.

antho-rousseau commented 7 years ago

Hi @YinJerry, 1) are you sure you're using the latest git master of kur? 2) Full memory usage on all GPUs comes from TensorFlow; that's its default behaviour. You can use CUDA_VISIBLE_DEVICES=0, for instance, to limit TF's GPU visibility to the first one.
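The same restriction can be applied from Python instead of the shell, as long as the variable is set before TensorFlow is first imported. CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not anything Kur-specific:

```python
import os

# Must be set before TensorFlow is first imported, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

# An empty string would hide all GPUs and force CPU-only execution:
# os.environ["CUDA_VISIBLE_DEVICES"] = ""
```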

YinJerry commented 7 years ago

I got a version 4 days ago, when I got your message that kur had started to support multi-GPU, but I can try the latest version today. Thanks!

ajsyp commented 7 years ago

If it helps, I pushed a package to PyPI today (v0.4.0rc0) which also contains this.

YinJerry commented 7 years ago

I have tried the latest version and it supports my two GPUs very well! Thanks a lot! :-) One question: since I downloaded a new kur version, how can I reuse the previously trained model? Where is the trained model stored? Is it the files in the "log" subfolder of the examples folder?

ajsyp commented 7 years ago

The model weights are specified by the weights: entries in the Kurfile. They should continue to work for your current, multi-GPU models (but be sure to back them up just in case). Note that the weights are stored in sub-directories; these directories may look empty, but there may be hidden files (so do an ls -a to be sure).
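Because of those hidden files, backing up a weights directory with a shell glob like `cp weights_dir/*` can silently skip them; a recursive directory copy picks up dotfiles too. A minimal sketch (the directory names in the usage comment are placeholders, not Kur's actual paths):

```python
import shutil

def backup_weights(weights_dir, backup_dir):
    """Copy a weights directory, including hidden dot-prefixed files
    that shell globs like `cp weights_dir/*` would miss."""
    shutil.copytree(weights_dir, backup_dir)

# e.g. backup_weights("my_model.w", "backup/my_model.w")  # placeholder names
```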

YinJerry commented 7 years ago

I copied all the files in both the weights (including hidden files) and log sub-directories into the new project folder, and it continues to work very well. Perfect! Thank you very much for your quick support! :)

YinJerry commented 7 years ago

During training, I hit an ERROR. Have you seen this before, and what could the problem be? A training data problem?

[ERROR 2017-03-16 22:28:55,510 kur.model.executor:236] Exception raised during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
  [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]

scottstephenson commented 7 years ago

Are any of your audio transcripts an empty string ""? The CTC loss function doesn't support empty strings (yet, at least).
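A quick way to check is to scan the transcript metadata for empty labels. The sketch below assumes a JSON-lines file with "text" and "uuid" fields; that layout and those field names are assumptions, so adjust them to match your data:

```python
import json

def find_empty_transcripts(jsonl_path):
    """Return (line number, uuid) for every entry whose transcript is
    empty or whitespace-only; field names are assumed, not Kur's schema."""
    empty = []
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f, 1):
            entry = json.loads(line)
            if not entry.get("text", "").strip():
                empty.append((line_no, entry.get("uuid")))
    return empty
```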

YinJerry commented 7 years ago

Probably! I'll check that. Thank you!:)

YinJerry commented 7 years ago

I've checked: there is no empty string in the audio transcripts. Is there any other possible reason? Or is there any debug logging I can enable to see which transcript (such as its uuid) triggered this error?

ajsyp commented 7 years ago

I can add some additional logging pretty trivially; this would be good debug info to have. But we're getting off topic from the original multi-GPU issue: can you please open a fresh issue to address this? We can move the conversation there.