ai4os / DEEPaaS

A REST API to serve machine learning and deep learning models
https://deepaas.readthedocs.io
Apache License 2.0

predict-train or train-predict fails with tensorflow and GPU #87

Closed vykozlov closed 4 years ago

vykozlov commented 4 years ago

Description

Running predict followed by train, or train followed by predict, fails when using DEEPaaS with TensorFlow on a GPU. Running predict repeatedly, or train repeatedly, works.

Steps to Reproduce

  1. If after deploying the container I start only predict, it works and I can repeat it.
  2. If I start only training after deployment, it works. I can also repeat it.
  3. However, if I first start predict and then train or vice versa, it fails.

Expected behavior

The functions should work regardless of the order in which they are executed.

Actual behavior

The predict-train and train-predict sequences fail; this could be TensorFlow-specific. The reason seems to be that predict and train run as two separate Linux processes. The process started first occupies the GPU memory, and the second one simply does not have enough GPU memory left to perform its task.
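
If the diagnosis above is right, one possible workaround is to run each call in a short-lived child process, so that whatever GPU memory it grabbed is returned to the driver when the child exits. The following is only a minimal sketch of that idea, not how DEEPaaS is implemented and not a confirmed fix for this issue. The helper `run_isolated` and the two snippets are hypothetical stand-ins for the real predict and train code, and nothing here touches TensorFlow or the GPU.

```python
import json
import subprocess
import sys


def run_isolated(code, payload):
    """Run a Python snippet in a fresh interpreter and return its result.

    The snippet reads its input from the variable `payload` and must assign
    its output to `result`. Because the snippet runs in its own process, any
    resources it acquires (for a real model: GPU memory) are released when
    that process exits.
    """
    wrapper = (
        "import json, sys\n"
        "payload = json.loads(sys.argv[1])\n"
        + code + "\n"
        "print(json.dumps(result))\n"
    )
    done = subprocess.run(
        [sys.executable, "-c", wrapper, json.dumps(payload)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    # The last line of stdout is the JSON-encoded result.
    return json.loads(done.stdout.strip().splitlines()[-1])


# Hypothetical stand-ins for the model's predict and train entry points.
PREDICT_SNIPPET = "result = [x * 2 for x in payload]"
TRAIN_SNIPPET = "result = {'epochs': payload['epochs'], 'status': 'done'}"

if __name__ == "__main__":
    # predict followed by train: each call gets its own process.
    print(run_isolated(PREDICT_SNIPPET, [1, 2, 3]))
    print(run_isolated(TRAIN_SNIPPET, {"epochs": 2}))
```

The trade-off of this approach is that the model would have to be loaded again on every call, since nothing stays resident between processes.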

Versions

  - DEEPaaS 1.0.1 and 1.2.0
  - TensorFlow 1.12.0 and 1.14.0
  - Nvidia driver 418.56 on one site and 440.33.01 on another