ai4os / DEEPaaS

A REST API to serve machine learning and deep learning models
https://deepaas.readthedocs.io
Apache License 2.0

unify train and predict pools #86

Closed IgnacioHeredia closed 4 years ago

IgnacioHeredia commented 4 years ago

This fixes GPU out-of-memory problems that occurred when we had two separate pools (one for predict, one for train). When we ran train and then predict sequentially (or vice versa), each pool tried to claim the whole GPU, so out-of-memory errors occurred. This won't fix out-of-memory errors when running parallel tasks on the GPU (those errors also happened before this change).
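
As a rough illustration of the idea (not the actual DEEPaaS code), the sketch below routes both train and predict through one shared executor with a single worker, so the two kinds of task are served by the same worker instead of two competing ones. The names `SHARED_POOL`, `train`, `predict` and `run_train_then_predict` are hypothetical, and a thread stands in for the worker process only to keep the example portable.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One executor shared by train and predict (hypothetical name).
# A single worker means both task types are served by the same worker.
SHARED_POOL = ThreadPoolExecutor(max_workers=1)


def train():
    # Placeholder for a training job; reports which worker ran it.
    return ("train", threading.get_ident())


def predict():
    # Placeholder for a prediction job; reports which worker ran it.
    return ("predict", threading.get_ident())


def run_train_then_predict():
    # Both submissions go to the same pool, so they run one after the
    # other on the same worker rather than on two independent pools.
    train_future = SHARED_POOL.submit(train)
    predict_future = SHARED_POOL.submit(predict)
    return train_future.result(), predict_future.result()
```

Calling `run_train_then_predict()` returns the two results along with the identifier of the worker that ran each task; with a single shared worker the two identifiers match.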

CPU deployments shouldn't be affected.

This has been tested with the image classification package on TensorFlow 1.14 with a GPU (GeForce GTX 1080). Summary of results:

Additional tests on CPU:

Stifo commented 4 years ago

Hi @IgnacioHeredia, I tried this DEEPaaS branch and got this error with MODS running TF2.0.1:

2020-03-06 14:12:05.545 154 INFO deepaas.api [-] Serving loaded V2 models: ['mods']
2020-03-06 14:12:05.546 154 CRITICAL deepaas [-] Unhandled error: AttributeError: 'CancellablePool' object has no attribute 'submit'
2020-03-06 14:12:05.546 154 ERROR deepaas Traceback (most recent call last):
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/bin/deepaas-run", line 8, in <module>
2020-03-06 14:12:05.546 154 ERROR deepaas     sys.exit(main())
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/deepaas/cmd/run.py", line 118, in main
2020-03-06 14:12:05.546 154 ERROR deepaas     port=CONF.listen_port,
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/aiohttp/web.py", line 433, in run_app
2020-03-06 14:12:05.546 154 ERROR deepaas     reuse_port=reuse_port))
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
2020-03-06 14:12:05.546 154 ERROR deepaas     return future.result()
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/aiohttp/web.py", line 296, in _run_app
2020-03-06 14:12:05.546 154 ERROR deepaas     app = await app  # type: ignore
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/deepaas/api/__init__.py", line 101, in get_app
2020-03-06 14:12:05.546 154 ERROR deepaas     await m.warm()
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/deepaas/model/v2/wrapper.py", line 233, in warm
2020-03-06 14:12:05.546 154 ERROR deepaas     fs = [run(executor, func) for i in range(0, n)]
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/local/lib/python3.6/dist-packages/deepaas/model/v2/wrapper.py", line 233, in <listcomp>
2020-03-06 14:12:05.546 154 ERROR deepaas     fs = [run(executor, func) for i in range(0, n)]
2020-03-06 14:12:05.546 154 ERROR deepaas   File "/usr/lib/python3.6/asyncio/base_events.py", line 655, in run_in_executor
2020-03-06 14:12:05.546 154 ERROR deepaas     return futures.wrap_future(executor.submit(func, *args), loop=self)
2020-03-06 14:12:05.546 154 ERROR deepaas AttributeError: 'CancellablePool' object has no attribute 'submit'
2020-03-06 14:12:05.546 154 ERROR deepaas
IgnacioHeredia commented 4 years ago

Hi @Stifo, it looks like something related to the warm method (which I didn't test). I'll fix this on Monday. Thanks :)
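
For context on the traceback above: asyncio's `loop.run_in_executor` calls `executor.submit(func, *args)` on whatever object it is given, so a custom pool without a `submit` method fails with this `AttributeError`. The sketch below reproduces that behaviour with two hypothetical classes, `PoolWithoutSubmit` and `PoolWithSubmit`; they are illustrative stand-ins, not the real `CancellablePool`, and the synchronous `submit` shown here is only an assumption about one way to satisfy the interface, not the fix that was applied.

```python
import asyncio
from concurrent.futures import Future


class PoolWithoutSubmit:
    """Hypothetical custom pool exposing only apply(), not submit()."""

    def apply(self, func, *args):
        return func(*args)


class PoolWithSubmit(PoolWithoutSubmit):
    """Same pool plus a submit() that returns a concurrent.futures.Future."""

    def submit(self, func, *args):
        future = Future()
        try:
            # Runs synchronously for simplicity; a real pool would
            # hand the call to a worker.
            future.set_result(self.apply(func, *args))
        except Exception as exc:
            future.set_exception(exc)
        return future


async def warm_with(pool):
    loop = asyncio.get_running_loop()
    # run_in_executor calls pool.submit(func, *args) internally.
    return await loop.run_in_executor(pool, sum, [1, 2, 3])


def try_warm(pool):
    # Returns the result, or the error message if the pool lacks submit().
    try:
        return asyncio.run(warm_with(pool))
    except AttributeError as exc:
        return str(exc)
```

With these definitions, `try_warm(PoolWithoutSubmit())` returns the attribute-error message, while `try_warm(PoolWithSubmit())` returns the computed sum.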

IgnacioHeredia commented 4 years ago

Hi @Stifo, it should be fixed now. Can you confirm that it is working and that you can, for example, run a predict and then a train?

Stifo commented 4 years ago

Hello @IgnacioHeredia, I apologize for the late response. I've just tested commit c272428ae5db08170c539b6fc77af0b9c4f7bfe1 and it worked. I was able to train a model using the GPU and then make several predictions with the newly trained model.

IgnacioHeredia commented 4 years ago

Thanks @Stifo, that's great!