ShannonAI / service-streamer

Boosting your Web Services of Deep Learning Applications.
Apache License 2.0
1.22k stars 187 forks

future_cache is not released correctly after a timeout #61

Open wykdg opened 4 years ago

wykdg commented 4 years ago

```python
class Future(object):
    def __init__(self, task_id, task_size, future_cache_ref):
        self._id = task_id
        self._size = task_size
        self._future_cache_ref = future_cache_ref
        self._outputs = []
        self._finish_event = threading.Event()

    def result(self, timeout=None):
        if self._size == 0:
            self._finish_event.set()
            return []
        finished = self._finish_event.wait(timeout)

        if not finished:
            raise TimeoutError("Task: %d Timeout" % self._id)
```

This is the part I mean:

```python
        # remove from future_cache
        future_cache = self._future_cache_ref()
        if future_cache is not None:
            del future_cache[self._id]
```

Here, if the wait times out, the method raises and returns early, so the deletion from future_cache below never runs, and future_cache[self._id] stays in the cache forever. Am I reading this right?

Also, a design question: would it be better to do `del future_cache[self._id]` inside BaseStreamer.output instead, so a weak reference wouldn't need to be passed around? Any guidance would be appreciated, thanks.
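For what it's worth, one way to fix the leak described above is to move the cache cleanup into a `finally` block, so it runs on timeout as well as on success. Below is a minimal sketch against a simplified stand-in for the library's `Future`, not the project's actual patch; the `FutureCache` dict subclass reflects the fact that plain dicts cannot be weak-referenced:

```python
import threading
import weakref


class FutureCache(dict):
    """dict subclass: plain dicts cannot be the target of a weakref."""


class Future:
    """Simplified stand-in for service_streamer's Future (illustration only)."""

    def __init__(self, task_id, task_size, future_cache_ref):
        self._id = task_id
        self._size = task_size
        self._future_cache_ref = future_cache_ref
        self._outputs = []
        self._finish_event = threading.Event()

    def result(self, timeout=None):
        if self._size == 0:
            self._finish_event.set()
            return []
        try:
            finished = self._finish_event.wait(timeout)
            if not finished:
                raise TimeoutError("Task: %d Timeout" % self._id)
            return self._outputs
        finally:
            # Runs on success AND on timeout, so the cache entry never leaks.
            future_cache = self._future_cache_ref()
            if future_cache is not None:
                future_cache.pop(self._id, None)
```

With this change a timed-out task still raises, but its entry is removed from the cache, so the server does not accumulate dead futures.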

xxbidiao commented 4 years ago

Could you describe the problem in more detail? I seem to be hitting something similar: a single timeout locks up the entire API, and the server has to be restarted.

caseware66 commented 4 years ago

Hi, with 2 GPUs and 8 workers, 100 concurrent requests sometimes make the service hang, with no error in the backend. When it hangs, it usually looks like a full batch has already been collected; not sure whether that matches your problem. Also, once it hangs the service must be restarted before it will accept requests again, i.e. the API is locked up.

duxiaochao commented 4 years ago

I ran into this too: passing in too much data at once hangs the service, and it has to be restarted before it can process anything again.

rubby33 commented 4 years ago

I also hit this problem. The full error log is:

```
[2020-06-16 15:35:14,884] ERROR in app: Exception on /sentence_type2 [GET]
Traceback (most recent call last):
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "service_classification_stream.py", line 69, in predict_sentence_type2
    labels = streamer_mid.predict([sentence])
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/service_streamer/service_streamer.py", line 132, in predict
    ret = self._output(task_id)
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/service_streamer/service_streamer.py", line 122, in _output
    batch_result = future.result(WORKER_TIMEOUT)
  File "/data/jiangwei/anaconda3/envs/py3.7/lib/python3.7/site-packages/service_streamer/service_streamer.py", line 41, in result
    raise TimeoutError("Task: %d Timeout" % self._id)
TimeoutError: Task: 105 Timeout
```

duxiaochao commented 4 years ago

> I also hit this problem. The full error log is: `TimeoutError: Task: 105 Timeout` (same traceback as above)

Yours is clearly just a task timeout; setting WORKER_TIMEOUT to a larger value should fix it.

duxiaochao commented 4 years ago

> I ran into this too: passing in too much data at once hangs the service, and it has to be restarted before it can process anything again.

In my case the problem turned out to be a bug in the monkey patch; after removing it and just enabling Flask's built-in multithreading, everything worked.

rubby33 commented 4 years ago

> Yours is clearly just a task timeout; setting WORKER_TIMEOUT to a larger value should fix it.

Thanks for the reply, but it doesn't seem to be that simple. I load-tested with wrk with the timeout set to 2 s: once a single timeout occurs, service_streamer stops responding to all requests (Requests/sec drops to 0.45). With the plain, naive Flask setup it works fine.

```
(py3.7) jiangwei@mk-Z10PE-D16-WS:~$ wrk -t8 -c100 -d20s --latency http://localhost:5005/sentence_type2?sen=%22%E6%88%91%E4%B8%8D%E8%AE%A4%E5%8F%AF%E8%BF%99%E4%B8%AA%E5%9B%BD%E5%AE%B6%22
Running 20s test @ http://localhost:5005/sentence_type2?sen=%22%E6%88%91%E4%B8%8D%E8%AE%A4%E5%8F%AF%E8%BF%99%E4%B8%AA%E5%9B%BD%E5%AE%B6%22
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     1.33      2.31     4.00     66.67%
  Latency Distribution
     50%    0.00us
     75%    0.00us
     90%    0.00us
     99%    0.00us
  9 requests in 20.03s, 1.26KB read
  Socket errors: connect 0, read 0, write 0, timeout 9
Requests/sec:      0.45
Transfer/sec:     64.24B
```

duxiaochao commented 4 years ago

> Thanks for the reply, but it doesn't seem to be that simple. I load-tested with wrk with the timeout set to 2 s: once a single timeout occurs, service_streamer stops responding to all requests (Requests/sec drops to 0.45). With the plain, naive Flask setup it works fine. (full wrk output above)

Then it's probably the same problem as mine. Remove the monkey patch and it should work again. This bug is quite odd; we hit it too and haven't found a good solution yet.

kuangdd commented 2 years ago

Moving monkey.patch_all() below import wsgiserver made the problem go away for me.

Kuzhuahu commented 1 year ago

> Thanks for the reply, but it doesn't seem to be that simple. I load-tested with wrk with the timeout set to 2 s: once a single timeout occurs, service_streamer stops responding to all requests. With the plain, naive Flask setup it works fine. (full wrk output above)

Has your problem been solved? I'm running into the same issue.