**Open** · msameedkhan opened this issue 4 years ago
**At the very start I was getting this error**

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mms/service.py", line 108, in predict
    ret = self._entry_point(input_batch, self.context)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 79, in handle
    result = _service.inference(image)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 41, in inference
    batch_size=50)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 382, in readtext
    add_margin, add_free_list_margin, False)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 305, in detect
    False, self.device)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 111, in get_textbox
    bboxes, polys = test_net(canvas_size, mag_ratio, detector, image, text_threshold, link_threshold, low_text, poly, device)
  File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 37, in test_net
    x = x.to(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 164, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, yo
```

**After adding this line**

```python
torch.multiprocessing.set_start_method('spawn', True)
```

**I'm now getting the following error**

```
Connection accepted: /home/model-server/tmp/.mms.sock.9000.
Backend worker process died
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 241, in <module>
    worker.run_server()
  File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 213, in run_server
    p.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
```

The MMS frontend then reports the worker death:

```
[ERROR] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer at io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
[INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-f6cb4ddf Worker disconnected. WORKER_STARTED
[DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
	at com.amazonaws.ml.mms.wlm.WorkerThread.runWorker(WorkerThread.java:145)
	at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:208)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[WARN ] W-9000-get-text com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: get-text, error: Worker died.
[DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-get-text State change WORKER_STARTED -> WORKER_STOPPED
[INFO ] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-f6cb4ddf in 1 seconds.
```

Any help would be highly appreciated. Thanks
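The spawn start method has to pickle the service object before it can start the worker process, and anything holding a live module reference (a preloaded model service typically does) raises exactly this `TypeError: can't pickle module objects`. Rather than forcing spawn, the usual way out is to keep the parent process CUDA-free and only build the EasyOCR reader inside the worker. A minimal sketch, assuming an MMS-style `service.py`; the class name, language list, and request parsing here are illustrative, not the actual handler from this deployment:

```python
# service.py (sketch) -- defer all CUDA work until the worker process runs
import easyocr


class OCRService(object):
    def __init__(self):
        # No GPU work here: this may execute in the parent before fork/spawn.
        self.reader = None

    def initialize(self, context):
        # Runs inside the worker process, so CUDA is first touched here.
        self.reader = easyocr.Reader(['en'], gpu=True)

    def inference(self, image):
        return self.reader.readtext(image, batch_size=50)


_service = OCRService()


def handle(data, context):
    # Lazy initialization keeps the parent process CUDA-free.
    if _service.reader is None:
        _service.initialize(context)
    if data is None:
        return None
    image = data[0].get('body') or data[0].get('data')
    return [_service.inference(image)]
```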
With `preload_model=True`, the model is already loaded onto the GPU inside the `model_server_worker.py` server process; `multiprocessing.Process(target=self.start_worker, args=(cl_socket,))` then has the child process sharing the main server process's GPU data. GPU state ends up shared across different processes (and different GPUs), which is why the error is raised.
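If that diagnosis holds, the simplest fix is to stop preloading, so each backend worker loads the model itself and the frontend process never initializes CUDA. A sketch of the relevant `config.properties`, assuming this deployment exposes the MMS `preload_model` option referenced above:

```properties
# config.properties (sketch): let each backend worker load the model
# itself instead of inheriting GPU state from the frontend process.
preload_model=false
default_workers_per_model=1
```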