NRGI / resourcecontracts.org

Resource Contracts
http://resourcecontracts.org
GNU General Public License v2.0

Improve Abbyy OCR SDK connection handling for pdf-processor #1349

Closed: iprunache closed this issue 3 years ago

iprunache commented 3 years ago

Why

OCR processing of uploaded contracts often gets stuck or fails because the pdf-processor service does not always get a response from the Abbyy OCR SDK. Every page of an uploaded contract should be processed.

What

Notes

See discussion started here: https://github.com/NRGI/resourcecontracts.org/issues/1340#issuecomment-755229710

Most of the time, PDF processing gets stuck while trying to retrieve the status of an Abbyy task (no answer is ever received):

Jan 06 10:06:33 77a9730d13bb ecs-rc-admin-master-nrgi-queue.log 2021-01-06 08:06:33,046 Abbyy INFO - Processing /var/www/rc-admin/public/data/4606/pages/97.pdf
Jan 06 10:06:33 77a9730d13bb ecs-rc-admin-master-nrgi-queue.log 2021-01-06 08:06:33,603 Abbyy INFO - Task Id: e3074294-95d2-4c3e-b6dd-935da0bc9ebe, status Queued
Jan 06 10:13:20 77a9730d13bb ecs-rc-admin-master-apache2-access.log 10.0.0.194 - - [06/Jan/2021:08:13:20 +0000] "GET / HTTP/1.1" 200 1893 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
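The hang above (a task reported as Queued, then nothing further from the processor) is the kind of failure a bounded polling loop is meant to prevent. Below is a minimal sketch, not the pdf-processor code: `wait_for_task`, `TaskTimeout`, and the `get_status` callable are hypothetical names, the set of final states is an assumption that should be checked against the Abbyy API documentation, and the sketch targets Python 3 (the tracebacks in this issue come from Python 2.7).

```python
import time

# Assumption: states after which a task will not change again.
# Verify these names against the Abbyy API documentation before relying on them.
FINAL_STATES = {"Completed", "ProcessingFailed", "Deleted", "NotEnoughCredits"}


class TaskTimeout(Exception):
    """Raised when a task does not reach a final state before the deadline."""


def wait_for_task(get_status, task_id, deadline_s=300.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(task_id) until a final state or until deadline_s elapses.

    get_status is a hypothetical callable that returns the status string.
    It should apply its own per-request timeout, otherwise a single
    unanswered request can still block forever.
    """
    give_up_at = clock() + deadline_s
    while True:
        status = get_status(task_id)
        if status in FINAL_STATES:
            return status
        # Stop instead of sleeping past the overall deadline.
        if clock() + interval_s > give_up_at:
            raise TaskTimeout(
                "task %s still %r after %.0f s" % (task_id, status, deadline_s)
            )
        sleep(interval_s)
```

With a deadline in place, a page whose task never leaves the queue produces an explicit error that can be logged and requeued, rather than a worker that silently stops making progress.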

Sometimes API calls fail and there's no retry:

Jan 06 17:38:43 77a9730d13bb ecs-rc-admin-master-pdf-processor.log 2021-01-06 15:38:43,581 run ERROR - Exception: [Errno 104] Connection reset by peer 
Jan 06 17:38:43 77a9730d13bb ecs-rc-admin-master-nrgi-queue.log 2021-01-06 15:38:43,581 run ERROR - Exception: [Errno 104] Connection reset by peer 
Jan 06 17:38:43 77a9730d13bb ecs-rc-admin-master-pdf-processor.log 2021-01-06 15:38:43,587 run DEBUG - ['Traceback (most recent call last):\n', '  File "/var/www/pdf-processor/run.py", line 35, in <module>\n    pdfProcessor.extractTextFromScannedDoc()\n', '  File "/var/www/pdf-processor/PdfProcessor.py", line 94, in extractTextFromScannedDoc\n    abbyyPdf.extractPages();\n', '  File "/var/www/pdf-processor/abbyy/AbbyyPdfTextExtractor.py", line 67, in extractPages\n    self.processPdfPage(page)\n', '  File "/var/www/pdf-processor/abbyy/AbbyyPdfTextExtractor.py", line 33, in processPdfPage\n    task = self.processor.ProcessImage(infile, settings)\n', '  File "/var/www/pdf-processor/abbyy/AbbyyOnlineSdk.py", line 52, in ProcessImage\n    response = self.getOpener().open(request, bodyParams).read()\n', '  File "/usr/lib/python2.7/urllib2.py", line 404, in open\n    response = self._open(req, data)\n', '  File "/usr/lib/python2.7/urllib2.py", line 422, in _open\n    \'_open\', req)\n', '  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain\n    result = func(*args)\n', '  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open\n    return self.do_open(httplib.HTTPConnection, req)\n', '  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open\n    r = h.getresponse(buffering=True)\n', '  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse\n    response.begin()\n', '  File "/usr/lib/python2.7/httplib.py", line 444, in begin\n    version, status, reason = self._read_status()\n', '  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status\n    line = self.fp.readline(_MAXLINE + 1)\n', '  File "/usr/lib/python2.7/socket.py", line 476, in readline\n    data = self._sock.recv(self._rbufsize)\n', 'error: [Errno 104] Connection reset by peer\n']

The processing issues seem to arise from network connectivity problems, which pdf-processor does not handle well. Ideally, it should switch to the more modern requests Python library and add timeouts and retries for API calls.
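As a rough illustration of the retry part of that suggestion, here is a library-agnostic sketch that retries a call on transient network errors with exponential backoff. `call_with_retries` and its parameters are hypothetical names, not part of pdf-processor or of requests, and the sketch assumes Python 3, where `[Errno 104] Connection reset by peer` surfaces as `ConnectionResetError`, a subclass of the built-in `ConnectionError`.

```python
import random
import socket
import time


def call_with_retries(call, attempts=4, base_delay_s=1.0, max_delay_s=30.0,
                      retry_on=(ConnectionError, socket.timeout),
                      sleep=time.sleep):
    """Run call() and retry transient network failures with exponential backoff.

    call is any zero-argument callable that performs one API request.
    retry_on lists the exception types treated as transient.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (1 s, 2 s, 4 s, ...) up to max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter keeps several workers from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
```

With requests, the wrapped call could look like `call_with_retries(lambda: session.post(url, data=body, timeout=(5, 60)), retry_on=(requests.ConnectionError, requests.Timeout))`, where `session`, `url`, `body`, and the timeout values are placeholders. One caveat: retrying a submission request may create a duplicate task if the first attempt reached the server before the connection dropped, so status queries are safer to retry than uploads.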

charlesyoung commented 3 years ago

Thanks @iprunache

@anjesh what is involved regarding analysing and addressing this issue?

charlesyoung commented 3 years ago

The communication code needs to be revamped to use more modern libraries and the latest Abbyy API.

charlesyoung commented 3 years ago

70% completed.