ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

Performance (indexing and OCR) problems #1506

ALT-MOROSO closed this issue 1 month ago

ALT-MOROSO commented 2 months ago

Hello ICIJ team!

I have two questions/problems today:

- When working with Datashare, in both desktop and server mode, I noticed that Tesseract's OCR seems to be limited to analysing 4 files per second. Do you know how to raise this limit?

Thank you very much! Have a good one! 👍

ClemDoum commented 2 months ago

Hi @ALT-MOROSO,

To speed up indexing, you can use the --parallelism option documented here. It's also available in the settings for local mode.

As for speeding up the Tesseract part, the answer is not straightforward.

Tesseract has multithreading capabilities, as it can leverage OpenMP. However, it's not easy to tell whether multithreading will actually improve performance.

According to https://github.com/tesseract-ocr/tesseract/issues/3744 it's not clear if enabling multithreading will speed things up.

On the other hand, the tessdoc suggests that you could use the OMP_THREAD_LIMIT (and probably OMP_NUM_THREADS) env vars to use more threads and speed things up. Beware that adding more threads often has an overhead, and at some point it will become slower than running with fewer threads.
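Since the effect of these variables depends on the machine and the documents, one option is to time the same OCR command under a few values and compare. Below is a minimal sketch of that idea, not something Datashare provides: the helper names `time_with_thread_limit` and `compare_thread_limits` are made up for illustration, and it only sets OMP_THREAD_LIMIT (OMP_NUM_THREADS could be set the same way). It assumes the variable is read from the environment of whichever process ends up running Tesseract.

```python
import os
import subprocess
import time


def time_with_thread_limit(command, thread_limit):
    """Run `command` once with OMP_THREAD_LIMIT set; return elapsed seconds.

    `command` is a list of arguments, as accepted by subprocess.run.
    """
    # Copy the current environment and override only the OpenMP variable.
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)

    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(
        command,
        env=env,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def compare_thread_limits(command, limits=(1, 2, 4)):
    """Return {thread_limit: elapsed_seconds} for the same command."""
    return {limit: time_with_thread_limit(command, limit) for limit in limits}
```

For instance, `compare_thread_limits(["tesseract", "sample.png", "out"])` would time a standalone Tesseract run on a placeholder image `sample.png` for each limit. A single run per value is noisy, so repeating the measurement on a representative batch of files gives a more reliable picture.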

So to sum up:

  1. try increasing Datashare's parallelism using the --parallelism flag
  2. try experimenting with OMP_THREAD_LIMIT and OMP_NUM_THREADS (with no guarantee)

ALT-MOROSO commented 2 months ago

Thank you very much!

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 40 days with no activity.