Performance (indexing and OCR) problems

ALT-MOROSO commented 2 months ago

Hello ICIJ team!

I have two questions/problems today:

- When working with DATASHARE, in both desktop and server mode, I figured out that Tesseract's OCR is programmed to analyse 4 files per second. Do you know how to raise this limit?

- After installing DATASHARE on a good Windows machine (Core i7, 16 GB RAM, 2.3 GHz, SSD), more powerful than the one mentioned in your docs, I did not notice any performance gain compared with my previous, less powerful machine. Do you know how to boost the indexing and OCR tasks?

Thank you very much! Have a good one! 👍

ClemDoum commented 2 months ago

Hi @ALT-MOROSO,

To speed up indexing, you can use the --parallelism option documented here. It's also available in the settings for local mode.

As for speeding up the Tesseract part, the answer is not straightforward.

Tesseract has multithreading capabilities, as it can leverage OpenMP; however, it's not easy to tell whether multithreading will actually improve performance.

According to https://github.com/tesseract-ocr/tesseract/issues/3744 it's not clear if enabling multithreading will speed things up.

On the other hand, the tessdoc suggests that you could use the OMP_THREAD_LIMIT (and probably OMP_NUM_THREADS) env vars to use more threads and speed things up. Beware that adding more threads often has an overhead, and at some point it becomes slower than running with fewer threads.

So to sum up:

- try to increase Datashare parallelism using the --parallelism flag
- try to play with OMP_THREAD_LIMIT and OMP_NUM_THREADS (with no guarantee)
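Since the effect of the OpenMP variables depends on the machine, one way to check it is to time the tesseract command on its own with a few different thread limits before changing anything in Datashare. The Python sketch below is only an illustration under assumptions: it assumes the `tesseract` binary is on the PATH, `sample.png` is a hypothetical test image, and the helper names (`omp_env`, `time_ocr`, `benchmark`) are made up for this example. It measures Tesseract in isolation, not Datashare's own pipeline.

```python
import os
import shutil
import subprocess
import time


def omp_env(thread_limit, base_env=None):
    """Return a copy of the environment with the OpenMP thread variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    env["OMP_NUM_THREADS"] = str(thread_limit)
    return env


def time_ocr(image_path, thread_limit):
    """Run the tesseract CLI once on image_path and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        ["tesseract", image_path, "stdout"],  # write recognised text to stdout
        env=omp_env(thread_limit),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return time.perf_counter() - start


def benchmark(image_path, limits=(1, 2, 4)):
    """Time one OCR run per thread limit; returns {limit: seconds}."""
    return {n: time_ocr(image_path, n) for n in limits}


if __name__ == "__main__":
    sample = "sample.png"  # hypothetical test image, replace with a real scan
    if shutil.which("tesseract") and os.path.exists(sample):
        for n, seconds in benchmark(sample).items():
            print(f"OMP_THREAD_LIMIT={n}: {seconds:.2f}s")
    else:
        print("tesseract or sample.png not found; nothing to measure")
```

If a limit of 1 turns out to be as fast as or faster than the higher values, it is probably better to rely on --parallelism alone and keep each Tesseract process single-threaded, so that parallel OCR jobs do not compete for the same cores.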

ALT-MOROSO commented 2 months ago

Thank you very much !!

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 40 days with no activity.

ICIJ / datashare

Performance (indexing and OCR) problems #1506