Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

Run prediction without using multiprocessing #273

Closed: timothydereuse closed this issue 3 years ago

timothydereuse commented 3 years ago

I use Calamari's Python API as part of a larger application that schedules tasks using the Celery queue system. I've been upgrading our environment from Calamari 1.0 to 2.1, and I now get an error caused by parallel processing in Calamari: Celery runs tasks in daemonic worker processes, which are not allowed to start child processes through the Python multiprocessing library.

The use of Calamari in our code is just these few lines (summarized here):

# Import path assumed for Calamari 2.x
from calamari_ocr.ocr.predict.predictor import MultiPredictor, PredictorParams

# `ocr_model_paths` is a list of paths to trained OCR models
predictor = MultiPredictor.from_paths(checkpoints=ocr_model_paths, params=PredictorParams())

# `strips` is a list of NumPy arrays containing text lines
results = []
for r in predictor.predict_raw(strips):
    results.append(r)

This works fine outside of Celery. When the same code runs as a Celery task, though, we get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 771, in run
    retval = self.run_my_task(inputs, settings, arg_outputs)
  File "/code/Rodan/rodan/jobs/text_alignment/text_alignment.py", line 76, in run_my_task
    result = align.process(raw_image, transcript, model_name)
  File "/code/Rodan/rodan/jobs/text_alignment/align_to_ocr.py", line 83, in process
    all_chars = perform_ocr.recognize_text_strips(image, cc_strips, ocr_model_name, verbose)
  File "/code/Rodan/rodan/jobs/text_alignment/perform_ocr.py", line 94, in recognize_text_strips
    for r in predictor.predict_raw(strips):
  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/predict/predictor.py", line 114, in predict_pipeline
    for r in post_proc_pipeline.apply(results, run_parallel=False):
  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/sample/processorpipeline.py", line 97, in _apply
    for x in super().apply(samples):
  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/sample/processorpipeline.py", line 108, in _apply
    with parallel_pipeline as output_generator:
  File "/usr/local/lib/python3.7/dist-packages/tfaip/util/multiprocessing/data/pipeline.py", line 63, in __enter__
    maxtasksperchild=self.max_tasks_per_child,
  File "/usr/local/lib/python3.7/dist-packages/tfaip/util/multiprocessing/data/pool.py", line 60, in __init__
    super().__init__(initializer=Initializer(worker_constructor), **kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 110, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
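
The final assertion in the traceback is raised by Python's multiprocessing module itself, not by Calamari or Celery: a process started with daemon=True may not start children of its own. The following minimal sketch reproduces that behaviour with the standard library alone. The function names are illustrative, the "fork" start method is assumed (POSIX only), and because the check is an assert statement it would not fire when Python runs with -O.

```python
import multiprocessing as mp


def _noop():
    """Target for the grandchild process; it does nothing."""


def _try_to_spawn(conn):
    """Runs inside the worker and tries to start a child process of its own."""
    ctx = mp.get_context("fork")
    try:
        child = ctx.Process(target=_noop)
        child.start()  # raises AssertionError if the current process is daemonic
        child.join()
        conn.send("started a child")
    except AssertionError as exc:
        conn.send(f"AssertionError: {exc}")
    finally:
        conn.close()


def spawn_from_worker(daemon: bool) -> str:
    """Start a worker (daemonic or not) and report what happened when it spawned."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform with fork available
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    worker = ctx.Process(target=_try_to_spawn, args=(send_conn,), daemon=daemon)
    worker.start()
    send_conn.close()  # the parent keeps only the receiving end
    message = recv_conn.recv()
    worker.join()
    return message


if __name__ == "__main__":
    print("daemon=True :", spawn_from_worker(daemon=True))
    print("daemon=False:", spawn_from_worker(daemon=False))
```

A daemonic worker reports the same message as the traceback above, while a non-daemonic worker starts its child normally. This matches the report that the code works outside of Celery and fails only inside a Celery task.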

This was not an issue in Calamari 1.0, though I am not sure exactly what changed. In any case, speed is not a priority for us, and we do not need parallel processing of text lines. I have been trying to figure out if there is a way to set parameters within Calamari such that prediction does not use the multiprocessing library. In particular, https://github.com/Calamari-OCR/calamari/issues/263#issuecomment-860047394 mentions a way to disable the parallel pipeline, which I tried with these lines of code before running predict_raw():

predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False

But this resulted in the same error. Is it possible at all to run predictions in Calamari without any use of multiprocessing? (I have also been digging through the tfaip source, which led me to believe that this might be possible, but I am not sure if I have to fork Calamari to make it happen or if there's an easier way I'm overlooking.)

ChWick commented 3 years ago

Hi Timothy!

How are you doing?

In principle, Calamari should be able to run without any parallelization. The commands required should indeed be:

predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False

I will check, probably tomorrow, why this does not work and provide an update asap!

ChWick commented 3 years ago

Calamari 2.1.3 should fix this. Please let me know if this works for you!
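
For anyone applying this, the two flags quoted earlier in the thread can be wrapped in a small helper that is called once before predict_raw(). This is only a sketch: the helper name and the stand-in object are hypothetical, the attribute paths are the ones given in this thread, and according to the thread they take effect only from Calamari 2.1.3 onward.

```python
from types import SimpleNamespace


def disable_parallel_pipelines(predictor):
    """Set run_parallel = False on the pre- and post-processing pipeline params.

    The attribute paths (predictor.data.params.pre_proc / post_proc) are taken
    from this thread and are not verified against a specific Calamari release.
    """
    params = predictor.data.params
    for stage_name in ("pre_proc", "post_proc"):
        getattr(params, stage_name).run_parallel = False
    return predictor


def make_stand_in_predictor():
    """Hypothetical stand-in with the same attribute layout, usable without Calamari."""
    return SimpleNamespace(
        data=SimpleNamespace(
            params=SimpleNamespace(
                pre_proc=SimpleNamespace(run_parallel=True),
                post_proc=SimpleNamespace(run_parallel=True),
            )
        )
    )


if __name__ == "__main__":
    stand_in = disable_parallel_pipelines(make_stand_in_predictor())
    print(stand_in.data.params.pre_proc.run_parallel)
    print(stand_in.data.params.post_proc.run_parallel)
```

With a real predictor, the helper would be applied to the object returned by MultiPredictor.from_paths(...) before iterating over predictor.predict_raw(strips).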

timothydereuse commented 2 years ago

Took me a while to get to testing it thoroughly, but it looks like this did the trick! Thanks so much for doing this so quickly, Christoph.

ChWick commented 2 years ago

You are welcome!