Multipage Calvo Trainer failed in Rodan-staging with 5 images

carrieeex commented 3 years ago

I was trying to run the Multipage Calvo Trainer (Training model for Patchwise Analysis of Music Document) in Rodan-staging with 5 input images, each with 3 RGBA layer inputs: Layer 0 (background), Layer 1, and Selected Regions, which come from the Pixel.js job in another workflow (all related files are attached below). It failed with the following error:

Error summary: InvalidArgumentError: output dimensions must be positive [[node functional_3/up_sampling2d/resize/ResizeNearestNeighbor (defined at code/Rodan/rodan/jobs/Calvo_classifier/training_engine_sae.py:227) ]] [Op:__inference_train_function_84115] Function call stack: train_function

The error details are:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 771, in run
    retval = self.run_my_task(inputs, settings, arg_outputs)
  File "/code/Rodan/rodan/jobs/Calvo_classifier/fast_calvo_trainer.py", line 186, in run_my_task
    batch_size=batch_size,
  File "/code/Rodan/rodan/jobs/Calvo_classifier/training_engine_sae.py", line 227, in train_msae
    epochs=epochs,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  output dimensions must be positive
     [[node functional_3/up_sampling2d/resize/ResizeNearestNeighbor (defined at code/Rodan/rodan/jobs/Calvo_classifier/training_engine_sae.py:227) ]] [Op:__inference_train_function_84115]

Function call stack:
train_function

To replicate this issue:

In the workflow I used, the input ports are image, Layer 0 (background), Layer 1, and Selected Regions (five inputs each). I tried it with Salzinnes folios 006r, 066v, 106r, 166v, A06r, which can be found in my project in Rodan-staging (shared with devs) or here.

The settings for the Calvo Trainer were:

Maximum number of samples per label: 100
Patch width: 32
Patch height: 32
Maximum number of training epochs: 5
Batch Size: 1

carrieeex commented 3 years ago

The inputs need to be assigned in order, similar to the diagram for testing the Calvo trainer in staging (port ordering). (Thanks to @martha-thomae for the screenshot!)

carrieeex commented 3 years ago

For the devs (@kemalkongar @raviraina @GabbyHalpin) who will look into this: The project I shared in Rodan-staging is TEST_staging_fromJiali, and the workflow I used was Calvo Trainer 5 images. (You could also try the workflow Patchwise (5) TRY random inputs order; it's the same, with an additional labeler and 5 PNG jobs for the image inputs. I've tried to run it with the exact same inputs, but it keeps processing and never seems to end.)

The workflow run that failed is named Patchwise (5) 006r, 066v, 106r, 166v, A06r, and the one that keeps processing is Patchwise (5) TRY 2.0.

For the inputs, image is the resized image; Layer 0 (background) is the NonPageLayer, Layer 1 is the PageLayer, and Selected Regions is the SelectedLayer in the resources.
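
Since the inputs have to be assigned in the same page order on every port, a small pre-flight check might help whoever reproduces this. The sketch below is only an illustration: the resource file names (for example `006r_PageLayer.png`) and the `check_port_order` helper are hypothetical and are not part of Rodan or the Calvo trainer. It assumes each resource name contains its folio id.

```python
import re

# Assumption: every resource name contains a folio id such as 006r or A06r.
FOLIO_RE = re.compile(r"[0-9A-Z]\d{2}[rv]")


def folio_of(name):
    """Return the folio id found in a resource name, or None if there is none."""
    match = FOLIO_RE.search(name)
    return match.group(0) if match else None


def check_port_order(ports):
    """Compare the folio ids position by position across all ports.

    `ports` maps a port name to its list of resource names, in the order
    they were assigned. Returns a list of (position, {port: folio}) tuples
    for every position where the ports disagree or a folio id is missing.
    """
    names = list(ports)
    lengths = {name: len(ports[name]) for name in names}
    if len(set(lengths.values())) != 1:
        raise ValueError("ports have different numbers of inputs: {!r}".format(lengths))

    mismatches = []
    for position, row in enumerate(zip(*(ports[name] for name in names))):
        folios = [folio_of(resource) for resource in row]
        if None in folios or len(set(folios)) != 1:
            mismatches.append((position, dict(zip(names, folios))))
    return mismatches


# Hypothetical example with two pages; Layer 1 has its two inputs swapped.
example_ports = {
    "Image": ["006r.png", "066v.png"],
    "Layer 0 (background)": ["006r_NonPageLayer.png", "066v_NonPageLayer.png"],
    "Layer 1": ["066v_PageLayer.png", "006r_PageLayer.png"],
    "Selected Regions": ["006r_SelectedLayer.png", "066v_SelectedLayer.png"],
}

for position, folios in check_port_order(example_ports):
    print("position", position, "is inconsistent:", folios)
```

An empty result only says the folio ids line up; it does not confirm that the right layer type is on the right port.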

kemalkongar commented 3 years ago

I will start looking into this as soon as HPC Fast Trainer is stable, thanks for the detailed issue.

carrieeex commented 3 years ago

Note: @martha-thomae has tried the same job (Multipage Calvo Trainer) with 2 images and their layers, and it finished successfully (so it works with 2 images).

kemalkongar commented 3 years ago

@deepio @napulen It may be a better idea to try to implement OrderedDict in Rodan, assuming it's a relatively easy (1-2 day) task rather than try to debug this and hope there isn't any human error. Because I can assure you, I will make at least 1 mistake testing this with 5 inputs, given the shifting names.

carrieeex commented 3 years ago

> @deepio @napulen It may be a better idea to try to implement OrderedDict in Rodan, assuming it's a relatively easy (1-2 day) task rather than try to debug this and hope there isn't any human error. Because I can assure you, I will make at least 1 mistake testing this with 5 inputs, given the shifting names.

I agree! Making sure the inputs were in the right order was time-consuming. The issue about the switched input order is here: https://github.com/DDMAL/Rodan/issues/615.

napulen commented 3 years ago

> @deepio @napulen It may be a better idea to try to implement OrderedDict in Rodan, assuming it's a relatively easy (1-2 day) task rather than try to debug this and hope there isn't any human error. Because I can assure you, I will make at least 1 mistake testing this with 5 inputs, given the shifting names.

I don't expect a ctrl+h of dict()s into OrderedDict()s to make any noise or create any problems. Maybe the hardest part is finding all instances of dictionaries so that you don't accidentally leave some unordered dictionaries throughout.

Maybe, just maybe, some issues related to serialization could come up. Hopefully OrderedDicts are also serializable and will replace dicts without issue.

From a library perspective, OrderedDicts need no additional external packages (pip installs), just additional imports. No objections on my end to add those.
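
On the serialization question, here is a minimal sketch using only the standard library. It assumes the data ends up going through `json` or `pickle`; how Rodan actually serializes its inputs is not something this example knows, and the port-to-resource mapping is made up for illustration.

```python
import json
import pickle
from collections import OrderedDict

# Hypothetical port-to-resource mapping, in the order the ports were assigned.
ports = OrderedDict([
    ("Image", "006r.png"),
    ("Layer 0 (background)", "006r_NonPageLayer.png"),
    ("Layer 1", "006r_PageLayer.png"),
    ("Selected Regions", "006r_SelectedLayer.png"),
])

# json.dumps accepts an OrderedDict (it is a dict subclass) and writes the
# keys in insertion order as long as sort_keys is left at its default.
encoded = json.dumps(ports)

# json.loads returns a plain dict by default; object_pairs_hook asks it to
# build an OrderedDict from the key/value pairs in document order instead.
decoded = json.loads(encoded, object_pairs_hook=OrderedDict)

# pickle round-trips an OrderedDict directly, order included.
restored = pickle.loads(pickle.dumps(ports))

print(list(decoded) == list(ports))
print(list(restored) == list(ports))
```

The part that needs attention is the decoding side: without `object_pairs_hook`, anything read back from JSON is a plain dict again.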

kemalkongar commented 3 years ago

Then we'll look into this next week (please bring it up at the scrum since I'll be gone!). I've also read that OrderedDict may be slightly inefficient in Python 2 but it requires further reading.
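
To check the efficiency concern on a given interpreter instead of guessing, something like the following rough timing sketch could be run. The `fill` and `compare` helpers are made up for this example, and the numbers depend on the Python version and the machine, so no expected output is given.

```python
import timeit
from collections import OrderedDict


def fill(mapping_type, size=1000):
    """Insert `size` integer keys into a new mapping of the given type."""
    mapping = mapping_type()
    for i in range(size):
        mapping[i] = i
    return mapping


def compare(repeats=200):
    """Return the total seconds spent filling each mapping type `repeats` times."""
    return {
        "dict": timeit.timeit(lambda: fill(dict), number=repeats),
        "OrderedDict": timeit.timeit(lambda: fill(OrderedDict), number=repeats),
    }


if __name__ == "__main__":
    for name, seconds in sorted(compare().items()):
        print("{}: {:.4f} s".format(name, seconds))
```

This only measures insertion; lookups and iteration would need their own timings if they matter for Rodan.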

napulen commented 3 years ago

@timothydereuse is leading the next scrum, I think

DDMAL / Calvo_classifier

Multipage Calvo Trainer failed in Rodan-staging with 5 images #55