OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License
72 stars 17 forks source link

Builds without TensorFlow #147

Open stweil opened 4 years ago

stweil commented 4 years ago

Pre-build packages of the different version of TF don't exist for every platform (for example not with Python 3.8, not for non-Intel hosts). A typical error message which terminates make all looks like this:

ERROR: No matching distribution found for tensorflow-gpu==1.15.* (from ocrd-keraslm==0.3.2)

I think it would be good to support builds on such platforms, too, by skipping problematic modules, without requiring hacker methods like make all -k. Possible solutions:

The first one of these solution could be really user friendly, especially when it also shows an informative message.

bertsky commented 4 years ago

I wouldn't call make all -k a hack. It's exactly what you are asking for: keep building despite being unable to install some modules.

  • make all could detect whether required packages are available and skip modules which need unavailable packages.

That's just make all -k.

  • make all could support a macro to disable groups of modules (like for example all modules which depend on TF1). It is already possible to define the desired modules, but that requires much knowledge about the different modules.

Yes, but modules keep changing, and new ones get added. For some extra level of representation, someone would have to aggregate and keep managing all this extra knowledge.

stweil commented 4 years ago

I wouldn't call make all -k a hack. It's exactly what you are asking for: keep building despite being unable to install some modules.

make all -k is not the right solution, because it would waste build time for known problems and hide unknown problems.

Yes, but modules keep changing, and new ones get added. For some extra level of representation, someone would have to aggregate and keep managing all this extra knowledge.

We already aggregate the knowledge about modules which depend on TF.

bertsky commented 4 years ago

make all -k is not the right solution, because it would waste build time for known problems and hide unknown problems.

Good point.

We already aggregate the knowledge about modules which depend on TF.

True.

kba commented 4 years ago

make all could detect whether required packages are available and skip modules which need unavailable packages.

While I disagree with the idea in general (tensorflow should provide packages for Python 3.8 and different architectures! that cannot be so difficult, they have huge corporate backers!) but I see the need for a workaround and a solution that adresses this problem (as you propose) is preferable to brute-forcing with make -k.

stweil commented 4 years ago

tensorflow should provide packages for Python 3.8 and different architectures

"We don't add new versions of Python for old releases. If you want to use TF 1.15, you have to use Python 3.7 or lower. If you want to use Python 3.8 you have to use TF 2.2 or newer." (see source). So even TF 2.1 won't be supported with Python 3.7.

Unless all processors work with the latest TF, too, we'll have such problems.

bertsky commented 4 years ago

tensorflow should provide packages for Python 3.8 and different architectures

"We don't add new versions of Python for old releases. If you want to use TF 1.15, you have to use Python 3.7 or lower. If you want to use Python 3.8 you have to use TF 2.2 or newer." (see source). So even TF 2.1 won't be supported with Python 3.7.

Unless all processors work with the latest TF, too, we'll have such problems.

Wow. TF just keeps scaring users away on an epic scale. But why hasn't this seen more downvotes? (@stweil thanks for sharing!)

bertsky commented 4 years ago
* `make all` could support a macro to disable groups of modules (like for example all modules which depend on TF1). It is already possible to define the desired modules, but that requires much knowledge about the different modules.
  • make all could detect whether required packages are available and skip modules which need unavailable packages.

While I disagree with the idea in general (tensorflow should provide packages for Python 3.8 and different architectures! that cannot be so difficult, they have huge corporate backers!) but I see the need for a workaround and a solution that adresses this problem (as you propose) is preferable to brute-forcing with make -k.

So, do we

  1. make this automatic and narrow – checking for Python 3.8 only and automatically skipping all TF modules –, or
  2. manual and agnostic – offering a pre-defined (say) NO_TENSORFLOW so users can make all NO_TENSORFLOW=1?
stweil commented 4 years ago

In the light of the statement cited above and to cover not only Python 3.8 environments but also non-Intel hosts, I think we need separate checks for each sub-env. Ideally a build with Python 3.8 on Intel compatible hardware would then build processors which uses TF 2.2 or newer, but skip those depending on older TF versions. On ARM, all processors which depend on TF would be skipped unless there is a package source which provides TF packages optimized for the hardware. NVIDIA for example provides TF packages for their Jetson developer boards (of course again not all required versions of TF).

bertsky commented 4 years ago

In the light of the statement cited above and to cover not only Python 3.8 environments but also non-Intel hosts, I think we need separate checks for each sub-env. Ideally a build with Python 3.8 on Intel compatible hardware would then build processors which uses TF 2.2 or newer, but skip those depending on older TF versions. On ARM, all processors which depend on TF would be skipped unless there is a package source which provides TF packages optimized for the hardware. NVIDIA for example provides TF packages for their Jetson developer boards (of course again not all required versions of TF).

Ok, this all sounds like 1 – automatic, specific checks that disable modules which have no chance. Checking Python or TF version is simple, but checking for Intel-compatible hardware can be quite intricate (as can be seen in autoconf's host checking)...

stweil commented 4 years ago

The test is much simpler: just try to install one of the required Python packages (keras, tensorboard, tensorflow, ...).

stweil commented 4 years ago

Side note: I just noticed that Homebrew on macOS updated from Python 3.7 to 3.8 recently. Normally I'd say :-) With OCR-D I feel more like :-(

That increases the importance of this issue.

bertsky commented 4 years ago

The test is much simpler: just try to install one of the required Python packages (keras, tensorboard, tensorflow, ...).

You mean in the Makefile, bypassing the module's setup.py or Makefile? That would go against encapsulation. We'd have to double check the ocrd_all Makefile every time the module gets updated. Also, this would not actually save any installation time over the simple make all -k approach.

stweil commented 4 years ago

Yes, I mean in the Makefile. A sub-env for TensorFlow 1 like the current headless-tf1 can try to install a Python module for TensorFlow 1 without anything going more against encapsulation than we already have to do because of the known package conflicts. Trying to install a single module is much faster than installing lots of other Python modules from the different requirement lists before failing.

bertsky commented 4 years ago

A sub-env for TensorFlow 1 like the current headless-tf1 can try to install a Python module for TensorFlow 1 without anything going more against encapsulation than we already have to do because of the known package conflicts.

Agreed.

Trying to install a single module is much faster than installing lots of other Python modules from the different requirement lists before failing.

I don't think that difference will be noticable though. Torch, TF and Keras are the largest among the dependencies usually.

So our recipes would look like this:

$(call multirule,$(OCRD_COR_ASV_ANN)): cor-asv-ann
ifeq (0,$(MAKELEVEL))
    $(MAKE) -B -o $< $(notdir $(OCRD_COR_ASV_ANN))
    $(call delegate_venv,$(OCRD_COR_ASV_ANN))
$(OCRD_COR_ASV_ANN): VIRTUAL_ENV := $(SUB_VENV)/headless-tf1
else
    . $(ACTIVATE_VENV) && $(PIP) install tensorflow==1.15.2 opencv-python-headless
    $(pip_install)
endif
endif

But then we still need to run with make -k. How do you want to back out of the error?

stweil commented 4 years ago

our recipes would look like this

No, that would still try installing TF1 several times. The ideal solution would try it once if VIRTUAL_ENV points to headless-tf1, set a macro, and use that macro to enable / disable all parts which depend on TF1.

bertsky commented 4 years ago

our recipes would look like this

No, that would still try installing TF1 several times.

No it would not. Remember, we use multirule above, so this will be tried once per module (which can be nearly equated with once per venv).

The ideal solution would try it once if VIRTUAL_ENV points to headless-tf1, set a macro, and use that macro to enable / disable all parts which depend on TF1.

Huh? How do you set a macro dependent on the result of a recipe? Makefiles are executed in 2 passes, variable expansion and dependency tree comes before rule application.

You probably meant introducing another conditional into the rules (with a shell call). But that would not change how often the attempt is made. (It's definitely always tried next time make is invoked.) And you still need to back out of creating the delegating shell script if the sub-make failed.

kba commented 4 years ago

No, that would still try installing TF1 several times. The ideal solution would try it once if VIRTUAL_ENV points to headless-tf1, set a macro, and use that macro to enable / disable all parts which depend on TF1.

That sounds reasonable though I cannot quite imagine how that would work but am open to see a proof-of-concept. I would still see the merit in a NO_TENSORFLOW option for testing&development btw.

So our recipes would look like this:

$(call multirule,$(OCRD_COR_ASV_ANN)): cor-asv-ann
ifeq (0,$(MAKELEVEL))
  $(MAKE) -B -o $< $(notdir $(OCRD_COR_ASV_ANN))
  $(call delegate_venv,$(OCRD_COR_ASV_ANN))
$(OCRD_COR_ASV_ANN): VIRTUAL_ENV := $(SUB_VENV)/headless-tf1
else
  . $(ACTIVATE_VENV) && $(PIP) install tensorflow==1.15.2 opencv-python-headless
  $(pip_install)
endif
endif

But then we still need to run with make -k. How do you want to back out of the error?

Indeed, what to do next (honestly asking, finding it hard to grasp the consequences of these changes)?

stweil commented 2 years ago

That sounds reasonable though I cannot quite imagine how that would work but am open to see a proof-of-concept. I would still see the merit in a NO_TENSORFLOW option for testing&development btw.

That proof-of-concept is now realized in PR #295.

stweil commented 2 years ago

I just read the Dependabot alerts for cor-asv-ann. It lists 218 (!) alerts for tensorflow-gpu with 3 of them marked critical and 60 of them marked high. That's a nightmare. And most alerts are only fixed in tensorflow>2.5.2.

cneud commented 2 years ago

Since the situation with modules still requiring tf1 is indeed a nightmare, could we "soft" migrate those modules to tf2 using tf.compat.v1 and thoroughly test that for regressions?

bertsky commented 2 years ago

I just read the Dependabot alerts for cor-asv-ann. It lists 218 (!) alerts for tensorflow-gpu with 3 of them marked critical and 60 of them marked high. That's a nightmare. And most alerts are only fixed in tensorflow>2.5.2.

which is why these alerts are useless.

Since the situation with modules still requiring tf1 is indeed a nightmare, could we "soft" migrate those modules to tf2 using tf.compat.v1 and thoroughly test that for regressions?

@cneud, finding regressions would be the hard thing to do in my case. cor-asv-ann / ocrd_keraslm / ocrd_segment do not have regression tests. (And numerical differences could be benign or not.) What's more, I would almost certainly have to repeat parts of the training process – a lot of work.

Also, are you sure it's just that first step of replacing TF1 API with compat.v1? In the migration guide the state:

  1. Run the automated script to convert your TF1.x API usage to tf.compat.v1.
  2. Remove old tf.contrib.layers and replace them with TF Slim symbols. Also check TF Addons for other tf.contrib symbols.
  3. Rewrite your TF1.x model forward passes to run in TF2 with eager execution enabled.
  4. Validate the accuracy and numerical correctness of your migrated code.
  5. Upgrade your training, evaluation and model saving code to TF2 equivalents.
  6. (Optional) Migrate your TF2-compatible tf.compat.v1 APIs including TF Slim usage to idiomatic TF2 APIs.

It's still not entirely clear to me under which circumstances the other steps apply...

cneud commented 2 years ago

It's still not entirely clear to me under which circumstances the other steps apply

Indeed, this is mainly what I also wonder about - how can we better assess the effort for migration to tf2? I suppose we have to make a start with migrating to tf2 at some point (also for @QURATOR-SPK), so what could be initial baby steps on the way. The issues with tf1 builds and Python 3.8 will probably also become worse over time. Maybe we can make a common effort between devs of the modules that require tf1, or plan that in for some time in the foreseeable future.

bertsky commented 2 years ago

Agreed! So in the end we'll just have to start trying – and back every step of the way up by some tests at least.

BTW, for 3.8 we have a solution already (see #289, which uses nvidia-tensorflow prebuilds). But 3.9+ is certainly never going to be supported by anyone. And it also became apparent recently that more and more dependencies of TF1 now start conflicting with other packages: https://github.com/OCR-D/ocrd_all/blob/63d13e4879ed0854f0cd7fb43e6a3819b9f7af14/Makefile#L634-L636

cneud commented 2 years ago

we'll just have to start trying – and back every step of the way up

Exactly, those conflicts and workarounds require discussion, fixes, testing as well and thereby more and more time too. Maybe trying out with most simple means can help weighing the cost of a migration vs this better.