Open stweil opened 4 years ago
I wouldn't call make all -k
a hack. It's exactly what you are asking for: keep building despite being unable to install some modules.
make all
could detect whether required packages are available and skip modules which need unavailable packages.
That's just make all -k
.
make all
could support a macro to disable groups of modules (like for example all modules which depend on TF1). It is already possible to define the desired modules, but that requires much knowledge about the different modules.
Yes, but modules keep changing, and new ones get added. For some extra level of representation, someone would have to aggregate and keep managing all this extra knowledge.
I wouldn't call make all -k a hack. It's exactly what you are asking for: keep building despite being unable to install some modules.
make all -k
is not the right solution, because it would waste build time for known problems and hide unknown problems.
Yes, but modules keep changing, and new ones get added. For some extra level of representation, someone would have to aggregate and keep managing all this extra knowledge.
We already aggregate the knowledge about modules which depend on TF.
make all -k
is not the right solution, because it would waste build time for known problems and hide unknown problems.
Good point.
We already aggregate the knowledge about modules which depend on TF.
True.
make all
could detect whether required packages are available and skip modules which need unavailable packages.
While I disagree with the idea in general (tensorflow should provide packages for Python 3.8 and different architectures! that cannot be so difficult, they have huge corporate backers!) but I see the need for a workaround and a solution that adresses this problem (as you propose) is preferable to brute-forcing with make -k
.
tensorflow should provide packages for Python 3.8 and different architectures
"We don't add new versions of Python for old releases. If you want to use TF 1.15, you have to use Python 3.7 or lower. If you want to use Python 3.8 you have to use TF 2.2 or newer." (see source). So even TF 2.1 won't be supported with Python 3.7.
Unless all processors work with the latest TF, too, we'll have such problems.
tensorflow should provide packages for Python 3.8 and different architectures
"We don't add new versions of Python for old releases. If you want to use TF 1.15, you have to use Python 3.7 or lower. If you want to use Python 3.8 you have to use TF 2.2 or newer." (see source). So even TF 2.1 won't be supported with Python 3.7.
Unless all processors work with the latest TF, too, we'll have such problems.
Wow. TF just keeps scaring users away on an epic scale. But why hasn't this seen more downvotes? (@stweil thanks for sharing!)
* `make all` could support a macro to disable groups of modules (like for example all modules which depend on TF1). It is already possible to define the desired modules, but that requires much knowledge about the different modules.
make all
could detect whether required packages are available and skip modules which need unavailable packages.While I disagree with the idea in general (tensorflow should provide packages for Python 3.8 and different architectures! that cannot be so difficult, they have huge corporate backers!) but I see the need for a workaround and a solution that adresses this problem (as you propose) is preferable to brute-forcing with
make -k
.
So, do we
NO_TENSORFLOW
so users can make all NO_TENSORFLOW=1
?In the light of the statement cited above and to cover not only Python 3.8 environments but also non-Intel hosts, I think we need separate checks for each sub-env. Ideally a build with Python 3.8 on Intel compatible hardware would then build processors which uses TF 2.2 or newer, but skip those depending on older TF versions. On ARM, all processors which depend on TF would be skipped unless there is a package source which provides TF packages optimized for the hardware. NVIDIA for example provides TF packages for their Jetson developer boards (of course again not all required versions of TF).
In the light of the statement cited above and to cover not only Python 3.8 environments but also non-Intel hosts, I think we need separate checks for each sub-env. Ideally a build with Python 3.8 on Intel compatible hardware would then build processors which uses TF 2.2 or newer, but skip those depending on older TF versions. On ARM, all processors which depend on TF would be skipped unless there is a package source which provides TF packages optimized for the hardware. NVIDIA for example provides TF packages for their Jetson developer boards (of course again not all required versions of TF).
Ok, this all sounds like 1 – automatic, specific checks that disable modules which have no chance. Checking Python or TF version is simple, but checking for Intel-compatible hardware can be quite intricate (as can be seen in autoconf's host checking)...
The test is much simpler: just try to install one of the required Python packages (keras, tensorboard, tensorflow, ...).
Side note: I just noticed that Homebrew on macOS updated from Python 3.7 to 3.8 recently. Normally I'd say :-) With OCR-D I feel more like :-(
That increases the importance of this issue.
The test is much simpler: just try to install one of the required Python packages (keras, tensorboard, tensorflow, ...).
You mean in the Makefile, bypassing the module's setup.py
or Makefile
? That would go against encapsulation. We'd have to double check the ocrd_all Makefile every time the module gets updated. Also, this would not actually save any installation time over the simple make all -k
approach.
Yes, I mean in the Makefile. A sub-env for TensorFlow 1 like the current headless-tf1
can try to install a Python module for TensorFlow 1 without anything going more against encapsulation than we already have to do because of the known package conflicts. Trying to install a single module is much faster than installing lots of other Python modules from the different requirement lists before failing.
A sub-env for TensorFlow 1 like the current
headless-tf1
can try to install a Python module for TensorFlow 1 without anything going more against encapsulation than we already have to do because of the known package conflicts.
Agreed.
Trying to install a single module is much faster than installing lots of other Python modules from the different requirement lists before failing.
I don't think that difference will be noticable though. Torch, TF and Keras are the largest among the dependencies usually.
So our recipes would look like this:
$(call multirule,$(OCRD_COR_ASV_ANN)): cor-asv-ann
ifeq (0,$(MAKELEVEL))
$(MAKE) -B -o $< $(notdir $(OCRD_COR_ASV_ANN))
$(call delegate_venv,$(OCRD_COR_ASV_ANN))
$(OCRD_COR_ASV_ANN): VIRTUAL_ENV := $(SUB_VENV)/headless-tf1
else
. $(ACTIVATE_VENV) && $(PIP) install tensorflow==1.15.2 opencv-python-headless
$(pip_install)
endif
endif
But then we still need to run with make -k
. How do you want to back out of the error?
our recipes would look like this
No, that would still try installing TF1 several times. The ideal solution would try it once if VIRTUAL_ENV
points to headless-tf1
, set a macro, and use that macro to enable / disable all parts which depend on TF1.
our recipes would look like this
No, that would still try installing TF1 several times.
No it would not. Remember, we use multirule
above, so this will be tried once per module (which can be nearly equated with once per venv).
The ideal solution would try it once if
VIRTUAL_ENV
points toheadless-tf1
, set a macro, and use that macro to enable / disable all parts which depend on TF1.
Huh? How do you set a macro dependent on the result of a recipe? Makefiles are executed in 2 passes, variable expansion and dependency tree comes before rule application.
You probably meant introducing another conditional into the rules (with a shell
call). But that would not change how often the attempt is made. (It's definitely always tried next time make
is invoked.) And you still need to back out of creating the delegating shell script if the sub-make failed.
No, that would still try installing TF1 several times. The ideal solution would try it once if VIRTUAL_ENV points to headless-tf1, set a macro, and use that macro to enable / disable all parts which depend on TF1.
That sounds reasonable though I cannot quite imagine how that would work but am open to see a proof-of-concept. I would still see the merit in a NO_TENSORFLOW
option for testing&development btw.
So our recipes would look like this:
$(call multirule,$(OCRD_COR_ASV_ANN)): cor-asv-ann ifeq (0,$(MAKELEVEL)) $(MAKE) -B -o $< $(notdir $(OCRD_COR_ASV_ANN)) $(call delegate_venv,$(OCRD_COR_ASV_ANN)) $(OCRD_COR_ASV_ANN): VIRTUAL_ENV := $(SUB_VENV)/headless-tf1 else . $(ACTIVATE_VENV) && $(PIP) install tensorflow==1.15.2 opencv-python-headless $(pip_install) endif endif
But then we still need to run with
make -k
. How do you want to back out of the error?
Indeed, what to do next (honestly asking, finding it hard to grasp the consequences of these changes)?
That sounds reasonable though I cannot quite imagine how that would work but am open to see a proof-of-concept. I would still see the merit in a
NO_TENSORFLOW
option for testing&development btw.
That proof-of-concept is now realized in PR #295.
I just read the Dependabot alerts for cor-asv-ann
. It lists 218 (!) alerts for tensorflow-gpu
with 3 of them marked critical and 60 of them marked high. That's a nightmare. And most alerts are only fixed in tensorflow>2.5.2
.
Since the situation with modules still requiring tf1 is indeed a nightmare, could we "soft" migrate those modules to tf2 using tf.compat.v1
and thoroughly test that for regressions?
I just read the Dependabot alerts for
cor-asv-ann
. It lists 218 (!) alerts fortensorflow-gpu
with 3 of them marked critical and 60 of them marked high. That's a nightmare. And most alerts are only fixed intensorflow>2.5.2
.
which is why these alerts are useless.
Since the situation with modules still requiring tf1 is indeed a nightmare, could we "soft" migrate those modules to tf2 using
tf.compat.v1
and thoroughly test that for regressions?
@cneud, finding regressions would be the hard thing to do in my case. cor-asv-ann / ocrd_keraslm / ocrd_segment do not have regression tests. (And numerical differences could be benign or not.) What's more, I would almost certainly have to repeat parts of the training process – a lot of work.
Also, are you sure it's just that first step of replacing TF1 API with compat.v1? In the migration guide the state:
- Run the automated script to convert your TF1.x API usage to tf.compat.v1.
- Remove old tf.contrib.layers and replace them with TF Slim symbols. Also check TF Addons for other tf.contrib symbols.
- Rewrite your TF1.x model forward passes to run in TF2 with eager execution enabled.
- Validate the accuracy and numerical correctness of your migrated code.
- Upgrade your training, evaluation and model saving code to TF2 equivalents.
- (Optional) Migrate your TF2-compatible tf.compat.v1 APIs including TF Slim usage to idiomatic TF2 APIs.
It's still not entirely clear to me under which circumstances the other steps apply...
It's still not entirely clear to me under which circumstances the other steps apply
Indeed, this is mainly what I also wonder about - how can we better assess the effort for migration to tf2? I suppose we have to make a start with migrating to tf2 at some point (also for @QURATOR-SPK), so what could be initial baby steps on the way. The issues with tf1 builds and Python 3.8 will probably also become worse over time. Maybe we can make a common effort between devs of the modules that require tf1, or plan that in for some time in the foreseeable future.
Agreed! So in the end we'll just have to start trying – and back every step of the way up by some tests at least.
BTW, for 3.8 we have a solution already (see #289, which uses nvidia-tensorflow prebuilds). But 3.9+ is certainly never going to be supported by anyone. And it also became apparent recently that more and more dependencies of TF1 now start conflicting with other packages: https://github.com/OCR-D/ocrd_all/blob/63d13e4879ed0854f0cd7fb43e6a3819b9f7af14/Makefile#L634-L636
we'll just have to start trying – and back every step of the way up
Exactly, those conflicts and workarounds require discussion, fixes, testing as well and thereby more and more time too. Maybe trying out with most simple means can help weighing the cost of a migration vs this better.
Pre-build packages of the different version of TF don't exist for every platform (for example not with Python 3.8, not for non-Intel hosts). A typical error message which terminates
make all
looks like this:I think it would be good to support builds on such platforms, too, by skipping problematic modules, without requiring hacker methods like
make all -k
. Possible solutions:make all
could detect whether required packages are available and skip modules which need unavailable packages.make all
could support a macro to disable groups of modules (like for example all modules which depend on TF1). It is already possible to define the desired modules, but that requires much knowledge about the different modules.The first one of these solution could be really user friendly, especially when it also shows an informative message.