OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules

Remove headless-torch14 and headless-tf2 #314

Closed · stweil closed this pull request 2 years ago

stweil commented 2 years ago

The docker image size is reduced from 32.8 GB to 29.9 GB. I also expect a reduced build time.

stweil commented 2 years ago

Besides the main virtual environment, only a single second virtual environment, headless-tf1, is now used for all submodules which require TensorFlow 1. These additional modules were also moved into headless-tf1:

stweil commented 2 years ago

A test run with docker-maximum-cuda-git just finished successfully, see build log.

stweil commented 2 years ago

CircleCI now finished successfully in 38:15 min. Previous CI runs for pull requests used more than 42 min, so it looks like the build time was reduced by about 10 %.

stweil commented 2 years ago

@bertsky, @kba, do you need more time for the review or can I merge this PR?

bertsky commented 2 years ago

> - ocrd_detectron2. It requires detectron2 from NVIDIA which uses a special version of torch from the same source. A local build on our OCR server tries to install CUDA_VERSION 11.2 which does not exist, so the build fails unless 11.3 is required explicitly. Should ocrd_all enforce CUDA_VERSION 11.3? Would it be possible/desired to use the same torch for all submodules?

First of all, Detectron2 is from Facebook Research, not Nvidia. Also, ocrd_detectron2 does not install a non-standard Pytorch distribution, but tries to fetch the wheels matching the system's CUDA version from Pytorch.org. (That's just a soft dependency – if no matching wheels are found there, then pip will fetch whatever it deems best from PyPI.)

Note also that this is not by accident: just as with TF, you must get a Pytorch matching your system's CUDA version, otherwise it won't run (not even in CPU mode). Because we want to support a wide range of systems (i.e. of CUDA versions, as well as CPU-only), we invest some effort here to determine the current CUDA version.
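
For illustration, a minimal sketch of what such CUDA-aware wheel selection with pip can look like (the detection logic and wheel indexes actually used by ocrd_detectron2 may differ; the torch 1.10 / cu113 combination below is an assumption):

```sh
# Determine the CUDA version the toolkit reports, e.g. "113" for CUDA 11.3
# (assumes nvcc is installed; nvidia-smi could be parsed instead).
CUDA=$(nvcc --version | sed -n 's/.*release \([0-9]*\)\.\([0-9]*\).*/\1\2/p')

# Ask pip for torch wheels matching that CUDA version ...
pip install torch --extra-index-url "https://download.pytorch.org/whl/cu${CUDA}"

# ... and for Detectron2 wheels built against that torch/CUDA combination.
pip install detectron2 -f "https://dl.fbaipublicfiles.com/detectron2/wheels/cu${CUDA}/torch1.10/index.html"
```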

Lastly, wheels for Detectron2 depend on both the Pytorch and CUDA version. There simply is no PyPI release at all (by design). Building from source seems a waste of resources to me – but I might add that as a fallback option to ocrd_detectron2's Makefile in the future.
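
Such a fallback could be as simple as the one-liner from Detectron2's own install instructions (not part of ocrd_detectron2's Makefile at this point; shown here only to illustrate the option):

```sh
# Build Detectron2 from source; needs a matching torch already installed,
# plus a C++ toolchain (and the CUDA toolkit for GPU support).
pip install 'git+https://github.com/facebookresearch/detectron2.git'
```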

Regarding your build failure, please open an issue with ocrd_detectron2, providing details like logs.

stweil commented 2 years ago

> First of all, Detectron2 is from Facebook Research, not Nvidia. Also, ocrd_detectron2 does not install a non-standard Pytorch distribution, but tries to fetch the wheels matching the system's CUDA version from Pytorch.org. (That's just a soft dependency – if no matching wheels are found there, then pip will fetch whatever it deems best from PyPI.)

... and will fail because of a missing shared NVIDIA library as soon as you run the executable. The installation of ocrd_detectron2 sets additional search paths for pip and gets not only detectron2 but also torch (and other modules) from there. A standard torch can be installed by other OCR-D processors, but that breaks ocrd_detectron2 (see the GitHub Actions build log). Thanks for the hint about Facebook Research; I fixed my comment above.
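
As a purely diagnostic sketch (not part of any Makefile), this is one way to check whether the torch that ended up in a venv is a CUDA build and whether its NVIDIA libraries resolve; the libtorch_cuda.so path is an assumption that holds for recent CUDA wheels:

```sh
# Installed torch version and the CUDA version it was built for (None = CPU-only build).
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# List any unresolved shared libraries of torch's CUDA backend.
ldd "$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cuda.so"))')" | grep 'not found' || true
```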

> Note also that this is not by accident: just as with TF, you must get a Pytorch matching your system's CUDA version, otherwise it won't run (not even in CPU mode). Because we want to support a wide range of systems (i.e. of CUDA versions, as well as CPU-only), we invest some effort here to determine the current CUDA version.

> Regarding your build failure, please open an issue with ocrd_detectron2, providing details like logs.

Then it is impossible to use ocrd_detectron2 with the NVIDIA libraries provided by Debian / Ubuntu, because they have CUDA_VERSION 11.2, which is unsupported by Facebook Research. This looks like the beginning of a new nightmare similar to the TF 1 one, although here there is at least a possible fallback to CPU (not so nice). You already noticed that yourself in https://github.com/bertsky/ocrd_detectron2/issues/7. I have now added a comment there about the build failure with CUDA version 11.2.
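
To see up front which CUDA version the distribution packages actually provide (and hence which wheel combination can work at all), something like the following helps; the package name is Debian/Ubuntu specific:

```sh
# Version of the packaged CUDA toolkit on Debian/Ubuntu ...
dpkg -s nvidia-cuda-toolkit 2>/dev/null | grep '^Version'
# ... and the version the compiler itself reports, e.g. "release 11.2".
nvcc --version | grep -i release
```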

stweil commented 2 years ago

> Also, I think the clean rule needs an update. Especially since the paths have been renamed a few times recently. (So the update must include both the old defaults and the current variables.)

This is unrelated to the PR here.

We already discussed whether the clean target should also handle former defaults, and we obviously have different opinions. I don't think it is necessary to handle old defaults, and I vote against blowing up the Makefile with rules for them. There is always a very simple solution for a real clean: just remove everything and make a fresh build.

bertsky commented 2 years ago

> > Also, I think the clean rule needs an update. Especially since the paths have been renamed a few times recently. (So the update must include both the old defaults and the current variables.)

> This is unrelated to the PR here.

It is perfectly related. You change venvs here. Thus,

  1. the rule no longer cleans the current VIRTUAL_ENV.
  2. it no longer catches the state which users had up to this point.

> We already discussed whether the clean target should also handle former defaults, and we obviously have different opinions.

I don't recall you being opposed to that notion. On what grounds? Obviously, we previously decided to do it this way nevertheless.

> I don't think it is necessary to handle old defaults, and I vote against blowing up the Makefile with rules for them.

You can hardly call a few more shell globs "blowing up the Makefile".
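
For scale, the addition under discussion essentially boils down to a few extra paths in the clean recipe, roughly like this (the directory names below are placeholders, not the actual old and new defaults):

```sh
# Remove the current venv plus a couple of former default locations
# (names here stand in for whatever the old defaults were).
rm -fr "$VIRTUAL_ENV" venv/headless-tf* venv/sub-venv/headless-*
```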

> There is always a very simple solution for a real clean: just remove everything and make a fresh build.

The point is simply that users do not know what to delete, or forget about details. True even for advanced users (you cannot think of everything all the time).

stweil commented 2 years ago

> It is perfectly related. You change venvs here.

I removed two virtual environments. make clean still removes any existing venv (including the old ones).