OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

RFC: Debian/Ubuntu packaging of ocrd_all components and OCR models #130

Open kba opened 4 years ago

kba commented 4 years ago

Now that a solution to the conflicting dependency problem is imminent, we should discuss how we can reduce build times and simplify management of OCR models by supporting OS package management.

I see three areas where package management can improve ocrd_all:

  1. Providing packages for processors with full dependencies, e.g. with AppImage as @stweil proposed.
  2. Providing packages for compile-intensive components, i.e. tesseract and olena
  3. Packaging models, like the GT4HistOCR-based ones, for tesseract, calamari, ocropy and kraken

Ad 1.: The only way this can work without creating system-wide dependency conflicts would basically be a repackaging of the maximum Docker image. This is also of interest, and AppImage is probably a good solution.

Ad 2.: This would be relatively straightforward, since the scope is limited (tesseract and olena), @mikegerber has already built Debian/Ubuntu packages for olena, and @AlexanderP builds tesseract for a Launchpad PPA.

Ad 3.: For tesseract models we can take the official tesseract-ocr-* packages as a blueprint. ocropy and kraken models can also be packaged relatively easily. For calamari models, we should probably agree on a convention for where and how models are stored (ping @maxnth @andbue @chreul if you already have ideas/plans in that regard).
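To make the model-packaging idea concrete, here is a minimal sketch of staging a package tree for a single Tesseract model. Everything specific in it is an assumption for illustration, not an agreed convention: the `ocrd-model-tesseract-` name prefix, the maintainer string, and the tessdata path (chosen to mirror where the Ubuntu tesseract-ocr-* packages for Tesseract 4 put their files).

```shell
#!/bin/sh
# Sketch only: stage a Debian package tree for one Tesseract model.
# ASSUMPTIONS (not from this thread): package name prefix, maintainer,
# and the tessdata install path below.
set -eu

TESSDATA_DIR="usr/share/tesseract-ocr/4.00/tessdata"

stage_model_package() {
    model_file="$1"   # path to a .traineddata file
    version="$2"      # package version, e.g. 0.1
    stage_dir="$3"    # directory in which the package tree is created

    model_name="$(basename "$model_file" .traineddata)"
    # Debian package names are lowercase and must not contain underscores
    pkg_suffix="$(printf '%s' "$model_name" | tr 'A-Z_' 'a-z-')"
    pkg_name="ocrd-model-tesseract-$pkg_suffix"

    mkdir -p "$stage_dir/DEBIAN" "$stage_dir/$TESSDATA_DIR"
    cp "$model_file" "$stage_dir/$TESSDATA_DIR/"

    # Minimal control file; a real package would need a proper description
    cat > "$stage_dir/DEBIAN/control" <<EOF
Package: $pkg_name
Version: $version
Architecture: all
Maintainer: Example Maintainer <maintainer@example.org>
Depends: tesseract-ocr
Description: Tesseract OCR model $model_name
 Illustrative package layout only.
EOF
}
```

A staged tree like this could then be turned into a .deb with `dpkg-deb --build`, for example after `stage_model_package GT4HistOCR.traineddata 0.1 ./stage` (the file name here is a placeholder). Proper Debian tooling (debhelper, a `debian/` directory) would be the more maintainable route for anything beyond a proof of concept.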

The model packaging in particular would be of benefit also outside the OCR-D "ecosphere".

My questions for the ocrd_all users/developers:

  1. Which of the three approaches are worth exploring in your opinion?
  2. Who has experience in Debian/Ubuntu packaging and can help with setting up the tooling necessary?
  3. How should we distribute the models? PPA seems like a straightforward choice but only supports Ubuntu (?) not Debian. Another proposal was https://packagecloud.io. Or could we build a repository as a GitHub pages static site or use GitHub releases as a pseudo-repository?

Feedback and pointers to solutions are very welcome.

mikegerber commented 4 years ago

Quick-and-dirty recipe for an ocrd AppImage, to be built with pkg2appimage:

# Based on https://github.com/AppImage/pkg2appimage/blob/9249a99e653272416c8ee8f42cecdde12573ba3e/recipes/ProcDump.yml

app: ocrd

ingredients:
  dist: bionic
  sources:
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic main universe
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic-updates main universe
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic-security main universe
  packages:
    - python3.6-venv
  script:

script:
  - virtualenv --python=python3 usr
  - ./usr/bin/pip3 install ocrd
  - ./usr/bin/pip3 freeze | grep "^ocrd==" | cut -d "=" -f 3 > ../VERSION

  # XXX at least pkg2appimage needs a desktop file and an icon, might want to use something
  # else to build, but this is a POC, so...
  - mkdir -p usr/share/applications/
  - cat > usr/share/applications/ocrd.desktop <<\EOF
  - [Desktop Entry]
  - Name=ocrd
  - Exec=ocrd
  - Icon=ocrd
  - Comment=OCR-D core
  - Categories=Office;
  - Type=Application
  - Terminal=true
  - EOF
  - mkdir -p usr/share/icons/hicolor/512x512/apps/
  - touch usr/share/icons/hicolor/512x512/apps/ocrd.png # FIXME
  - cp usr/share/icons/hicolor/512x512/apps/ocrd.png .
  - cp usr/share/applications/ocrd.desktop .

This has some quirks like .desktop and the icon and the handling of the working directory, but it was pleasingly easy to build this:

% ~/devel/app-image-ocrd/out/ocrd-2.12.2.glibc2.3.3-x86_64.AppImage workspace -d /tmp/actevedef_718448162 get-id 
http://resolver.staatsbibliothek-berlin.de/SBB00008F1000000000

(ugly bagit.py error message removed)

mikegerber commented 4 years ago

My opinion(!) on this:

If OCR-D has everything either

  1. pip installable (for Python source)
  2. apt installable on Ubuntu LTS (everything else)
     a. OCR-D things not covered by pip
     b. binary dependencies like Olena or Tesseract

then - with a little experience - it is easy to build and maintain dependency-isolated AppImages or Docker containers. I would aim for this situation.

This way it's possible to:

  1. Just put an AppImage into /usr/local/bin and have a working processor
  2. If you prefer, you can still go wild and install everything "by hand"
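As a sketch of the first point, installing an AppImage amounts to copying it onto the PATH under the command name it should provide and marking it executable. The function name and its arguments are made up for illustration; only `install` itself is a standard tool.

```shell
#!/bin/sh
# Sketch only: install an AppImage as a command on the PATH.
set -eu

install_appimage() {
    appimage="$1"                    # path to the .AppImage file
    command_name="$2"                # name to install it as, e.g. ocrd
    bin_dir="${3:-/usr/local/bin}"   # target directory, defaults to /usr/local/bin

    mkdir -p "$bin_dir"
    # install(1) copies the file and sets the executable bit in one step
    install -m 0755 "$appimage" "$bin_dir/$command_name"
    # Print the installed path so callers can verify or log it
    printf '%s\n' "$bin_dir/$command_name"
}
```

For the AppImage built above this would be something like `sudo sh -c '. ./install_appimage.sh; install_appimage ocrd-2.12.2.glibc2.3.3-x86_64.AppImage ocrd'`, assuming the function is saved in a file called `install_appimage.sh` (a hypothetical name). Passing the command name explicitly avoids guessing it from the file name, which matters for processors whose names contain hyphens.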

Packaging everything into classical Ubuntu packages will produce the same Gordian knot of dependency problems as the original ocrd_all concept. (I call it Gordian knot because I am currently upgrading ocrd_calamari to TF2 and now need TF2.3 to solve some issues → I am sure some other processor will have issues with that.)

(There are some quirks with AppImage we should have a look at, but it looks really good.)

mikegerber commented 4 years ago

(My fat container approach https://travis-ci.org/github/mikegerber/my_ocrd_workflow has the same Gordian knot, I just include fewer processors.)

mikegerber commented 4 years ago

And you can then still stick an AppImage into an Ubuntu package. It's a bit perverse but easy to do.

(Needs a bit more work if you have e.g. a classical ocrd_olena package and then another one that includes everything as an AppImage.)