OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

Basic Releasemanagement #101

Closed M3ssman closed 4 years ago

M3ssman commented 4 years ago

As a user of the OCR-D tools, the easiest approach is to run them in containers.

For a production environment, release builds of the ocrd_all containers are a key requirement for stable workflows.

By now, the situation is as follows: if an OCR-D container is pulled one week later on a different system, it must be possible to pull exactly the same version as was pulled before on the first system. Without tagging the container images, this can't be guaranteed, since they are implicitly tagged as latest.

Any common CI/CD platform has the ability to run post-commit actions, and so does CircleCI.

M3ssman commented 4 years ago

The current container naming doesn't comply with common image naming conventions, as the part after the colon is usually used for the tag:

ocrd/all:maximum => vendor/product:tag => ocrd/maximum:2020.05.26

M3ssman commented 4 years ago

Another point that makes me wonder: why 6 image flavors? Having a small one and one really maximal one should do the job. Please keep in mind that more products bloat continuous delivery and need additional documentation, too.

stweil commented 4 years ago

As a user of the OCR-D tools, the easiest approach is to run them in containers.

That depends. If you run OCR-D on a limited number of hosts, building it locally might be the better choice; at least I almost never use containers for OCR-D. That allows easy local tagging of the installed release and parallel installations of different releases using virtual environments. For some hosts (ARM, PowerPC) no containers exist at all.

bertsky commented 4 years ago

Another point that makes me wonder: why 6 image flavors? Having a small one and one really maximal one should do the job.

I must object. We have 3x2 flavours of pre-built images precisely because different users have different needs. Every module pre-selection (mini/medi/maxi-mum) has a different trade-off between costs (disk size and network bandwidth) and benefits (processor and workflow options available). And there's an orthogonal distinction between binary-only ("wait for next update") and source+binary ("autonomous mini-update") delivery. This flexibility seems natural for a complex federated system like OCR-D.

Please keep in mind that more products bloat continuous delivery and need additional documentation, too.

All this has been solved for automatic continuous integration and deployment long ago. Even the documentation is in a respectable state nowadays. (If you think otherwise, please don't hesitate to open specific issues!)

If an OCR-D container is pulled one week later on a different system, it must be possible to pull exactly the same version as was pulled before on the first system.

I agree that there is a need for easier selection of container versions, or at least documentation of it. (We do have versioning already, which can give you reproducible installations; it's just not well known/visible.)

But what you are looking for is not changing the tag (comparable to a VCS branch), but the digest (which is like a VCS revision).

You can pull by digest and run by digest. To view digests (not normally shown), use docker images --digests. To pull/run, use the image@digest notation.
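
For illustration, a hedged sketch of that workflow, using the ocrd/all digest that appears in the inspect output below (any other digest works the same way):

```shell
# List local images together with their registry digests
docker images --digests ocrd/all

# Pin an exact build by pulling and running via the image@digest notation
docker pull ocrd/all@sha256:b39b378b17cb1118e7f95f41320abc10e3ced554860f26799ec8457524977c74
docker run --rm -it ocrd/all@sha256:b39b378b17cb1118e7f95f41320abc10e3ced554860f26799ec8457524977c74
```

These commands obviously need a running Docker daemon and registry access, so treat them as a usage sketch rather than a tested script.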

EDIT: Also, you can always inspect an image for its org.label-schema.vcs-ref and org.label-schema.build-date.

$ docker image inspect ocrd/all:maximum-git
[
    {
        "Id": "sha256:708f495984de66012cc5f46cf055eacd9d4a533a2f86118d9159e773dfe38c49",
        "RepoTags": [
            "ocrd/all:maximum-git"
        ],
        "RepoDigests": [
            "ocrd/all@sha256:b39b378b17cb1118e7f95f41320abc10e3ced554860f26799ec8457524977c74"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2020-05-14T20:37:13.172748238Z",
        "Container": "d4616dfa82f1a90c26bfbeee850f3f46836cba01c7e0ebeb066a5bcefee2b41b",
        "ContainerConfig": {
            "Hostname": "d4616dfa82f1",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=teletype",
                "PYTHONIOENCODING=utf8",
                "LC_ALL=C.UTF-8",
                "LANG=C.UTF-8",
                "PREFIX=/usr/local",
                "VIRTUAL_ENV=/usr/local",
                "OCRD_MODULES=cor-asv-ann core dinglehopper format-converters ocrd_anybaseocr ocrd_calamari ocrd_cis ocrd_fileformat ocrd_im6convert ocrd_keraslm ocrd_olena ocrd_pagetopdf ocrd_pc_segmentation ocrd_repair_inconsistencies ocrd_segment ocrd_tesserocr ocrd_typegroups_classifier sbb_textline_detector tesseract tesserocr workflow-configuration",
                "PIP_OPTIONS=--timeout=3000 -e"
            ],
            "Cmd": [
                "/bin/sh",
                "-c",
                "#(nop) ",
                "CMD [\"/bin/sh\" \"-c\" \"/bin/bash\"]"
            ],
            "ArgsEscaped": true,
            "Image": "sha256:987ce43cceaa1dbe4a741190b5fd4d61ef078c94d3b5372cc92076c70d29a2da",
            "Volumes": {
                "/data": {}
            },
            "WorkingDir": "/data",
            "Entrypoint": null,
            "OnBuild": [],
            "Labels": {
                "maintainer": "https://ocr-d.de/kontakt",
                "org.label-schema.build-date": "2020-05-14T20:07:32Z",
                "org.label-schema.vcs-ref": "fec3888",
                "org.label-schema.vcs-url": "https://github.com/OCR-D/ocrd_all"
            }
        },
        "DockerVersion": "17.09.0-ce",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=teletype",
                "PYTHONIOENCODING=utf8",
                "LC_ALL=C.UTF-8",
                "LANG=C.UTF-8",
                "PREFIX=/usr/local",
                "VIRTUAL_ENV=/usr/local",
                "OCRD_MODULES=cor-asv-ann core dinglehopper format-converters ocrd_anybaseocr ocrd_calamari ocrd_cis ocrd_fileformat ocrd_im6convert ocrd_keraslm ocrd_olena ocrd_pagetopdf ocrd_pc_segmentation ocrd_repair_inconsistencies ocrd_segment ocrd_tesserocr ocrd_typegroups_classifier sbb_textline_detector tesseract tesserocr workflow-configuration",
                "PIP_OPTIONS=--timeout=3000 -e"
            ],
            "Cmd": [
                "/bin/sh",
                "-c",
                "/bin/bash"
            ],
            "ArgsEscaped": true,
            "Image": "sha256:987ce43cceaa1dbe4a741190b5fd4d61ef078c94d3b5372cc92076c70d29a2da",
            "Volumes": {
                "/data": {}
            },
            "WorkingDir": "/data",
            "Entrypoint": null,
            "OnBuild": [],
            "Labels": {
                "maintainer": "https://ocr-d.de/kontakt",
                "org.label-schema.build-date": "2020-05-14T20:07:32Z",
                "org.label-schema.vcs-ref": "fec3888",
                "org.label-schema.vcs-url": "https://github.com/OCR-D/ocrd_all"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 7238296562,
        "VirtualSize": 7238296562,
        ...
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]
M3ssman commented 4 years ago

First, I want to point out that I do not question the actual possibility of reconstructing a specific version of the ocrd_all platform. From a "customer", i.e. an end-user, perspective it is just way too complex. It's not that I couldn't do it; I just don't want to take the time to do so when I know it can be done in CI/CD workflows.

Never send a human to do a machine's job...

I do believe the OCR-D platform can be part of digitisation workflows, and that public money should go to public projects. Wide adoption will be tied to its easy integration into existing workflows. When our director asks next time, "Why not keep existing services? ABBYY is working fine!", I need arguments like, "Because we can scale OCR-D platforms in no time; we just have to take care of OCR-D makefiles and integrations!". At the ULB, we are served by the university IT center, which wants virtually every service inside containers (and aims for Kubernetes in the long run).

Therefore we must find a balance between a rapidly evolving system and a stable platform that adds great benefits to existing workflows.

From an end-user perspective it would even be sufficient to have only a single container image to take care of. If we go for OCR, we want the best OCR possible, don't we? Cropping (with anybaseocr) is suggested everywhere, so why should one go for a distribution that lacks this ability? Using only tesserocr's segmentation has been a bad idea (at least that is what I have been advised by OCR-Dians several times). So what good is the minimum distribution to a real-world library? Everybody who really needs a special flavor of ocrd_all is free to clone the project locally and the submodules as required. You don't need an official container image for this, and, arguing further in this direction, you don't need an official distribution at all, because you could build your own OCR-D platform.

Again, I do not want to disrespect all the hard work that went into the system and the documentation. Nor do I question the complex federated system of OCR-D. But from time to time it's worth taking a break and doing some kind of retrospective. Which distributions are used by other libraries so far? Until today, ocrd_all has been pulled 1.4k times altogether from DockerHub (including my own 100+ pulls ;)), but how many times was which distribution pulled?

@stweil Seriously, who really wants to administer an OCR system on ARM or PPC?

@bertsky Regarding documentation, I had this tabular overview in mind: https://ocr-d.de/en/setup#ocrd_all-via-docker If its contents are generated, all is fine. If somebody has to maintain these listings (which frightened me at first sight), it's bad.

Further, maximal flexibility is IMHO the core concern of the ocrd_all repository as it is. A container is a closed package for software distribution. Including Git repositories in a container image to ease potential changes thwarts this philosophy.

Yes, I could pull by digest, but how do I know the digest of the pre-pre-pre-...-distribution when my recent pull just kicked out the pre-pre-pre-...-distribution? Does the digest come from Docker or from the revision system? I cannot infer the image digest from the commit SHA that triggered the CircleCI action to build the container image. And seriously, nobody can advise end-users to track a list of digests on their own.

Please take another look at DockerHub. Inspect some of the great, popular distributions for web servers, operating systems, databases and the like. How are they using tags? How do they express the relations between vendor/company, application/product/service and versioning?

stweil commented 4 years ago

@stweil Seriously, who really wants to administer an OCR system on ARM or PPC?

That's not the point. If you want to use OCR for huge amounts of pages, you might consider tuning your environment to improve performance by 10 to 20 %. That requires local optimisations which you will never get from containers, which must fit everywhere.

bertsky commented 4 years ago

@M3ssman I really don't want to get involved with all the "customer cosmology" discussion here, so I'll be brief:

from an end-user perspective it is just way too complex

Reading which of the 6 options suits you best once and for all can hardly be called complex!

From an end-user perspective it would even be sufficient to have only a single container image to take care of. [...] So what good is the minimum distribution to a real-world library? Everybody who really needs a special flavor of ocrd_all is free to clone the project locally and the submodules as required. You don't need an official container image for this

You are forgetting that there are many different kinds of users out there, and Docker images are not just used directly, they are also the base for further distributions.

What is the point of this complaint (if not "complexity")? There's virtually no cost for you involved in the current scheme (since as you said all building/CI is automatic).

A container is a closed package for software distribution. Including Git repositories in a container image to ease potential changes thwarts this philosophy.

How? I just don't see your problem with that.

(I personally have a project which ships with a Docker image based on :maximum-git. Whenever I need to make small fixes or adjustments, I don't have to waste time and bandwidth: I just add small adjustment layers with a few RUN statements and some git/pip magic. This effectively speeds up my deployment cycle by 2 orders of magnitude!)
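
A minimal sketch of what such a derived image can look like; the submodule path and commit here are hypothetical placeholders, not taken from any real project:

```dockerfile
# Build on the source+binary flavour, so small fixes can be applied
# in-place instead of rebuilding all of ocrd_all from scratch.
FROM ocrd/all:maximum-git

# Hypothetical adjustment layer: check out a fix in one submodule and
# reinstall just that module (path and commit are illustrative only).
RUN cd /build/ocrd_tesserocr \
 && git fetch origin \
 && git checkout abc1234 \
 && pip install .
```

Because only the final layer changes, rebuilds and pushes of such a derived image are fast and cheap.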

Yes, I could pull by digest, but how do I know the digest of the pre-pre-pre-...-distribution when my recent pull just kicked the pre-pre-pre-...-distribution?

It never "kicks out" anything automatically. Unless you ran docker image prune, you still have the older versions available on your system to run.

To identify digests in retrospect, just list all of them, look at the dates, and pick the one from the date range that your respective result files and logs show. In the future, manage the digests as part of your local installation/distribution.
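
A hedged sketch of such a retrospective listing, using docker's Go-template --format syntax (the placeholders are standard docker images fields):

```shell
# Show repository, digest and creation date side by side,
# so a digest can be matched against the dates in old logs
docker images --digests \
  --format 'table {{.Repository}}\t{{.Digest}}\t{{.CreatedAt}}' ocrd/all
```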

Does the digest come from Docker or from the revision system?

From Docker. It's basically a reproducible hash of all the layers' contents (much like revisions are for a VCS). You can read about it in the documentation.
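
The principle can be illustrated without Docker at all; this toy sketch (my own illustration, not Docker internals) shows why content hashing makes a digest a stable pin:

```shell
# Two files with identical bytes always produce the same SHA-256,
# just as identical image layers always yield the same digest.
tmpdir=$(mktemp -d)
printf 'layer contents' > "$tmpdir/a"
printf 'layer contents' > "$tmpdir/b"
digest1=$(sha256sum "$tmpdir/a" | cut -d' ' -f1)
digest2=$(sha256sum "$tmpdir/b" | cut -d' ' -f1)
[ "$digest1" = "$digest2" ] && echo "same content, same digest"
```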

I cannot infer the image digest from the commit SHA that triggered the CircleCI action to build the container image

No, you cannot. But why would you want that? You can always go in the other direction.

And seriously, nobody can advise end-users to track a list of digests on their own.

What is an end user in this context? Any person responsible for a local OCR-D deployment will have to deal with some level of complexity. And if you need more than "most recent version", then yes, you'll have to track digests; that's true for any Docker-deployed system.

Inspect some of the great, popular distributions for web servers, operating systems, databases and the like. How are they using tags? How do they express the relations between vendor/company, application/product/service and versioning?

Again, OCR-D is not a single application/product/service, it's a federated system to integrate (ideally) any OCR/OLR related tool out there into a single productive framework. You can easily build a system tailored for some institution/use-case/purpose from it (including version tags).

cneud commented 4 years ago

Again, OCR-D is not a single application/product/service, it's a federated system to integrate (ideally) any OCR/OLR related tool out there into a single productive framework.

From the viewpoint of the OCR-D coordination project, I have to object to this notion. While we certainly aim - especially thanks to community contributors like @bertsky @stweil - for the greatest possible flexibility and choice of OCR configurations/workflows for specific use cases, it must not be forgotten that the main sponsor and the corresponding funding calls specifically demand an OCR system/product that is fit for mass processing of the VD collections and that can be deployed with minimum effort by libraries. As we have seen in the pilot phase, the customization options offered in OCR-D overstrain users, potentially harming adoption. It is, for example, my view that OCR-D SHOULD provide a robust core product/service, which is also a major objective of the next phase. Of course, this does not exclude other configuration/deployment options, but the main focus must be on a solution for the VD that ideally works like an off-the-shelf product.

And about the release management issue - we, i.e. the OCR-D coordination project, believe a compromise for basic release management of the containers can be found here as well.

bertsky commented 4 years ago

@cneud what does ocrd_all container tagging have to do with configurations/workflows? This issue is about the limited flexibility of having 6 rolling-release Docker versions here – in a component encapsulating all modules' installation and mutual consistency. Your statements about sponsor objectives and library adoption make it sound like ocrd_all is the cause of (or obstacle to) the (much) larger problem of finding good combinations of tools and parameters, or complexity in general. Surely there's still a lot of work to be done to facilitate adoption in libraries, managing data-workflow dependencies, and controlling computation. But ocrd_all is just an adjunct to that, an enabler. It would simply overstrain this repo to try to solve all usability problems here. Just look at the open issues for what is still left to do just for that simple task.

Perhaps what you had in mind is something along the lines of ocrd_framework – a derived repo/image with added user interfaces that has version tags on it (and gets new releases more slowly)?

...we believe a compromise for basic release management of the containers can be found here as well.

Docker containers come with built-in revision numbers (digests, see above). The alternative would be to use tags, giving up the current scheme. What compromise do you have in mind?

M3ssman commented 4 years ago

Actually, I'm afraid we're at cross-purposes.

The OCR-D tools are very agile and fast-moving, which is fine regarding future research and development. They offer a great toolbox to fit a broad range of OCR systems and combinations and to please a wide range of scientific stakeholders.

And then there are people from library IT staff (like me) who fail to communicate their requirements. One reason for this is the

broad range of OCR-systems and combinations

Right now there are even three ways (programmatically, Makefile, Taverna) to use the OCR-D tools.

Taking a step back to re-focus, here is what we need at the ULB:

I'm totally aware that OCR-D cannot provide a proper drop-in solution for every library in the country, especially if these institutions struggle to fix and formulate their requirements. We must do this ourselves.

@kba Is the changelog from #104 generated?

It's a step in the right direction (IMHO), but maybe it is way too detailed. Somebody like me, who hasn't followed the discussions and issues for all the tools, cannot make much of revision SHAs. I believe it would be helpful to concentrate on announcing breaking changes and keeping track of new features.

wrznr commented 4 years ago

@cneud I'd like to add some thoughts to two points of your argumentation

the customization options offered in OCR-D overstrain users

I am not sure that this was the problem. If you consider the lack of documentation, in particular the then-missing workflow guidelines, it is not very surprising that users felt overwhelmed by the sheer number of options available for each processing step. (As a part-time insider) I did not feel overstrained even for a second. And we should be honest with ourselves and with our funders: OCR for historical materials is a complex task. An in-depth examination of the methods and tools available will always be a requirement for high-quality results. Which leads me to my second point:

but the main focus must be on a solution for the VD that ideally works similar to an off-the-shelf product

While the first point is undoubted, the second is highly (doubted). We should not aim for an open-source FineReader clone; let the Transkribus people do that. Let us accompany our main focus with one common target: the best possible text quality which can be achieved with the tools available.

@M3ssman I understand that your initial focus was a somewhat different question which has much to do with the various deployment options. I totally agree that matters should be simplified here. IMHO we could simply drop Taverna: nobody uses it productively anyway, while the Makefile invocation has been successfully used for mass digitization.

we have a Tesseract-centered Workflow. Therefore we need a Tesseract-centered OCR-platform

It is already there, right? OCR-D for me is very much Tesseract-centered (which is a good thing!).

format-conversion to please our alto/xml presentation layer

It is already there, right? Via https://github.com/OCR-D/ocrd_fileformat you can easily generate ALTO.
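
If I remember the CLI correctly (hedged: please check the ocrd_fileformat README for the exact parameter names and value syntax), a PAGE-to-ALTO conversion inside a workspace looks roughly like:

```shell
# Assumed invocation sketch: transform the PAGE-XML in file group
# OCR-D-OCR into ALTO in a new file group OCR-D-ALTO
ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO -P from-to "page alto"
```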

some sort of evaluation tool included

We are all very aware of that desideratum. Fingers crossed for stage 3 of the OCR-D funding.

stable versioning that follows common docker conventions

If this really helps convince others to use OCR-D, why not. However, we will have to be kleinteilig ("fine-grained") for a long time, namely throughout the whole 3rd stage. The dynamics are, as you stated above, very high. And please remember: according to the OCR-D road map, a stable productive system is the objective of the upcoming OCR-D funding (if at all; the call is somewhat vague here).

M3ssman commented 4 years ago

As a long-term strategy, I suggest supporting only one official distribution and a single workflow. IMHO, from a librarian user's point of view, having containers with the Bertsky workflow is the best way.

Any kind of customization requires additional skills in the whole OCR-D ecosystem. Everybody who's missing something is free to build a custom release based on ocrd_all anytime, since it is publicly accessible.

@bertsky Just out of curiosity: where do the 3 image flavors originate? And why the additional version with git?

@kba Are more detailed download statistics available from Dockerhub besides the ones regarding image names? Maybe it's possible to deduce what is really needed from what's actually used.

@wrznr Thanks for pointing out ocrd_fileformat!

bertsky commented 4 years ago

stable versioning that follows common docker conventions

If this really helps convince others to use OCR-D, why not. However, we will have to be kleinteilig for a long time, namely throughout the whole 3rd stage. The dynamics are, as you stated above, very high. And please remember: according to the OCR-D road map, a stable productive system is the objective of the upcoming OCR-D funding (if at all; the call is somewhat vague here). As a long-term strategy, I suggest supporting only one official distribution and a single workflow.

I have pointed this out above already, but no-one seems to have noticed: there is no need to abandon one for the other. We can have both a low-threshold, short-cycled, all-purpose ocrd/all with rolling releases and a new, derived repo (or at least Docker image) with proper version tags, release notes, and possibly added usability value for libraries, esp. first-time adopters.

Everybody who's missing something is free to build a custom release based on ocrd_all anytime, since it is publicly accessible.

Exactly. That's why ocrd/all itself should be the most general distribution! It's a big difference whether I have to build the images myself each time (which costs a lot of network bandwidth for both downloads and uploads) or can just build upon the automatically produced ocrd/all images (with nearly no delay).

As I said above, my own projects already depend on this build workflow. And I cannot just say to my users/partners/sponsors, "hey, just trust me, I will personally keep building images in the future": of course they want a single permanent maintainer like OCR-D. It would be a vast loss of opportunity to suddenly exclude other use cases/users here for no good reason. The more projects and users this can attract, the better for all involved. Not everyone wants to develop just for a (certain kind of) library.

Just out of curiosity: where do the 3 image flavors originate? And why the additional version with git?

@M3ssman I already told you in my very first reply. I introduced them when I made the Docker fat-container solution, and there was a consensus on the tagging scheme.

M3ssman commented 4 years ago

I support @bertsky's proposal to aim for a new Docker image or repository which holds the image configuration, and not to break existing stuff.

So the next question arises: which modules would you like to see inside such a distribution? As a starting point, these are the modules we currently evaluate (and plan to use)

?

kba commented 4 years ago

@kba Are more detailed download statistics available from Dockerhub besides the ones regarding image names? Maybe it's possible to deduce what is really needed from what's actually used.

I'm afraid not. I can only see the aggregate downloads for all tags (see attached screenshot).

@kba Is the changelog from #104 generated?

No, that was created manually. I wanted to see whether GitHub could render the commit short-ids as links to the commits of the particular submodule. Unfortunately, it does not.

As for the original issue: How about we:

This would allow casual users to be notified of new "releases" and could largely be automated without any changes to the existing setup (and IMHO the min/medi/max, git/non-git scheme is very sensible). @stweil @wrznr @cneud @bertsky @M3ssman Can you live with such an extension to the current setup? If so, I could start adapting the CircleCI script and a script to generate the CHANGELOG, date tag and GitHub release.
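
For the date-tag part, a hedged sketch of what such a post-build CI step could look like (image names follow the existing scheme; the exact CircleCI wiring is left out):

```shell
# After a successful build, add a date tag next to the rolling tag,
# so that older builds stay pullable by name
DATE_TAG=$(date +%Y-%m-%d)
docker tag ocrd/all:maximum "ocrd/all:maximum-$DATE_TAG"
docker push "ocrd/all:maximum-$DATE_TAG"
```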

This does not mean that there cannot be an additional "even more batteries included" variant/fork/enhancement of ocrd_all, maybe also including models for recognition and sample data...

M3ssman commented 4 years ago

@kba Yes, sounds great! But I'm afraid the medium distribution misses the cropping from anybaseocr, therefore I'd like to go for maximum, please.

To reduce additional workload, it's sufficient to trigger new builds for new git tags only and to enhance the tag labels as suggested.