R0Wi-DEV / workflow_ocr

This is a Nextcloud Workflow App which enables you to process files via OCR on serverside.
GNU Affero General Public License v3.0

WORKFLOW OCR stopped working NC26 after latest upgrade #202

Closed lukaszzyla closed 1 year ago

lukaszzyla commented 1 year ago

I have built a Docker image based on nextcloud26:fpm with Tesseract in a few languages and ocrmypdf built in. It used to work flawlessly, but after the last update I only keep seeing:

Error workflow_ocr OCR for file /Lukasz/files/WZ THOMAS.pdf not possible. Message: OCRmyPDF did not produce any output   2023-04-13T08:10:00+0200
Warning workflow_ocr OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found,

The user www-data can run ocrmypdf from the command line inside the container, and it generates output. I replicated this on another instance with a brand-new install.

R0Wi commented 1 year ago

Usually a php fpm setup consists of two containers, right? One webserver container and the fpm container itself, where the php code is executed. Are you sure that the command ocrmypdf is available inside of the fpm container for the user running the fpm process?

And could you share your Docker setup so that we'd be able to reproduce this issue, including docker-compose and/or image names, Dockerfile etc.?

lukaszzyla commented 1 year ago

Hi, thank you for the quick reply. I have been trying to get this working for 2 days already. I started many new containers. I even tried downgrading to the latest working version, 1.26.0, but even after downgrading the issue persists. It is strange because it was working fine before.

This is my Portainer stack for one of the new instances. I even tried installing a new server on this setup, but the situation is exactly the same. The Docker image with Tesseract and ocrmypdf installed is nextcloud-fpm-custom, and this is the container that is able to run ocrmypdf as www-data from inside.

Everything is behind an nginx proxy, thus SOMELOCALIP is a trusted proxy (nginx) on the same host as the Docker containers running NC.

Portainer stack below:

version: '2'

services:
  nc_rentrans_db:
    image: mariadb:10.5
    restart: always
    command: --transaction-isolation=READ-COMMITTED --binlog-format=ROW
    volumes:

networks:
  nginx-pm_default:
    external: true

---------------end of stack-----------

below /app/config/nginx.conf:

worker_processes auto;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
include /etc/nginx/mime.types;
default_type application/octet-stream;

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

access_log  /var/log/nginx/access.log  main;

sendfile        on;
#tcp_nopush     on;

# Prevent nginx HTTP Server Detection
server_tokens   off;

keepalive_timeout  65;

#gzip  on;

upstream php-handler {
    server app:9000;
}

server {
    listen 80;

    # HSTS settings
    # WARNING: Only add the preload option once you read about
    # the consequences in https://hstspreload.org/. This option
    # will add the domain to a hardcoded list that is shipped
    # in all major browsers and getting removed from this list
    # could take several months.
    add_header Strict-Transport-Security "max-age=15768000; includeSubDomains; preload;" always;

    # set max upload size
    client_max_body_size 512M;
    fastcgi_buffers 64 4K;

    # Enable gzip but do not remove ETag headers
    gzip on;
    gzip_vary on;
    gzip_comp_level 4;
    gzip_min_length 256;
    gzip_proxied expired no-cache no-store private no_last_modified no_etag auth;
    gzip_types application/atom+xml application/javascript application/json application/ld+json application/manifest+json application/rss+xml application/vnd.geo+json application/vnd.ms-fontobject application/x-font-ttf application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/bmp image/svg+xml image/x-icon text/cache-manifest text/css text/plain text/vcard text/vnd.rim.location.xloc text/vtt text/x-component text/x-cross-domain-policy;

    # Pagespeed is not supported by Nextcloud, so if your server is built
    # with the `ngx_pagespeed` module, uncomment this line to disable it.
    #pagespeed off;

    # HTTP response headers borrowed from Nextcloud `.htaccess`
    add_header Referrer-Policy                      "no-referrer"   always;
    add_header X-Content-Type-Options               "nosniff"       always;
    add_header X-Download-Options                   "noopen"        always;
    add_header X-Frame-Options                      "SAMEORIGIN"    always;
    add_header X-Permitted-Cross-Domain-Policies    "none"          always;
    add_header X-Robots-Tag                         "noindex,nofollow"          always;
    add_header X-XSS-Protection                     "1; mode=block" always;

    # Remove X-Powered-By, which is an information leak
    fastcgi_hide_header X-Powered-By;

    # Path to the root of your installation
    root /var/www/html;

    # Specify how to handle directories -- specifying `/index.php$request_uri`
    # here as the fallback means that Nginx always exhibits the desired behaviour
    # when a client requests a path that corresponds to a directory that exists
    # on the server. In particular, if that directory contains an index.php file,
    # that file is correctly served; if it doesn't, then the request is passed to
    # the front-end controller. This consistent behaviour means that we don't need
    # to specify custom rules for certain paths (e.g. images and other assets,
    # `/updater`, `/ocm-provider`, `/ocs-provider`), and thus
    # `try_files $uri $uri/ /index.php$request_uri`
    # always provides the desired behaviour.
    index index.php index.html /index.php$request_uri;

    # Rule borrowed from `.htaccess` to handle Microsoft DAV clients
    location = / {
        if ( $http_user_agent ~ ^DavClnt ) {
            return 302 /remote.php/webdav/$is_args$args;
        }
    }

    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }

    # Make a regex exception for `/.well-known` so that clients can still
    # access it despite the existence of the regex rule
    # `location ~ /(\.|autotest|...)` which would otherwise handle requests
    # for `/.well-known`.
    location ^~ /.well-known {
        # The rules in this block are an adaptation of the rules
        # in `.htaccess` that concern `/.well-known`.

        location = /.well-known/carddav { return 301 https://SOMEHOST/remote.php/dav/; }
        location = /.well-known/caldav  { return 301 https://SOMEHOST/remote.php/dav/; }

        location /.well-known/acme-challenge    { try_files $uri $uri/ =404; }
        location /.well-known/pki-validation    { try_files $uri $uri/ =404; }

        # Let Nextcloud's API for `/.well-known` URIs handle all other
        # requests by passing them to the front-end controller.
        return 301 https://SOMEHOST/index.php$request_uri;
    }

    # Rules borrowed from `.htaccess` to hide certain paths from clients
    location ~ ^/(?:build|tests|config|lib|3rdparty|templates|data)(?:$|/)  { return 404; }
    location ~ ^/(?:\.|autotest|occ|issue|indie|db_|console)                { return 404; }

    # Ensure this block, which passes PHP files to the PHP process, is above the blocks
    # which handle static assets (as seen below). If this block is not declared first,
    # then Nginx will encounter an infinite rewriting loop when it prepends `/index.php`
    # to the URI, resulting in a HTTP 500 error response.
    location ~ \.php(?:$|/) {
        # Required for legacy support
        rewrite ^/(?!index|remote|public|cron|core\/ajax\/update|status|ocs\/v[12]|updater\/.+|oc[ms]-provider\/.+|.+\/richdocumentscode\/proxy) /index.php$request_uri;

        fastcgi_split_path_info ^(.+?\.php)(/.*)$;
        set $path_info $fastcgi_path_info;

        try_files $fastcgi_script_name =404;

        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param PATH_INFO $path_info;
        fastcgi_param HTTPS on;

        fastcgi_param modHeadersAvailable true;         # Avoid sending the security headers twice
        fastcgi_param front_controller_active true;     # Enable pretty urls
        fastcgi_pass php-handler;

        fastcgi_intercept_errors on;
        fastcgi_request_buffering off;
    }

    location ~ \.(?:css|js|svg|gif)$ {
        try_files $uri /index.php$request_uri;
        expires 6M;         # Cache-Control policy borrowed from `.htaccess`
        access_log off;     # Optional: Don't log access to assets
    }

    location ~ \.woff2?$ {
        try_files $uri /index.php$request_uri;
        expires 7d;         # Cache-Control policy borrowed from `.htaccess`
        access_log off;     # Optional: Don't log access to assets
    }

    # Rule borrowed from `.htaccess`
    location /remote {
        return 301 https://SOMEHOST/remote.php$request_uri;
    }

    location / {
        try_files $uri $uri/ /index.php$request_uri;
    }
}

}

------------end of nginx.conf----------------

below config.php:

<?php
$CONFIG = array (
  'memcache.local' => '\OC\Memcache\APCu',
  'apps_paths' =>
  array (
    0 =>
    array (
      'path' => '/var/www/html/apps',
      'url' => '/apps',
      'writable' => false,
    ),
    1 =>
    array (
      'path' => '/var/www/html/custom_apps',
      'url' => '/custom_apps',
      'writable' => true,
    ),
  ),
  'instanceid' => '...id...',
  'passwordsalt' => '...salt...',
  'secret' => '...secret...',
  'trusted_domains' =>
  array (
    0 => 'SOMELOCALIP',
    1 => 'SOMEHOST',
  ),
  'allow_local_remote_servers' => true,
  'datadirectory' => '/var/www/html/data',
  'dbtype' => 'mysql',
  'version' => '26.0.0.11',
  'trusted_proxies' => 'SOMELOCALIP',
  'overwrite.cli.url' => 'https://SOMEHOST',
  'overwriteprotocol' => 'https',
  'dbname' => 'nextcloud',
  'dbhost' => 'nextclouddb',
  'dbport' => '',
  'dbtableprefix' => 'oc',
  'mysql.utf8mb4' => true,
  'dbuser' => '.......',
  'dbpassword' => '......',
  'installed' => true,
  'mail_smtpmode' => 'smtp',
  'mail_smtpsecure' => 'ssl',
  'mail_sendmailmode' => 'smtp',
  'mail_from_address' => '.........',
  'mail_domain' => 'gmail.com',
  'mail_smtpauthtype' => 'LOGIN',
  'mail_smtpauth' => 1,
  'mail_smtphost' => 'smtp.gmail.com',
  'mail_smtpport' => '465',
  'mail_smtpname' => '.....@gmail.com',
  'mail_smtppassword' => '.....',
  'default_phone_region' => 'PL',
  'ldapProviderFactory' => 'OCA\User_LDAP\LDAPProviderFactory',
  'maintenance' => false,
  'theme' => '',
  'loglevel' => 2,
  'app_install_overwrite' =>
  array (
    0 => 'pdfdraw',
    1 => 'ocjobs',
    2 => 'files_fulltextsearch_tesseract',
    3 => 'files_fulltextsearch',
    4 => 'fulltextsearch',
    5 => 'fulltextsearch_elasticsearch',
    6 => 'richdocumentscode',
  ),
);

R0Wi commented 1 year ago

This could be an issue:

# ...
cron:
    image: nextcloud:fpm
    restart: always
# ...

Since the OCR processing is done asynchronously via the Nextcloud cron engine, the process that executes cron.php has to be able to access ocrmypdf. The default image nextcloud:fpm has no ocrmypdf installed. I assume you've installed ocrmypdf inside your custom image nextcloud-fpm-custom, so you could just try replacing image: nextcloud:fpm with image: nextcloud-fpm-custom. Please also make sure your custom image is capable of executing the cron via entrypoint /cron.sh.
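One way to check this from inside a container is sketched below. It assumes a POSIX shell is available there; the helper name `require_tool` is hypothetical and only wraps the standard `command -v` builtin.

```shell
#!/bin/sh
# require_tool NAME: succeed if NAME resolves to a command on the current PATH.
# (Hypothetical helper name, not part of workflow_ocr or Nextcloud.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        # Report where the shell found the command.
        echo "$1: found at $(command -v "$1")"
    else
        # Same situation as the "sh: 1: ocrmypdf: not found" warning in the log.
        echo "$1: not found on PATH" >&2
        return 1
    fi
}
```

Calling `require_tool ocrmypdf` as www-data in the container that runs cron.php (for example after opening a shell there with something like `docker compose exec -u www-data cron sh`, assuming the service is named `cron` as in the snippet above) should print the path on a working setup and fail on the stock nextcloud:fpm image.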

Sidenote: If you're looking for a prebuilt extended Docker image for Nextcloud, https://github.com/R0Wi/nextcloud-docker-extended might be interesting for you. Currently I'm only supporting the Apache-based images, but we could add the fpm ones, too. The images are automatically updated once a day by fetching the appropriate upstream.

Looking forward to hearing your results 😃

lukaszzyla commented 1 year ago

Works like a charm again! Thank you so much!!! Swapping the cron image to the custom one has made everything work again. I could tell from the waiting time when forcing php cron.php... How come it used to work before? Most importantly, it is fine again. Thank you for your work. Ocrmypdf looks to be a great tool. I am currently planning to speed up document delivery by camera scanning and OCRing them in Nextcloud. I hope it will be possible to sort them later on, e.g. by customer or shipper.

R0Wi commented 1 year ago

Works like a charm again! Thank you so much!!!

Glad to hear that things are working again 👍

How come it used to work before?

Well, that's indeed an interesting question. I don't know whether you made some minor changes to your setup, but generally speaking: the container which executes the cron.php script needs to have ocrmypdf installed. That's mandatory.

I am currently planning to speed up document delivery by camera scanning and OCRing them in Nextcloud. I hope it will be possible to sort them later on, e.g. by customer or shipper.

Sounds interesting. I'm using the OCR app together with the Full Text Search app, and this works quite well when "scanning" documents with my smartphone, uploading them to NC, getting them OCRed and then indexed into the Elasticsearch database.

lukaszzyla commented 1 year ago

Your extended image looks like a solution to my issues ;-) How do you handle tesseract-ocr languages in your builds? Are they already built in along with ocrmypdf?

R0Wi commented 1 year ago

Well, currently only German and English are installed (see here), but I could extend this of course :+1:

lukaszzyla commented 1 year ago

That would be great if you could. My set includes: eng deu fra ita nld pol
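To see which of these language packs a given image already ships, one could compare the list against what Tesseract reports. The following is only a sketch: the helper name `missing_langs` is hypothetical, and it assumes the installed languages are passed in as a whitespace-separated string.

```shell
#!/bin/sh
# missing_langs INSTALLED WANTED...
# Prints every WANTED language code that does not occur in INSTALLED
# (a whitespace- or newline-separated list), one code per line.
# (Hypothetical helper name, not part of Tesseract or ocrmypdf.)
missing_langs() {
    installed=$1
    shift
    for lang in "$@"; do
        # The unquoted echo collapses newlines into single spaces so that
        # each code can be matched as a whole word.
        case " $(echo $installed) " in
            *" $lang "*) ;;          # already installed, print nothing
            *) echo "$lang" ;;       # not installed, report it
        esac
    done
}
```

Inside the container this could be fed from `tesseract --list-langs`, for example `missing_langs "$(tesseract --list-langs 2>&1)" eng deu fra ita nld pol`. The header line that command prints is passed along too, which should be harmless here as long as it contains no word equal to one of the requested codes.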

R0Wi commented 1 year ago

There you go https://hub.docker.com/layers/r0wi/nextcloud-extended/26-apache/images/sha256-eafd163bc18a91855a55a711bf2b52f1aa3c26f21da42dec256c005863b69362?context=explore :rocket: