lukaszzyla closed this issue 1 year ago
Usually a `php fpm` setup consists of two containers, right? One webserver container and the `fpm` container itself, where the PHP code is executed. Are you sure that the `ocrmypdf` command is available inside the `fpm` container for the user running the fpm process?
And could you share your Docker setup so that we can reproduce this issue, including `docker-compose` and/or image names, `Dockerfile`, etc.?
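A minimal sketch of that check, assuming shell access to the Docker host. The function name `check_in_container` and the container name in the usage line are placeholders, not part of the reported setup; `www-data` is the user named elsewhere in this thread.

```shell
# Check whether a binary is on PATH for a given user inside a container.
# Usage: check_in_container <container> <user> <binary>
# (container name and user are assumptions; adjust them to your stack)
check_in_container() {
    # "command -v" runs inside the container, as the requested user
    if docker exec -u "$2" "$1" sh -c 'command -v "$1"' sh "$3" >/dev/null 2>&1; then
        echo "ok: $3 is available to $2 in $1"
    else
        echo "missing: $3 is not available to $2 in $1"
        return 1
    fi
}
```

For example, `check_in_container my_fpm_container www-data ocrmypdf` reports whether the FPM user can find the binary.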
Hi, thank you for the quick reply. I have been trying to get this working for 2 days already. I started many new containers. I even tried downgrading to the last working version, 1.26.0, but the issue persists even after downgrading. It is strange because it was working fine before.
This is my Portainer stack for one of the new instances. I even tried installing a new server on this setup, but the situation is exactly the same. The Docker image with tesseract and ocrmypdf installed is nextcloud-fpm-custom, and this is the container that is able to run ocrmypdf as www-data from inside.
Everything is behind an nginx proxy, so SOMELOCALIP is a trusted proxy (nginx) on the same host as the Docker container with NC.
portainer stack below: version: '2'
services: nc_rentrans_db: image: mariadb:10.5 restart: always command: --transaction-isolation=READ-COMMITTED --binlog-format=ROW volumes:
nginx-pm_default
rt_redis: image: redis:alpine
restart: always
app: image: nextcloud-fpm-custom restart: always links:
nginx-pm_default
web: image: nginx restart: always ports:
nginx-pm_default
cron: image: nextcloud:fpm restart: always
volumes:
networks: nginx-pm_default: external: true
---------------end of stack-----------
below /app/config/nginx.conf:
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events { worker_connections 1024; }
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
sendfile on;
#tcp_nopush on;
# Prevent nginx HTTP Server Detection
server_tokens off;
keepalive_timeout 65;
#gzip on;
upstream php-handler {
server app:9000;
}
server {
listen 80;
# HSTS settings
# WARNING: Only add the preload option once you read about
# the consequences in https://hstspreload.org/. This option
# will add the domain to a hardcoded list that is shipped
# in all major browsers and getting removed from this list
# could take several months.
add_header Strict-Transport-Security "max-age=15768000; includeSubDomains; preload;" always;
# set max upload size
client_max_body_size 512M;
fastcgi_buffers 64 4K;
# Enable gzip but do not remove ETag headers
gzip on;
gzip_vary on;
gzip_comp_level 4;
gzip_min_length 256;
gzip_proxied expired no-cache no-store private no_last_modified no_etag auth;
gzip_types application/atom+xml application/javascript application/json application/ld+json application/manifest+json application/rss+xml application/vnd.geo+json application/vnd.ms-fontobject application/x-font-ttf application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/bmp image/svg+xml image/x-icon text/cache-manifest text/css text/plain text/vcard text/vnd.rim.location.xloc text/vtt text/x-component text/x-cross-domain-policy;
# Pagespeed is not supported by Nextcloud, so if your server is built
# with the `ngx_pagespeed` module, uncomment this line to disable it.
#pagespeed off;
# HTTP response headers borrowed from Nextcloud `.htaccess`
add_header Referrer-Policy "no-referrer" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Download-Options "noopen" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Permitted-Cross-Domain-Policies "none" always;
add_header X-Robots-Tag "noindex,nofollow" always;
add_header X-XSS-Protection "1; mode=block" always;
# Remove X-Powered-By, which is an information leak
fastcgi_hide_header X-Powered-By;
# Path to the root of your installation
root /var/www/html;
# Specify how to handle directories -- specifying `/index.php$request_uri`
# here as the fallback means that Nginx always exhibits the desired behaviour
# when a client requests a path that corresponds to a directory that exists
# on the server. In particular, if that directory contains an index.php file,
# that file is correctly served; if it doesn't, then the request is passed to
# the front-end controller. This consistent behaviour means that we don't need
# to specify custom rules for certain paths (e.g. images and other assets,
# `/updater`, `/ocm-provider`, `/ocs-provider`), and thus
# `try_files $uri $uri/ /index.php$request_uri`
# always provides the desired behaviour.
index index.php index.html /index.php$request_uri;
# Rule borrowed from `.htaccess` to handle Microsoft DAV clients
location = / {
if ( $http_user_agent ~ ^DavClnt ) {
return 302 /remote.php/webdav/$is_args$args;
}
}
location = /robots.txt {
allow all;
log_not_found off;
access_log off;
}
# Make a regex exception for `/.well-known` so that clients can still
# access it despite the existence of the regex rule
# `location ~ /(\.|autotest|...)` which would otherwise handle requests
# for `/.well-known`.
location ^~ /.well-known {
# The rules in this block are an adaptation of the rules
# in `.htaccess` that concern `/.well-known`.
location = /.well-known/carddav { return 301 https://SOMEHOST/remote.php/dav/; }
location = /.well-known/caldav { return 301 https://SOMEHOST/remote.php/dav/; }
location /.well-known/acme-challenge { try_files $uri $uri/ =404; }
location /.well-known/pki-validation { try_files $uri $uri/ =404; }
# Let Nextcloud's API for `/.well-known` URIs handle all other
# requests by passing them to the front-end controller.
return 301 https://SOMEHOST/index.php$request_uri;
}
# Rules borrowed from `.htaccess` to hide certain paths from clients
location ~ ^/(?:build|tests|config|lib|3rdparty|templates|data)(?:$|/) { return 404; }
location ~ ^/(?:\.|autotest|occ|issue|indie|db_|console) { return 404; }
# Ensure this block, which passes PHP files to the PHP process, is above the blocks
# which handle static assets (as seen below). If this block is not declared first,
# then Nginx will encounter an infinite rewriting loop when it prepends `/index.php`
# to the URI, resulting in a HTTP 500 error response.
location ~ \.php(?:$|/) {
# Required for legacy support
rewrite ^/(?!index|remote|public|cron|core\/ajax\/update|status|ocs\/v[12]|updater\/.+|oc[ms]-provider\/.+|.+\/richdocumentscode\/proxy) /index.php$request_uri;
fastcgi_split_path_info ^(.+?\.php)(/.*)$;
set $path_info $fastcgi_path_info;
try_files $fastcgi_script_name =404;
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_param PATH_INFO $path_info;
fastcgi_param HTTPS on;
fastcgi_param modHeadersAvailable true; # Avoid sending the security headers twice
fastcgi_param front_controller_active true; # Enable pretty urls
fastcgi_pass php-handler;
fastcgi_intercept_errors on;
fastcgi_request_buffering off;
}
location ~ \.(?:css|js|svg|gif)$ {
try_files $uri /index.php$request_uri;
expires 6M; # Cache-Control policy borrowed from `.htaccess`
access_log off; # Optional: Don't log access to assets
}
location ~ \.woff2?$ {
try_files $uri /index.php$request_uri;
expires 7d; # Cache-Control policy borrowed from `.htaccess`
access_log off; # Optional: Don't log access to assets
}
# Rule borrowed from `.htaccess`
location /remote {
return 301 https://SOMEHOST/remote.php$request_uri;
}
location / {
try_files $uri $uri/ /index.php$request_uri;
}
}
}
------------end of nginx.conf----------------
below config.php
<?php
$CONFIG = array (
  'memcache.local' => '\OC\Memcache\APCu',
  'apps_paths' =>
  array (
    0 =>
    array (
      'path' => '/var/www/html/apps',
      'url' => '/apps',
      'writable' => false,
    ),
    1 =>
    array (
      'path' => '/var/www/html/custom_apps',
      'url' => '/custom_apps',
      'writable' => true,
    ),
  ),
  'instanceid' => '...id...',
  'passwordsalt' => '...salt...',
  'secret' => '...secret...',
  'trusted_domains' =>
  array (
    0 => 'SOMELOCALIP',
    1 => 'SOMEHOST',
  ),
  'allow_local_remote_servers' => true,
  'datadirectory' => '/var/www/html/data',
  'dbtype' => 'mysql',
  'version' => '26.0.0.11',
  'trusted_proxies' => 'SOMELOCALIP',
  'overwrite.cli.url' => 'https://SOMEHOST',
  'overwriteprotocol' => 'https',
  'dbname' => 'nextcloud',
  'dbhost' => 'nextclouddb',
  'dbport' => '',
  'dbtableprefix' => 'oc',
  'mysql.utf8mb4' => true,
  'dbuser' => '.......',
  'dbpassword' => '......',
  'installed' => true,
  'mail_smtpmode' => 'smtp',
  'mail_smtpsecure' => 'ssl',
  'mail_sendmailmode' => 'smtp',
  'mail_from_address' => '.........',
  'mail_domain' => 'gmail.com',
  'mail_smtpauthtype' => 'LOGIN',
  'mail_smtpauth' => 1,
  'mail_smtphost' => 'smtp.gmail.com',
  'mail_smtpport' => '465',
  'mail_smtpname' => '.....@gmail.com',
  'mail_smtppassword' => '.....',
  'default_phone_region' => 'PL',
  'ldapProviderFactory' => 'OCA\User_LDAP\LDAPProviderFactory',
  'maintenance' => false,
  'theme' => '',
  'loglevel' => 2,
  'app_install_overwrite' =>
  array (
    0 => 'pdfdraw',
    1 => 'ocjobs',
    2 => 'files_fulltextsearch_tesseract',
    3 => 'files_fulltextsearch',
    4 => 'fulltextsearch',
    5 => 'fulltextsearch_elasticsearch',
    6 => 'richdocumentscode',
  ),
);
This could be an issue:
# ...
  cron:
    image: nextcloud:fpm
    restart: always
# ...
Since the OCR processing is done asynchronously via the Nextcloud cron engine, the process that executes `cron.php` has to be able to access `ocrmypdf`. The default image `nextcloud:fpm` has no `ocrmypdf` installed. I assume you've installed `ocrmypdf` inside your custom image `nextcloud-fpm-custom`, so you could just try replacing `image: nextcloud:fpm` with `image: nextcloud-fpm-custom`. Please also make sure your custom image is capable of executing the cron via the entrypoint `/cron.sh`.
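To confirm whether the app and cron containers really run different images, something like the following sketch may help. It assumes shell access to the Docker host; the helper names `container_image` and `compare_images` are made up for illustration, and the container names you pass in depend on how your stack names them.

```shell
# Print the image a container was created from.
container_image() {
    docker inspect --format '{{.Config.Image}}' "$1"
}

# Report whether two containers (e.g. the FPM app and the cron sidecar)
# run the same image. Container names are placeholders for your own.
compare_images() {
    a=$(container_image "$1") || return 2
    b=$(container_image "$2") || return 2
    if [ "$a" = "$b" ]; then
        echo "same image: $a"
    else
        echo "different images: $1=$a $2=$b"
        return 1
    fi
}
```

Calling `compare_images <app-container> <cron-container>` before and after changing the stack shows whether the cron service picked up the custom image.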
Sidenote: If you're looking for a prebuilt extended Docker image for Nextcloud, https://github.com/R0Wi/nextcloud-docker-extended might be interesting for you. Currently I'm only supporting the `apache`-based images, but we could add the `fpm` ones, too. The images are automatically updated once a day by fetching the appropriate upstream.
Looking forward to hearing your results 😃
Works like a charm again! Thank you so much!!! Swapping the cron image to the custom one made everything work again. I could tell from the waiting time when forcing php cron.php... How come it used to work before? The most important thing is that it is fine again. Thank you for your work. Ocrmypdf looks to be a great tool. I am currently planning to speed up document delivery by camera scanning documents and OCRing them in Nextcloud. I hope it will be possible to sort them later on, e.g. by customer or shipper.
> Works like a charm again! Thank you so much!!!
Glad to hear that things are working again 👍
> How come it used to work before?
Well, that's indeed an interesting question. I don't know whether you made some minor changes to your setup, but generally speaking, the container that executes the `cron.php` script needs to have `ocrmypdf` installed. That's mandatory.
> I am currently planning to speed up document delivery by camera scanning documents and OCRing them in Nextcloud. I hope it will be possible to sort them later on, e.g. by customer or shipper.
Sounds interesting. I'm using the OCR app together with the Full Text Search app, and this works quite well when "scanning" documents with my smartphone, uploading them to NC, getting them OCRed, and then indexing them into the Elasticsearch database.
Your extended image looks like a solution to my issues ;-) How do you handle tesseract-ocr languages in your builds? Are they already built in together with ocrmypdf?
Well, currently only German and English are installed (see here), but I could extend this of course 👍
That would be great if you could. My set includes: eng deu fra ita nld pol
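For anyone wanting to check which of those language packs an image already ships, here is a rough sketch meant to be run inside the container. It relies on `tesseract --list-langs`; the function name `missing_langs` is made up for illustration.

```shell
# Report which of the requested Tesseract languages are not installed.
# Usage: missing_langs eng deu fra ita nld pol
missing_langs() {
    # --list-langs prints a header line, then one language code per line
    installed=$(tesseract --list-langs 2>&1)
    missing=""
    for lang in "$@"; do
        if ! printf '%s\n' "$installed" | grep -qx "$lang"; then
            missing="$missing $lang"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all present"
}
```

On an image that only ships German and English, `missing_langs eng deu fra ita nld pol` would be expected to list the remaining four codes.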
I have a built Docker image of nextcloud26:fpm with tesseract in a few languages and ocrmypdf built in. It used to work flawlessly, but after the last update I only keep seeing:
User www-data can run ocrmypdf from the command line inside the container, and it generates output. Replicated on another instance with a brand-new install.