eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Problems with addons #1734

Closed by gandy92 2 years ago

gandy92 commented 2 years ago

I've just upgraded from 0.32.0 to 0.38.0 to try out all the amazing new features added in the meantime. However, I'm struggling with addons and I seem to just not be able to get them enabled.

I've added the addons sections to rest-server.conf and joex.conf. However, when I restart my docker-compose setup, I constantly get a 404 on http://localhost:7880/api/v1/sec/addon/run-config, stating

{
  "success": false,
  "message": "Addons disabled"
}

I've checked the logs, but apart from the 404 messages, I couldn't find any means to further debug this. Please advise.

eikek commented 2 years ago

Hi @gandy92 I think the config might not have been picked up. The 404 usually indicates that the feature is disabled.

You need to add this to your config (restserver):

docspell.server.backend.addons.enabled=true

or you can copy the corresponding part from the default config. Maybe there is some typo or the hierarchy doesn't match. Can you show me your config file?

gandy92 commented 2 years ago

Hi @eikek I've just checked by opening a shell into the rest-server and joex docker instances, and confirmed that both /etc/docspell/rest-server.conf and /etc/docspell/joex.conf contain the addon sections I added. Here are my config files (without comments):

rest-server.conf:

docspell.server {
  app-name = "Docspell"
  app-id = "rest1"
  base-url = "http://localhost:7880"
  internal-url = "http://localhost:7880"
  bind {
    address = "localhost"
    port = 7880
  }
  max-item-page-size = 200
  max-note-length = 180
  show-classification-settings = true
  auth {
    server-secret = ""
    session-valid = "5 minutes"
    remember-me {
      enabled = true
      valid = "30 days"
    }
  }
  openid =
    [ { enabled = false,
        display = "Keycloak"
        provider = {
          provider-id = "keycloak",
          client-id = "docspell",
          client-secret = "example-secret-439e-bf06-911e4cdd56a6",
          scope = "profile", # scope is required for OIDC
          authorize-url = "http://localhost:8080/auth/realms/home/protocol/openid-connect/auth",
          token-url = "http://localhost:8080/auth/realms/home/protocol/openid-connect/token",
          sign-key = "b64:...",
          sig-algo = "RS512"
        },
        collective-key = "lookup:docspell_collective",
        user-key = "preferred_username"
      },
      { enabled = false,
        display = "Github"
        provider = {
          provider-id = "github",
          client-id = "<your github client id>",
          client-secret = "<your github client secret>",
          scope = "", # scope is not needed for github
          authorize-url = "https://github.com/login/oauth/authorize",
          token-url = "https://github.com/login/oauth/access_token",
          user-url = "https://api.github.com/user",
          sign-key = "" # this must be set empty
          sig-algo = "RS256" #unused but must be set to something
        },
        collective-key = "fixed:demo",
        user-key = "login"
      }
    ]
  integration-endpoint {
    enabled = false
    priority = "low"
    source-name = "integration"
    allowed-ips {
      enabled = false
      ips = [ "127.0.0.1" ]
    }
    http-basic {
      enabled = false
      realm = "Docspell Integration"
      user = "docspell-int"
      password = "docspell-int"
    }
    http-header {
      enabled = false
      header-name = "Docspell-Integration"
      header-value = "some-secret"
    }
  }
  admin-endpoint {
    secret = ""
  }
  full-text-search {
    enabled = false
    solr = {
      url = "http://localhost:8983/solr/docspell"
      commit-within = 1000
      log-verbose = false
      def-type = "lucene"
      q-op = "OR"
    }
  }
  backend {
    mail-debug = false
    jdbc {
      url = "jdbc:h2://"${java.io.tmpdir}"/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
      user = "sa"
      password = ""
    }
    signup {
      mode = "open"
      new-invite-password = ""
      invite-time = "3 days"
    }
    files {
      chunk-size = 524288
      valid-mime-types = [ ]
    }
    addons = {
      enabled = true
      allow-impure = true
      allowed-urls = "*"
      denied-urls = ""
    }
  }
}

joex.conf:

docspell.joex {
  app-id = "joex1"
  base-url = "http://docspell-joex:7878"
  bind {
    address = "0.0.0.0"
    port = 7878
  }
  jdbc {
    url = "jdbc:postgresql://db:5432/docspell"
    user = "docspell"
    password = "..."
  }
  mail-debug = false
  send-mail {
    list-id = ""
  }
  scheduler {
    name = ${docspell.joex.app-id}
    pool-size = 4
    counting-scheme = "4,1"
    retries = 2
    retry-delay = "1 minute"
    log-buffer-size = 500
    wakeup-period = "30 minutes"
  }
  periodic-scheduler {
    name = ${docspell.joex.app-id}
    wakeup-period = "10 minutes"
  }
  user-tasks {
    scan-mailbox {
      max-folders = 50
      mail-chunk-size = 50
      max-mails = 500
    }
  }
  house-keeping {
    schedule = "Sun *-*-* 00:00:00"
    cleanup-invites = {
      enabled = true
      older-than = "30 days"
    }
    cleanup-remember-me = {
      enabled = true
      older-than = "30 days"
    }
    cleanup-jobs = {
      enabled = true
      older-than = "30 days"
      delete-batch = "100"
    }
    check-nodes {
      enabled = true
      min-not-found = 2
    }
  }
  update-check {
    enabled = false
    test-run = false
    schedule = "Sun *-*-* 00:00:00"
    sender-account = ""
    smtp-id = ""
    recipients = []
    subject = "Docspell {{ latestVersion }} is available"
    body = """
Hello,
You are currently running Docspell {{ currentVersion }}. Version *{{ latestVersion }}*
is now available, which was released on {{ releasedAt }}. Check the release page at:
<https://github.com/eikek/docspell/releases/latest>
Have a nice day!
Docpell Update Check
"""
  }
  extraction {
    pdf {
      min-text-len = 500
    }
    preview {
      dpi = 32
    }
    ocr {
      max-image-size = 14000000
      page-range {
        begin = 10
      }
      ghostscript {
        command {
          program = "gs"
          args = [ "-dNOPAUSE"
                 , "-dBATCH"
                 , "-dSAFER"
                 , "-sDEVICE=tiffscaled8"
                 , "-sOutputFile={{outfile}}"
                 , "{{infile}}"
                 ]
          timeout = "25 minutes"
        }
        working-dir = ${java.io.tmpdir}"/docspell-extraction"
      }
      unpaper {
        command {
          program = "unpaper"
          args = [ "{{infile}}", "{{outfile}}" ]
          timeout = "25 minutes"
        }
      }
      tesseract {
        command {
          program = "tesseract"
          args = ["{{file}}"
                 , "stdout"
                 , "-l"
                 , "{{lang}}"
                 ]
          timeout = "55 minutes"
        }
      }
    }
  }
  text-analysis {
    max-length = 5000
    working-dir = ${java.io.tmpdir}"/docspell-analysis"
    nlp {
      mode = full
      clear-interval = "15 minutes"
      max-due-date-years = 10
      regex-ner {
        max-entries = 1000
        file-cache-time = "1 minute"
      }
    }
    classification {
      enabled = true
      item-count = 600
      classifiers = [
        { "useSplitWords" = "true"
          "splitWordsTokenizerRegexp" = """[\p{L}][\p{L}0-9]*|(?:\$ ?)?[0-9]+(?:\.[0-9]{2})?%?|\s+|."""
          "splitWordsIgnoreRegexp" = """\s+"""
          "useSplitPrefixSuffixNGrams" = "true"
          "maxNGramLeng" = "4"
          "minNGramLeng" = "1"
          "splitWordShape" = "chris4"
          "intern" = "true" # makes it slower but saves memory
        }
      ]
    }
  }
  convert {
    chunk-size = ${docspell.joex.files.chunk-size}
    converted-filename-part = "converted"
    max-image-size = ${docspell.joex.extraction.ocr.max-image-size}
    markdown {
      internal-css = """
        body { padding: 2em 5em; }
      """
    }
    wkhtmlpdf {
      command = {
        program = "wkhtmltopdf"
        args = [
          "-s",
          "A4",
          "--encoding",
          "{{encoding}}",
          "--load-error-handling", "ignore",
          "--load-media-error-handling", "ignore",
          "-",
          "{{outfile}}"
        ]
        timeout = "25 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    tesseract = {
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "out",
          "-l",
          "{{lang}}",
          "pdf",
          "txt"
        ]
        timeout = "25 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    unoconv = {
      command = {
        program = "unoconv"
        args = [
          "-f",
          "pdf",
          "-o",
          "{{outfile}}",
          "{{infile}}"
        ]
        timeout = "20 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--rotate-pages",
          "--deskew",
          "--optimize", "2",
          "--tesseract-timeout", "2400"
          "-j", "4",
          "{{infile}}",
          "{{outfile}}"
        ]
        timeout = "55 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    decrypt-pdf = {
      enabled = true
      passwords = []
    }
  }
  files {
    chunk-size = 524288
    valid-mime-types = [ ]
  }
  full-text-search {
    enabled = true
    solr = {
      url = "http://docspell-solr:8983/solr/docspell"
      commit-within = 1000
      log-verbose = false
      def-type = "lucene"
      q-op = "OR"
    }
    migration = {
      index-all-chunk = 10
    }
  }
  addons {
    working-dir = ${java.io.tmpdir}"/docspell-addons"
    cache-dir = ${java.io.tmpdir}"/docspell-addon-cache"
    executor-config {
      runner = "nix-flake, docker, trivial"
      nspawn = {
        enabled = true
        sudo-binary = "sudo"
        nspawn-binary = "systemd-nspawn"
        container-wait = "100 millis"
      }
      fail-fast = true
      run-timeout = "15 minutes"
      nix-runner {
        nix-binary = "nix"
        build-timeout = "15 minutes"
      }
      docker-runner {
        docker-binary = "docker"
        build-timeout = "15 minutes"
      }
    }
  }
}

I've noticed there are more sections to the default config files I haven't copied over to my files, yet, e.g. for logging. Is it possible the addons depend on any of those?

eikek commented 2 years ago

Hi @gandy92 - the config looks fine to me. I initially overlooked that you are running with docker-compose: are you really using a config file or do you use env variables for configuring? The default docker-compose.yml uses env variables. If you want to use a config file, you need to declare it as an argument to the executable.
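For the environment variable route, here is a minimal sketch of how a config key would map to a variable name. It assumes the naming convention as described in the Docspell documentation (dots become `_`, dashes become `__`, everything uppercased); the helper name `config_key_to_env` is made up for illustration and is not part of Docspell.

```shell
#!/bin/sh
# Hypothetical helper: derive the env variable name for a Docspell config key.
# Assumed rule: "-" -> "__", "." -> "_", then uppercase.
config_key_to_env() {
  printf '%s\n' "$1" \
    | sed -e 's/-/__/g' -e 's/\./_/g' \
    | tr '[:lower:]' '[:upper:]'
}

config_key_to_env docspell.server.backend.addons.enabled
# prints DOCSPELL_SERVER_BACKEND_ADDONS_ENABLED
```

If that convention holds, setting the printed variable to `true` in the environment of the restserver service should have the same effect as the config file entry.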

gandy92 commented 2 years ago

Thank you @eikek, your suggestion was spot on: my docker-compose file made use of environment variables, while the command argument pointing to the config file was commented out. After moving all environment variable settings into the config file, I'm now running from the config file and addons are enabled.

I've installed the rotate addon and configured it as suggested in the documentation. But now I'm facing the problem that my docker image lacks nix. Is there by any chance already a docker image I can use or will I have to dive into that rabbit hole myself? :smile:

eikek commented 2 years ago

Using nix is an option, but there are others that might be easier to use in your case. Maybe the simplest one is creating a new image based on joex and adding qpdf and pdftotext to it (mentioned in the readme). Otherwise docker can be used as well, but since the process running in a container needs to connect to the docker daemon, a mount is necessary. It is commented out in the docker-compose file:

    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp:/tmp

This is only for the joex image, of course. Hope it works with one of these.
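A quick way to confirm the mount took effect is to check for the socket from inside the joex container. This is only a sketch: the function name `has_docker_socket` is hypothetical, and the default path is simply the one from the volume entry above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given path exists and is a Unix socket.
# Defaults to the path used in the volume mount above.
has_docker_socket() {
  sock="${1:-/var/run/docker.sock}"
  [ -S "$sock" ]
}

# Usage, e.g. in a shell inside the joex container:
#   has_docker_socket && echo "docker socket is mounted"
```

A present socket does not guarantee that the user running joex is allowed to talk to the daemon, so a failing docker call can still be a permissions problem.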

gandy92 commented 2 years ago

Indeed, I got a few steps further: the only allowed method is now docker, and I've seen the image build successfully, but then it bailed on me with:

docspell-joex | 2022.09.10 22:07:04:0004 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:05.384109Z Info 2TL5THqUT.../priv/addon-existing-item/High: Running 1 addon tasks for trigger Set(ExistingItem) (jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", task: "addon-existing-item", group: "priv", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High")
docspell-joex | 2022.09.10 22:07:05:0000 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:05.444030Z Info 2TL5THqUT.../priv/addon-existing-item/High: About to run 1 addon(s) in /tmp/docspell-addons/addon-17682764735611401268 (task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:05:0001 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:05.444779Z Info 2TL5THqUT.../priv/addon-existing-item/High: Extract 1 addons to /tmp/docspell-addons/addon-17682764735611401268/addons (task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:05:0000 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:05.487839Z Info 2TL5THqUT.../priv/addon-existing-item/High: Executing addon rotate-pdf-addon-0.2.0-pre (addon-version: "0.2.0-pre", task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", addon-name: "rotate-pdf-addon", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:05:0000 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:05.490539Z Info 2TL5THqUT.../priv/addon-existing-item/High: Building docker image for addon rotate-pdf-addon-0.2.0-pre (addon-version: "0.2.0-pre", task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", addon-name: "rotate-pdf-addon", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:15:0000 [io-comp...] [INFO ] docspell.scheduler.impl.LogSink.logInternal:41 - >>> 2022-09-10T20:07:15.993034Z Info 2TL5THqUT.../priv/addon-existing-item/High: Docker image built successfully (addon-version: "0.2.0-pre", task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", addon-name: "rotate-pdf-addon", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:17:0000 [io-comp...] [ERROR] docspell.scheduler.impl.LogSink.logInternal:51 - >>> 2022-09-10T20:07:17.714785Z Error 2TL5THqUT.../priv/addon-existing-item/High: Addon rotate-pdf-addon-0.2.0-pre returned non-zero: 125 (addon-version: "0.2.0-pre", task: "addon-existing-item", addon-task-id: "DRja5jbyBk6-CSuxh3xPyYt-N8jYhUNDAGH-puK5aYxggYc", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High", addon-name: "rotate-pdf-addon", jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", group: "priv")
docspell-joex | 2022.09.10 22:07:17:0000 [io-comp...] [WARN ] docspell.scheduler.impl.SchedulerImpl.wrapTask:275 - Task failed with permanent error
docspell-joex | java.lang.Exception: Addon execution failed. Do not retry, because some addons were impure.
docspell-joex |         at docspell.joex.addon.AddonTaskExtension$AddonExecutionResultOps.raiseErrorIfNeeded(AddonTaskExtension.scala:23)
docspell-joex |         at docspell.joex.addon.ItemAddonTask$.$anonfun$apply$4(ItemAddonTask.scala:45)
docspell-joex |         at cats.data.OptionT.$anonfun$flatMap$1(OptionT.scala:115)
docspell-joex |         at scala.Option.fold(Option.scala:263)
docspell-joex |         at cats.data.OptionT.$anonfun$flatMapF$1(OptionT.scala:118)
docspell-joex |         at cats.effect.IOFiber.succeeded(IOFiber.scala:1185)
docspell-joex |         at cats.effect.IOFiber.runLoop(IOFiber.scala:975)
docspell-joex |         at cats.effect.IOFiber.asyncContinueSuccessfulR(IOFiber.scala:1338)
docspell-joex |         at cats.effect.IOFiber.run(IOFiber.scala:140)
docspell-joex |         at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:549)
docspell-joex | 
docspell-joex | 2022.09.10 22:07:17:0000 [io-comp...] [WARN ] docspell.scheduler.impl.LogSink.logInternal:45 - >>> 2022-09-10T20:07:17.718386Z Warn 2TL5THqUT.../priv/addon-existing-item/High: Task failed with permanent error! (jobId: "2TL5THqUTeN-Pyr13vvTwHL-wpqtDXjhcop-YB6vyYUPkdc", task: "addon-existing-item", group: "priv", jobInfo: "2TL5THqUT.../priv/addon-existing-item/High")

I did set allow-impure = true in the rest-server.conf

eikek commented 2 years ago

Hm, for some reason the docker run command failed. Is this the only output you can find? Does the UI maybe show more? (Usually everything in the UI should also be logged, but right now I'm not sure :)). If the image was built successfully, then you should be able to see it in docker images. Maybe try first to run it on the shell interactively, like docker run -it <image-name>. It should print out No item data json file found.
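The "returned non-zero: 125" in the log fits that reading. As a rough sketch, the exit statuses that docker run documents can be decoded like this (the function name `explain_docker_exit` is made up for illustration; a contained command could in principle also return one of these codes itself):

```shell
#!/bin/sh
# Hypothetical helper: explain an exit status returned by `docker run`,
# following the meanings given in the docker run reference.
explain_docker_exit() {
  case "$1" in
    0)   echo "container command succeeded" ;;
    125) echo "docker run itself failed (daemon or CLI error)" ;;
    126) echo "contained command could not be invoked" ;;
    127) echo "contained command was not found" ;;
    *)   echo "exit code of the contained command" ;;
  esac
}

explain_docker_exit 125
# prints: docker run itself failed (daemon or CLI error)
```

So with 125 the addon script most likely never started, and the cause is better looked for on the docker side (missing image, bad arguments, daemon access) than in the addon.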

gandy92 commented 2 years ago

Actually, there is no docker image after the command fails, but since the log states that the image had been built successfully, I was under the assumption that it maybe got deleted afterwards. Actually, when I switch joex logging to DEBUG level, the log tells a different story. I've captured all debug messages beginning with "About to run 1 addon(s)" and removed all lines containing "HikariPool-1 - Reset" (almost 50%) for better readability: addon-run.log. Apparently it complains about a missing package guile-2.2. Installing the guile-package on the docker host didn't help, nor did I expect it to. What else can I try?
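For anyone wanting to reproduce that filtering, a minimal sketch follows. The function name `filter_addon_log` is hypothetical; the two patterns are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper: keep everything from the first "About to run 1 addon(s)"
# line onwards, then drop the noisy connection pool messages.
filter_addon_log() {
  sed -n '/About to run 1 addon(s)/,$p' | grep -v 'HikariPool-1 - Reset'
}

# Usage (the container name is an assumption taken from the log prefix):
#   docker logs docspell-joex 2>&1 | filter_addon_log > addon-run.log
```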

gandy92 commented 2 years ago

I've tried to manually docker build the addon from the git sources, and that also fails. A quick search revealed that guile-2.2 was apparently dropped from debian:bookworm in favor of guile-3.0. Replacing guile-2.2 with guile-3.0 allowed for a successful docker build, and a docker run yielded:

No item data json file found.

How would I have to call docker build so that the addon image can be used by the joex docker runner? Then I could check if the addon works as intended with guile-3.0
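The manual workaround described above can be sketched roughly as follows. It assumes the package name guile-2.2 appears literally in the addon's Dockerfile; the function name `patch_guile` is hypothetical, and the tag simply matches the image name used in the test script further down this thread. Whether the joex docker runner reuses an image with that tag is not something this sketch establishes.

```shell
#!/bin/sh
# Hypothetical helper: replace the guile-2.2 package with guile-3.0
# in the given Dockerfile (default: ./Dockerfile), editing it in place.
patch_guile() {
  dockerfile="${1:-Dockerfile}"
  sed 's/guile-2\.2/guile-3.0/g' "$dockerfile" > "$dockerfile.tmp" \
    && mv "$dockerfile.tmp" "$dockerfile"
}

# Usage, from the checked out addon sources:
#   patch_guile Dockerfile && docker build -t rotate-pdf-addon:latest .
```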

eikek commented 2 years ago

Ah, that's interesting. I could build it here by removing guile-2.2 altogether. I'm not sure if guile-json works with guile-3.0, but let's see. So to test it more thoroughly, add this file into the checked out sources (somewhere):

#!/usr/bin/env bash

project_dir=$(git rev-parse --show-toplevel)

docker run \
       --mount type=bind,source="$project_dir/test",target=/mnt \
       --env ITEM_DATA_JSON=/mnt/item_data.json \
       --env OUTPUT_DIR=/mnt/tmp \
       --env ITEM_PDF_DIR=/mnt/pdf \
       rotate-pdf-addon \
       /mnt/input.json

and execute it.

Edit: on my machine it prints out this:

❯ ./test/run.sh
Processing attachment "my-file.converted.pdf"
Running qpdf to rotate "90"
Running pdftotext to extract the text from rotatet file

{"files":[{"itemId":"qZDnyGIAJsXr","textFiles":{"BXLaDza97A":"BXLaDza97A.txt"},"pdfFiles":{"BXLaDza97A":"BXLaDza97A.pdf"}}]}

gandy92 commented 2 years ago

After tagging the image with "rotate-pdf-addon:latest", executing the script threw a "qpdf: open #f/BXLaDza97A: No such file or directory" at me. However, running the addon from docspell now works: after reopening the file, it was indeed in the intended orientation. I tried with guile-3.0 and without, and both variants worked for me.

Is there any way to let docspell know it needs to update the preview for the rotated document?

eikek commented 2 years ago

The "qpdf: open #f/BXLaDza97A: No such file or directory" could be due to a typo that I fixed in an edit of the comment. Maybe you got the earlier version :/. The preview should be updated automatically - maybe try hard-reloading the page.

gandy92 commented 2 years ago

All is fine now, thank you so much for your help (and patience)! Clearing the browser cache while in the document search view helped. Switching between views also helped; I've probably been too impatient. Now that everything works, I also found the error messages for the failed tasks in the UI; I just didn't think of looking in the job execution view. If I haven't done so already, let me thank you for the addons feature. The rotate addon alone is a big win for organizing the documents.

gandy92 commented 2 years ago

One last thing: Apparently, the rotate addon does not operate on the original document, so when I mark it for reprocessing for OCR, I end up with the original orientation. Is this the intended behaviour and can I change it somehow?

eikek commented 2 years ago

This is intended: the original document is never changed. However, the addon should extract the text from the rotated pdf. Hm, maybe it is not doing OCR; I have to check.