eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Weird encoding/render characters? #1898

Closed Enjoyed closed 1 year ago

Enjoyed commented 1 year ago

Hello again!

I'm having trouble finding out why weird characters show up in the converted file. Here's an example: [image]

Now, the weird thing is, I can copy-paste these characters and get the proper text! Here's the source file shown correctly: [image]

Even the extracted text is correct. It seems to affect only the preview/render, but when I download the file, the weird characters are still there. There's no visible warning, error, or info message in the browser console or in the file processing log.

It doesn't affect every document. In this example it's the entire document, but I also have documents without any weird text, and documents with maybe a single line of it. Could it be that docspell is missing some font that these PDFs use?

Thanks

Enjoyed commented 1 year ago

To add more details, here would be the log about the conversion:

Sat, December 31st, 2022, 14:14: Converting file Some(ResumenContrato2.pdf) (application/pdf) into a PDF
Sat, December 31st, 2022, 14:14: Storing input to file /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/infile for running ocrmypdf
Sat, December 31st, 2022, 14:14: Trying to read the PDF using 1 passwords
Sat, December 31st, 2022, 14:14: Running external command: ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/infile /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/out.pdf
Sat, December 31st, 2022, 14:14: Command `ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/infile /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/out.pdf` finished: 0
Sat, December 31st, 2022, 14:14: ocrmypdf stdout:
Sat, December 31st, 2022, 14:14: ocrmypdf stderr: 1 skipping all processing on this page 2 skipping all processing on this page Postprocessing... Optimize ratio: 1.00 savings: 0.1% Output file is a PDF/A-2B (as expected)
Sat, December 31st, 2022, 14:14: Conversion to pdf successful. Saving file.
Sat, December 31st, 2022, 14:14: Closing process: `ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/infile /tmp/docspell-convert/docspell-ocrmypdf11411127099403049982/out.pdf`

Could this be an ocrmypdf issue rather than a docspell one?

eikek commented 1 year ago

It is hard to diagnose from here, but my first guess is that some font is missing on the system that runs joex. This would affect almost all tools that do something with PDFs, and it would only affect those PDFs that don't include all the glyphs, which is a bit sad imho. Then the fallback font seems to be off as well. You could maybe test the PDFs in question on the system by running ocrmypdf by hand and/or using pdfinfo etc. to look into the PDF. If you have a PDF without any sensitive information that you can share, I could take a look as well.
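One way to look into a PDF's fonts, in the spirit of the pdfinfo suggestion, is `pdffonts`, which ships alongside `pdfinfo` in poppler-utils. This is only a minimal sketch: it assumes poppler-utils is installed where you run it (it may not be present in the joex container), and the function name `inspect_pdf_fonts` is made up for illustration.

```shell
# Print the font table of a PDF using pdffonts (part of poppler-utils).
# Assumption: pdffonts is installed; the function name is hypothetical.
inspect_pdf_fonts() {
  pdf="$1"
  if [ ! -f "$pdf" ]; then
    echo "no such file: $pdf" >&2
    return 1
  fi
  if ! command -v pdffonts >/dev/null 2>&1; then
    echo "pdffonts not found (it is part of poppler-utils)" >&2
    return 2
  fi
  # Relevant columns: emb (font embedded in the file), sub (subset),
  # uni (the font has an explicit ToUnicode map).
  pdffonts "$pdf"
}
```

For example, `inspect_pdf_fonts test.pdf`. A font listed with `emb` = `no` has to be supplied by whatever program renders the file, and a font listed with `uni` = `no` carries no explicit ToUnicode map.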

Enjoyed commented 1 year ago

Thanks for your time!

You can get the file:

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dFwkZi4jUIabPkWhogmMFvBmrimrqehY' -O test.pdf

I tested it in the same Docker container:

bash-5.1# ocrmypdf -l spa --skip-text --deskew -j 1 test.pdf outtest.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  7.58page/s]
    1 skipping all processing on this page
    2 skipping all processing on this page
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:00<00:00, 305.44page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.45page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

I used the same command that I saw in the logs, and the result (outtest.pdf) seems to work properly. Just in case, I re-uploaded this file via the GUI, and it shows the same skips:

Sat, December 31st, 2022, 19:14: Converting file Some(test.pdf) (application/pdf) into a PDF
Sat, December 31st, 2022, 19:14: Storing input to file /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/infile for running ocrmypdf
Sat, December 31st, 2022, 19:14: Trying to read the PDF using 1 passwords
Sat, December 31st, 2022, 19:14: Running external command: ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/infile /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/out.pdf
Sat, December 31st, 2022, 19:14: Command `ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/infile /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/out.pdf` finished: 0
Sat, December 31st, 2022, 19:14: ocrmypdf stdout:
Sat, December 31st, 2022, 19:14: ocrmypdf stderr: 1 skipping all processing on this page 2 skipping all processing on this page Postprocessing... Optimize ratio: 1.00 savings: 0.1% Output file is a PDF/A-2B (as expected)
Sat, December 31st, 2022, 19:14: Conversion to pdf successful. Saving file.
Sat, December 31st, 2022, 19:14: Closing process: `ocrmypdf -l spa --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/infile /tmp/docspell-convert/docspell-ocrmypdf13296477803091843268/out.pdf`

But, sadly, here is the result: [image]

Thanks for your help.

eikek commented 1 year ago

Thanks for the details! I need to take a closer look when I have more time. It could indeed be some fonts missing in the container, but then why does it work when executed manually? Hm, could it be that the fonts are not accessible to the docspell user that runs the joex process?

Enjoyed commented 1 year ago

Seems like it really is a fonts issue! Here are the fonts with the most warnings (count, then font name):

    142 IOTLNU+ZurichSans-Light
    148 SDPARV+OpenSans-Regular
    150 QRXVCX+OpenSans-Regular
    152 JAGJIZ+MVBoli
    156 VVAXZX+VodafoneRg-Regular
    158 OPIHHQ+VodafoneRg-Regular
    168 YAHVMA+Cambria-Bold
    180 AESENE+Montserrat-Light
    192 YWFHKK+PTSans-Bold
    210 MYJLLU+Verdana
    224 HWFXOE+ArialMT
    224 PLITJB+Roboto-Light
    224 SYFWCK+ArialMT
    236 ICIHVL+Unknown
    256 EKXQCE+OpenSans-Regular
    260 QPZXUL+Montserrat-Medium
    264 ITCFRB+Munged-LWbV4xkyBu
    289 BERNUP+OpenSans-Regular
    360 SCUAFJ+PTSans-Regular
    376 DTHUUL+Roboto-Bold
    392 GAAAXG+Humnst777BT
    456 HFRDKL+OpenSans-Regular
    520 GSLSKO+OpenSans-Regular
   1022 QDSREI+OpenSans-Regular
   2304 SNPZKF+OpenSans-Regular
   2898 ECAHTO+Humnst777BT

And the full warnings, as an example:

 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+233 (233) in font AOFPHQ+Cambria
 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+66 (66) in font AOFPHQ+Cambria
 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+110 (110) in font AOFPHQ+Cambria
 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+243 (243) in font AOFPHQ+Cambria
 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+84 (84) in font DCWWQB+Cambria,Bold
 [WARN ] org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+73 (73) in font DCWWQB+Cambria,Bold

Now I wonder whether my PDFs are faulty or whether joex has a problem getting the fonts.
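A per-font count of these PDFBox warnings can be produced from the joex log with standard tools. This is a minimal sketch, assuming the warnings have been saved to a file; the function name `count_unmapped_fonts` and the file name `joex.log` are hypothetical and not taken from the thread.

```shell
# Count "No Unicode mapping" warnings per font in a saved log file.
# Assumption: the log was written to a file, e.g. joex.log (hypothetical name).
count_unmapped_fonts() {
  log_file="$1"
  if [ ! -f "$log_file" ]; then
    echo "no such file: $log_file" >&2
    return 1
  fi
  # Keep only the PDFBox warnings, strip everything up to the font name,
  # then count identical font names and sort by count, ascending.
  grep 'No Unicode mapping for CID' "$log_file" \
    | sed 's/.* in font //' \
    | sort \
    | uniq -c \
    | sort -n
}
```

For example, `count_unmapped_fonts joex.log` prints one line per font, with the number of warnings first and the font that triggers the most warnings last.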

eikek commented 1 year ago

It is very likely a fonts issue. It cannot find the Unicode mapping for a glyph in the PDF. Probably the PDF doesn't contain it itself and the font being loaded doesn't provide it. I tested your PDF on my installation and it worked properly. I think this only affects the Docker images: they are missing some font that you need for your PDF. If you know which one, we could adapt the images to install it. OTOH, you mentioned that running the same command manually produced a correct PDF? Then I really don't know how that can happen…

Update: I just tried your document on the docker version of docspell (current snapshot) and could not reproduce it. Strange! Do you run the provided docker environment or what's your setup?

FCUnlimited commented 1 year ago

Hi! I've got the same error with several PDFs. Docspell runs in Docker here as well.

I tried to find the difference between the files, and the only difference I could find is in the font properties. The encoding is different: the files that work use "Integrated", the files that don't work use "WinAnsiEncoding".

Let me know if I can supply you with more information

eikek commented 1 year ago

I think it would help if I could get such a PDF, so I can do some research. Maybe it is possible to generate one that contains only non-sensitive information?

Enjoyed commented 1 year ago

> Update: I just tried your document on the docker version of docspell (current snapshot) and could not reproduce it. Strange! Do you run the provided docker environment or what's your setup?

I don't modify the joex image. Here's my docker-compose for joex:

joex:
    image: docspell/joex:latest
    command:
      - -J-Xmx1G
    networks: 
      - traefik-public
      - database
    environment:
      - TZ=Europe/Madrid
      - DOCSPELL_JOEX_APP__ID=joex1
      - DOCSPELL_JOEX_PERIODIC__SCHEDULER_NAME=joex1
      - DOCSPELL_JOEX_SCHEDULER_NAME=joex1
      - DOCSPELL_JOEX_BASE__URL=http://joex:7878
      - DOCSPELL_JOEX_BIND_ADDRESS=0.0.0.0
      - DOCSPELL_JOEX_FULL__TEXT__SEARCH_ENABLED=true
      - DOCSPELL_JOEX_FULL__TEXT__SEARCH_SOLR_URL=http://solr:8983/solr/docspell
      - DOCSPELL_JOEX_JDBC_PASSWORD=[REDACTED]
      - DOCSPELL_JOEX_JDBC_URL=jdbc:mariadb://mariadb:3306/docs
      - DOCSPELL_JOEX_JDBC_USER=docs
      - DOCSPELL_JOEX_ADDONS_EXECUTOR__CONFIG_RUNNER=docker,trivial
      - DOCSPELL_JOEX_CONVERT_HTML__CONVERTER=weasyprint
      - DOCSPELL_JOEX_USER__TASKS_SCAN__MAILBOX_MAIL__CHUNK__SIZE=300
      - DOCSPELL_JOEX_USER__TASKS_SCAN__MAILBOX_MAX__MAILS=20000
      - DOCSPELL_JOEX_SCHEDULER_POOL__SIZE=4
    ports:
      - "7878"
    depends_on:
      - solr
      - mariadb
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role != manager
    volumes:
    #   - /var/run/docker.sock:/var/run/docker.sock
      - /mnt/ceph/docs/tmp:/tmp
      - /mnt/ceph/docs/docspell.conf:/opt/docspell.conf

Re-reading it, could it be due to DOCSPELL_JOEX_CONVERT_HTML__CONVERTER=weasyprint? Additionally, I tried installing fonts into the joex container (Google Fonts, 1 GB+ worth of fonts) without success :(

eikek commented 1 year ago

Hm, not sure what's going on. I tried this file

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dFwkZi4jUIabPkWhogmMFvBmrimrqehY' -O test.pdf

and I could not reproduce the issue; the converted file looked fine. The weasyprint setting is only used when converting HTML files, so I don't think it is related.

eikek commented 1 year ago

Oh wait, can you try the snapshot images instead of the latest ones? Maybe something is off in the latest image but works in the most current one.

Enjoyed commented 1 year ago

I guess by snapshot you mean the nightly image? https://hub.docker.com/r/docspell/joex/tags?page=1 It still happens with this build :(

  joex:
    image: docspell/joex:nightly

[image]

FCUnlimited commented 1 year ago

I've got the same weird result with the test.pdf. I then printed such a PDF with a PDF printer and included all fonts in the output. In that case it all works fine.

eikek commented 1 year ago

This is strange, I cannot reproduce it here on my machine. I tried the PDF from @Enjoyed (downloaded from Google Docs). The result looks like this (using the Docker image of the latest stable version):

[image: Selection_051]

@FCUnlimited can you maybe share a document that I could play with?

Enjoyed commented 1 year ago

Interestingly enough, I ran some tests!

I installed Docker on my Windows machine, using the same settings (docspell.conf) and so on, and it happens there too! [image] [image]

Here's the docker compose (everything without volumes, so everything "new"):

#
# Docker-Compose
#
version: '3.8'
services:
  restserver:
    image: docspell/restserver:latest
    restart: unless-stopped
    container_name: restserver
    ports:
      - "7880:7880"
    environment:
      - TZ=Europe/Madrid
      - DOCSPELL_SERVER_INTERNAL__URL=http://restserver:7880
      - DOCSPELL_SERVER_ADMIN__ENDPOINT_SECRET=value2
      - DOCSPELL_SERVER_AUTH_SERVER__SECRET=value3
      - DOCSPELL_SERVER_BACKEND_JDBC_PASSWORD=dbpass
      - DOCSPELL_SERVER_BACKEND_JDBC_URL=jdbc:postgresql://db:5432/dbname
      - DOCSPELL_SERVER_BACKEND_JDBC_USER=dbuser
      - DOCSPELL_SERVER_BIND_ADDRESS=0.0.0.0
      - DOCSPELL_SERVER_FULL__TEXT__SEARCH_ENABLED=true
      - DOCSPELL_SERVER_FULL__TEXT__SEARCH_SOLR_URL=http://solr:8983/solr/docspell
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_ENABLED=true
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_ENABLED=true
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_HEADER__VALUE=value1
      - DOCSPELL_SERVER_BACKEND_SIGNUP_MODE=open
      - DOCSPELL_SERVER_BACKEND_ADDONS_ENABLED=false
    depends_on:
      - solr
      - db
    command:
      - /opt/docspell.conf
    volumes:
      - ./docspell.conf:/opt/docspell.conf

  joex:
    image: docspell/joex:latest
    restart: unless-stopped
    container_name: joex
    command:
      - -J-Xmx1G
    ports:
      - "7878:7878"
    depends_on:
      - solr
      - db
    volumes:
      - ./docspell.conf:/opt/docspell.conf

  consumedir:
    image: docspell/dsc:latest
    restart: unless-stopped
    container_name: consumedir
    command:
      - dsc
      - "-d"
      - "http://restserver:7880"
      - "watch"
      - "--delete"
      - "-ir"
      - "--not-matches"
      - "**/.*"
      - "--header"
      - "Docspell-Integration:value1"
      - "/opt/docs"
    depends_on:
      - restserver

  db:
    image: postgres:15.1
    container_name: db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=dbuser
      - POSTGRES_PASSWORD=dbpass
      - POSTGRES_DB=dbname

  solr:
    image: solr:9
    container_name: solr
    restart: unless-stopped
    command:
      - solr-precreate
      - docspell
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/docspell/admin/ping"]
      interval: 1m
      timeout: 10s
      retries: 2
      start_period: 30s

and the docspell.conf:

docspell.server {
  app-name = "[redacted] Docs"
  app-id = "rest1"
  base-url = "http://localhost:7880"
  internal-url = "http://localhost:7880"
  logging {
    format = "Fancy"
    minimum-level = "Info"
    levels = {
      "docspell.oidc" = "Trace"
    }
  }
  bind {
    address = "0.0.0.0"
    port = 7880
  }
  server-options {
    enable-http-2 = false
    max-connections = 1024
    response-timeout = 45s
  }
  max-item-page-size = 2000
  max-note-length = 180
  show-classification-settings = true
  auth {
      server-secret = "value3"
      session-valid = "30 minutes"
      remember-me {
          enabled = true
          valid = "30 days"
      }
      on-account-source-conflict = "convert"
  }
  download-all {
    max-files = 5000

    # The maximum (uncompressed) size of the zip file contents.
    max-size = 5000M
  }
  openid =
  [ { enabled = false,
      display = "Authelia"
      provider = {
          provider-id = "authelia",
          client-id = "docs",
          client-secret = "[redacted]",
          scope = "openid profile groups email",
          authorize-url = "https://[redacted]/api/oidc/authorization",
          token-url = "https://[redacted]/api/oidc/token",
          user-url = "https://[redacted]/api/oidc/userinfo",
          sign-key = ""
          sig-algo = "RS256"
      },
      collective-key = "fixed:[redacted]",
      user-key = "preferred_username"
    }
  ]
  oidc-auto-redirect = false
  integration-endpoint {
    enabled = true
    priority = "low"
    allowed-ips {
      enabled = true
      ips = [
        "*.*.*.*"
      ]
    }
  }
  admin-endpoint {
    # The secret. If empty, the endpoint is disabled.
    secret = "value2"
  }
  full-text-search {
    enabled = true
    backend = "solr"
    solr = {
      url = "http://solr:8983/solr/docspell"
      commit-within = 1000
      log-verbose = false
    }
  }
  backend {
    jdbc {
      url = "jdbc:postgresql://db:5432/dbname"
      user = "dbuser"
      password = "dbpass"
    }
    database-schema = {
      run-main-migrations = true
      run-fixup-migrations = true
      repair-schema = false
    }
    signup {
      mode = "open"
    }
    files {
      chunk-size = 524288
      valid-mime-types = [ ]
      default-store = "database"
      stores = {
        database =
          { enabled = true
            type = "default-database"
          }
        filesystem =
          { enabled = false
          }

        minio =
         { enabled = false
         }
      }
    }
    addons = {
      enabled = false
    }
  }
}

So I thought, why not use the "default" values... but it's still the same :/ I don't know anymore. Here's the default docker-compose that gives me the error (simply with environment variables and without the docspell.conf volume/command):

#
# Docker-Compose
#
version: '3.8'
services:
  restserver:
    image: docspell/restserver:latest
    restart: unless-stopped
    container_name: restserver
    ports:
      - "7880:7880"
    environment:
      - TZ=Europe/Madrid
      - DOCSPELL_SERVER_INTERNAL__URL=http://restserver:7880
      - DOCSPELL_SERVER_ADMIN__ENDPOINT_SECRET=value2
      - DOCSPELL_SERVER_AUTH_SERVER__SECRET=value3
      - DOCSPELL_SERVER_BACKEND_JDBC_PASSWORD=dbpass
      - DOCSPELL_SERVER_BACKEND_JDBC_URL=jdbc:postgresql://db:5432/dbname
      - DOCSPELL_SERVER_BACKEND_JDBC_USER=dbuser
      - DOCSPELL_SERVER_BIND_ADDRESS=0.0.0.0
      - DOCSPELL_SERVER_FULL__TEXT__SEARCH_ENABLED=true
      - DOCSPELL_SERVER_FULL__TEXT__SEARCH_SOLR_URL=http://solr:8983/solr/docspell
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_ENABLED=true
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_ENABLED=true
      - DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_HEADER__VALUE=value1
      - DOCSPELL_SERVER_BACKEND_SIGNUP_MODE=open
      - DOCSPELL_SERVER_BACKEND_ADDONS_ENABLED=false
    depends_on:
      - solr
      - db
#    volumes:
#      - ./docspell.conf:/opt/docspell.conf

  joex:
    image: docspell/joex:latest
    restart: unless-stopped
    container_name: joex
    command:
      - -J-Xmx1G
    environment:
      - TZ=Europe/Madrid
      - DOCSPELL_JOEX_APP__ID=joex1
      - DOCSPELL_JOEX_PERIODIC__SCHEDULER_NAME=joex1
      - DOCSPELL_JOEX_SCHEDULER_NAME=joex1
      - DOCSPELL_JOEX_BASE__URL=http://joex:7878
      - DOCSPELL_JOEX_BIND_ADDRESS=0.0.0.0
      - DOCSPELL_JOEX_FULL__TEXT__SEARCH_ENABLED=true
      - DOCSPELL_JOEX_FULL__TEXT__SEARCH_SOLR_URL=http://solr:8983/solr/docspell
      - DOCSPELL_JOEX_JDBC_PASSWORD=dbpass
      - DOCSPELL_JOEX_JDBC_URL=jdbc:postgresql://db:5432/dbname
      - DOCSPELL_JOEX_JDBC_USER=dbuser
      - DOCSPELL_JOEX_ADDONS_EXECUTOR__CONFIG_RUNNER=docker,trivial
      - DOCSPELL_JOEX_CONVERT_HTML__CONVERTER=weasyprint
      - DOCSPELL_JOEX_USER__TASKS_SCAN__MAILBOX_MAIL__CHUNK__SIZE=300
      - DOCSPELL_JOEX_USER__TASKS_SCAN__MAILBOX_MAX__MAILS=20000
      - DOCSPELL_JOEX_SCHEDULER_POOL__SIZE=4
    ports:
      - "7878:7878"
    depends_on:
      - solr
      - db
    volumes:
      - ./docspell.conf:/opt/docspell.conf

  consumedir:
    image: docspell/dsc:latest
    restart: unless-stopped
    container_name: consumedir
    command:
      - dsc
      - "-d"
      - "http://restserver:7880"
      - "watch"
      - "--delete"
      - "-ir"
      - "--not-matches"
      - "**/.*"
      - "--header"
      - "Docspell-Integration:value1"
      - "/opt/docs"
    depends_on:
      - restserver

  db:
    image: postgres:15.1
    container_name: db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=dbuser
      - POSTGRES_PASSWORD=dbpass
      - POSTGRES_DB=dbname

  solr:
    image: solr:9
    container_name: solr
    restart: unless-stopped
    command:
      - solr-precreate
      - docspell
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/docspell/admin/ping"]
      interval: 1m
      timeout: 10s
      retries: 2
      start_period: 30s

Maybe it's something not from docspell, but from my computer?

Enjoyed commented 1 year ago

My apologies. I should have tested this first. The issue seems to be in my browser (?). On my phone it displays properly: [image]

Thanks for your time!

FCUnlimited commented 1 year ago

Wow, I hadn't expected the browser to be the problem. Nice work @Enjoyed! (Thanks to @eikek anyway ;)) I'm using Firefox 108.0.1 and, as I said, I've got the same problem. It looks like a problem with the integrated PDF viewer, which is already known: https://www.reddit.com/r/firefox/comments/noxwav/messed_up_font_rendering_in_firefox_pdf_viewer/

The solution suggested in the link worked for me as well: in about:config, change the Firefox setting browser.display.use_document_fonts from 1 to 0.
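The same preference can also be set from a `user.js` file in the Firefox profile directory, which Firefox reads at startup. This is a minimal sketch: the function name `set_use_document_fonts` and the profile path argument are assumptions for illustration. Note that a value of 0 stops all pages, not only PDFs, from choosing their own fonts.

```shell
# Write browser.display.use_document_fonts into user.js of a Firefox profile.
# Assumptions: the caller passes the profile directory; the function name is hypothetical.
set_use_document_fonts() {
  profile_dir="$1"
  value="${2:-0}"   # 0 = do not let documents choose their own fonts
  if [ ! -d "$profile_dir" ]; then
    echo "not a directory: $profile_dir" >&2
    return 1
  fi
  prefs="$profile_dir/user.js"
  # Drop any earlier line for this preference so the file holds a single value.
  if [ -f "$prefs" ]; then
    grep -v 'browser.display.use_document_fonts' "$prefs" > "$prefs.tmp" || true
    mv "$prefs.tmp" "$prefs"
  fi
  printf 'user_pref("browser.display.use_document_fonts", %s);\n' "$value" >> "$prefs"
}
```

For example, `set_use_document_fonts /path/to/firefox/profile`, followed by a Firefox restart. Calling it again with `1` as the second argument restores the value the thread describes as the starting point.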

eikek commented 1 year ago

Oh, great find! Really not something I would have guessed either. (I'm still on Firefox 107.0 :))