kdePackages.skanpage does not support OCR even though tesseract is installed

NixOS / nixpkgs


kdePackages.skanpage does not support OCR even though tesseract is installed #315039

Open devurandom opened 5 months ago

devurandom commented 5 months ago

Describe the bug

I have kdePackages.skanpage and tesseract installed.

Tesseract sees the language files:

❯ tesseract --list-langs | wc -l
130

Skanpage, however, cannot OCR my document and reports that language files are missing.

Steps To Reproduce

Steps to reproduce the behavior:

1. Install kdePackages.skanpage and tesseract
2. Scan a document using Skanpage
3. Click "Export PDF"

Expected behavior

Skanpage should be able to OCR my document.

Notify maintainers

@schuelermine, @ilya-fedin, @LunNova, @mjm, @NickCao, @ttuegel

Metadata

❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.9.1, NixOS, 24.05 (Uakari), 24.05.20240524.d12251e`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.2`
 - channels(root): `"nixos"`
 - nixpkgs: `/nix/store/hp43s4p11vbq7qfw6v8w32vlfa9z9mry-source`


schuelermine commented 5 months ago

I don’t believe this is an error in tesseract packaging, but probably an oversight in skanpage or KDE packaging.

eclairevoyant commented 5 months ago

Yeah it's a clunky design. You'll have to do something like pkgs.kdePackages.skanpage.override { tesseractLanguages = [ "eng" ]; }

I guess the languages are listed at https://github.com/NixOS/nixpkgs/blob/20d5e902db240050f9fe1ee627f4a0168193c52a/pkgs/applications/graphics/tesseract/languages.nix#L158-L286

Installing tesseract separately will do nothing here.
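
A minimal sketch of how that override might be used in a NixOS configuration, assuming the package is installed through `environment.systemPackages`. The language list is only an example (`deu` is an illustrative second entry); pick the codes you need from the languages file linked above.

```nix
# configuration.nix (sketch; the language list is only an example)
{ pkgs, ... }:
{
  environment.systemPackages = [
    # Builds skanpage with the listed tesseract languages passed in,
    # instead of relying on a separately installed tesseract.
    (pkgs.kdePackages.skanpage.override {
      tesseractLanguages = [ "eng" "deu" ];
    })
  ];
}
```

The parentheses matter: without them, the list would contain the `override` function and the attribute set as two separate elements.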

devurandom commented 5 months ago

> Yeah it's a clunky design. You'll have to do something like pkgs.kdePackages.skanpage.override { tesseractLanguages = [ "eng" ]; }

This will rebuild skanpage, right? Is there an option to change this in nixpkgs, to save me (and probably others who want to use skanpage with OCR) the rebuilds?

I tried the following (and several variations with = and without, with different names instead of pkgs.tesseract.languages that I saw in the tesseract module, ...) in my configuration, but could not get Nix to accept it:

  pkgs.kdePackages.skanpage.override = {
    tesseractLanguages = pkgs.tesseract.languages;
  };

I guess it is not supposed to be done this way. Could you please help and tell me how to set this correctly in my NixOS configuration?

eclairevoyant commented 5 months ago

> This will rebuild skanpage, right?

Yeah, that's why I find it clunky... it would be nice to have a wrapper package instead to prevent such rebuilds.

> I tried the following (and several variations

tesseractLanguages accepts a list of strings, while pkgs.tesseract.languages is an attrset. Normally you would select the individual languages you want. If you really want to have all languages available, you could do something like tesseractLanguages = builtins.attrNames pkgs.tesseract.languages, though I don't know the disk requirements of such a setup offhand.
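
For reference, a hedged sketch of the "all languages" variant in a NixOS configuration. It assumes, as described above, that `pkgs.tesseract.languages` is an attrset whose attribute names are the language codes; the `allLanguages` binding is a name introduced here for illustration. The earlier `pkgs.kdePackages.skanpage.override = { ... };` attempt is rejected because `override` is a function to call on the package, not a NixOS option to assign.

```nix
# configuration.nix (sketch; enables every language tesseract packages)
{ pkgs, ... }:
let
  # attrNames converts the attrset into the list of strings
  # that tesseractLanguages expects.
  allLanguages = builtins.attrNames pkgs.tesseract.languages;
in
{
  environment.systemPackages = [
    (pkgs.kdePackages.skanpage.override {
      tesseractLanguages = allLanguages;
    })
  ];
}
```

This still rebuilds skanpage locally, and the disk cost of pulling in every language is not known here, so selecting only the languages you need is probably the safer default.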