DIYBookScanner / spreads

Modular workflow assistant for book digitization
GNU Affero General Public License v3.0

Using spreads for postprocessing only #195

Open afsartori opened 9 years ago

afsartori commented 9 years ago

Hi,

It's been ages since I last used Spreads and I am glad to see that the project is still in active development and offering a lot of new features!

However, I cannot figure out how to use Spreads to post-process (with tesseract and pdfbeads) existing images that were not captured using the program. I have tried two different routes, and both failed.

1-) post-processing the output of scantailor (.tif files):

$ spread --loglevel debug --verbose postprocess out1
Workflow: Initializing workflow out1
bagit: Adding path /tmp/tmpxaqkKY/bag-info.txt to payload
bagit: Adding path /tmp/tmpxaqkKY/bag-info.txt to payload
bagit: Adding path /tmp/tmpxaqkKY/bag-info.txt to payload
bagit: Adding path /tmp/tmpxaqkKY/bag-info.txt to payload
bagit: Copying path out1/IMG_0117_right.tif to paylod directory
bagit: Copying path out1/IMG_0022_right.tif to paylod directory
.
.
.
bagit: Copying path out1/IMG_0182_left.tif to paylod directory
bagit: Adding path /tmp/tmpxaqkKY/bag-info.txt to payload
bagit: Adding path /home/asartori/bookscan/JConch/vol1/out1/bag-info.txt to payload
spreads encountered an error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/spreads/main.py", line 321, in main
    run()
  File "/usr/local/lib/python2.7/dist-packages/spreads/main.py", line 308, in run
    args.subcommand(config)
  File "/usr/local/lib/python2.7/dist-packages/spreads/cli.py", line 358, in postprocess
    workflow = spreads.workflow.Workflow(config=config, path=path)
  File "/usr/local/lib/python2.7/dist-packages/spreads/workflow.py", line 445, in __init__
    for img in (self.path/'data'/'raw').iterdir()]
  File "/usr/local/lib/python2.7/dist-packages/pathlib.py", line 982, in iterdir
    for name in self._accessor.listdir(self):
  File "/usr/local/lib/python2.7/dist-packages/pathlib.py", line 346, in wrapped
    return strfunc(str(pathobj), *args)
OSError: [Errno 2] No such file or directory: 'out1/data/raw'

Trying to fix this by generating the missing folder does not work:

$ mkdir out1/data/raw
$ spread --loglevel debug --verbose postprocess out1
Workflow: Initializing workflow out1
bagit: Adding path /home/asartori/bookscan/JConch/vol1/out1/bag-info.txt to payload
bagit: Adding path out1/config.yml to payload
Workflow: Starting postprocessing...%
Workflow: Running 'process' hooks
spreadsplug.tesseract: Performing OCR
spreadsplug.tesseract: Language is "chi_sim"
bagit: Path out1/data/done is an empty directory , will be skipped.
bagit: Adding path /home/asartori/bookscan/JConch/vol1/out1/bag-info.txt to payload
bagit: Adding path out1/pagemeta.json to payload
Workflow: Done with postprocessing!

OCR was not performed, but Spreads exits without error.
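
The clean exit above would be consistent with `data/raw` existing but being empty: the traceback shows the workflow building its page list from `(self.path/'data'/'raw').iterdir()`, and `mkdir` alone puts no images there. Below is a minimal sketch, assuming that copying the existing images into that directory is a necessary first step. Nothing in this thread confirms it is sufficient, and spreads may need further metadata. `populate_raw` and `IMAGE_SUFFIXES` are hypothetical names, not part of spreads, and the sketch uses Python 3's `pathlib` rather than the Python 2.7 setup shown in the traceback.

```python
import shutil
from pathlib import Path

# Assumption: these are the image types worth copying; adjust as needed.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def populate_raw(source_dir, workflow_dir):
    """Copy images from source_dir into <workflow_dir>/data/raw.

    Hypothetical helper, not part of spreads. Returns the copied paths
    in sorted order so the result is easy to inspect.
    """
    source = Path(source_dir)
    raw = Path(workflow_dir) / "data" / "raw"
    # Create data/ and data/raw/ in one step (plain mkdir needs data/ to exist).
    raw.mkdir(parents=True, exist_ok=True)

    copied = []
    for img in sorted(source.iterdir()):
        # Skip subdirectories and non-image files such as bag-info.txt.
        if img.is_file() and img.suffix.lower() in IMAGE_SUFFIXES:
            target = raw / img.name
            shutil.copy2(img, target)
            copied.append(target)
    return copied
```

After something like `populate_raw("scantailor_output", "out1")`, the `spread postprocess out1` command could be retried to see whether the tesseract step now finds pages to work on.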

2-) post-processing the JPGs from my cameras trying to invoke scantailor via spreads:

$ spread --verbose postprocess vol2

This results in the same error as in scenario 1 (OSError: [Errno 2] No such file or directory: 'vol2/data/raw').

After creating the missing folder, the new output reveals that scantailor is not being invoked correctly by Spreads:

$ spread --verbose postprocess vol2
Workflow: Initializing workflow vol2
bagit: Adding path /home/asartori/bookscan/JConch/vol2/bag-info.txt to payload
bagit: Adding path /home/asartori/bookscan/JConch/vol2/bag-info.txt to payload
bagit: Adding path /home/asartori/bookscan/JConch/vol2/bag-info.txt to payload
bagit: Adding path vol2/config.yml to payload
Workflow: Starting postprocessing...%
Workflow: Running 'process' hooks
spreadsplug.scantailor: Generating ScanTailor configuration
spreadsplug.scantailor: /usr/bin/scantailor-cli --start-filter=2 --end-filter=5 --layout=1.5 -o=/tmp/tmpX_hohD.ScanTailor --margins-top=2.5 --margins-right=2.5 --margins-bottom=2.5 --margins-left=2.5 /tmp/st-out4zVnEq

Scan Tailor is a post-processing tool for scanned pages.
Version: 0.9.11.1

ScanTailor usage: 
    1) scantailor
    2) scantailor <project_file>
    3) scantailor-cli [options] <image, image, ...> <output_directory>
    4) scantailor-cli [options] <project_file> [output_directory]

1)
    start ScanTailor's GUI interface
2)
    start ScanTailor's GUI interface and load project file
3)
    batch processing images from command line; no GUI
4)
    batch processing project from command line; no GUI
    if output_directory is specified as last argument, it overwrites the one in project file

Options:
    --help, -h
    --verbose, -v
    --layout=, -l=<0|1|1.5|2>       -- default: 0
              0: auto detect
              1: one page layout
            1.5: one page layout but cutting is needed
              2: two page layout
    --layout-direction=, -ld=<lr|rl>    -- default: lr
    --orientation=<left|right|upsidedown|none>
                        -- default: none
    --rotate=<0.0...360.0>          -- it also sets deskew to manual mode
    --deskew=<auto|manual>          -- default: auto
    --content-detection=<cautious|normal|aggressive>
                        -- default: normal
    --content-box=<<left_offset>x<top_offset>:<width>x<height>>
                        -- if set the content detection is se to manual mode
                           example: --content-box=100x100:1500x2500
    --margins=<number>          -- sets left, top, right and bottom margins to same number.
        --margins-left=<number>
        --margins-right=<number>
        --margins-top=<number>
        --margins-bottom=<number>
    --alignment=center          -- sets vertical and horizontal alignment to center
        --alignment-vertical=<top|center|bottom>
        --alignment-horizontal=<left|center|right>
    --dpi=<number>              -- sets x and y dpi. default: 600
        --dpi-x=<number>
        --dpi-y=<number>
    --output-dpi=<number>           -- sets x and y output dpi. default: 600
        --output-dpi-x=<number>
        --output-dpi-y=<number>
    --color-mode=<black_and_white|color_grayscale|mixed>
                        -- default: black_and_white
    --white-margins             -- default: false
    --normalize-illumination        -- default: false
    --threshold=<n>             -- n<0 thinner, n>0 thicker; default: 0
    --despeckle=<off|cautious|normal|aggressive>
                        -- default: normal
    --dewarping=<off|auto>          -- default: off
    --depth-perception=<1.0...3.0>      -- default: 2.0
    --start-filter=<1...6>          -- default: 4
    --end-filter=<1...6>            -- default: 6
    --output-project=, -o=<project_name>

spreadsplug.tesseract: Performing OCR%
spreadsplug.tesseract: Language is "chi_sim"
bagit: Path vol2/data/done is an empty directory , will be skipped.
bagit: Adding path /home/asartori/bookscan/JConch/vol2/bag-info.txt to payload
bagit: Adding path vol2/pagemeta.json to payload
Workflow: Done with postprocessing!
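
In the logged scantailor-cli command above, the only positional argument is `/tmp/st-out4zVnEq`, while usage form 3 expects one or more images before the output directory. That appears to be why the help text was printed: with an empty `data/raw`, there were no images to pass. The sketch below assembles the same options seen in the log and refuses to build a command when the image list is empty. `build_scantailor_command` is a hypothetical helper written for illustration; it is not how the spreads scantailor plugin is actually implemented.

```python
def build_scantailor_command(images, output_dir, project_file,
                             binary="scantailor-cli", margin=2.5):
    """Build an argument list for usage form 3 of scantailor-cli:

        scantailor-cli [options] <image, image, ...> <output_directory>

    Hypothetical helper; the option values mirror the logged command.
    """
    images = [str(p) for p in images]
    if not images:
        # Without images, the output directory would be the only
        # positional argument, as in the logged invocation.
        raise ValueError("no input images to pass to scantailor-cli")

    options = [
        "--start-filter=2",
        "--end-filter=5",
        "--layout=1.5",
        "-o={}".format(project_file),
    ]
    options += ["--margins-{}={}".format(side, margin)
                for side in ("top", "right", "bottom", "left")]

    # Options first, then every image, then the output directory last.
    return [binary] + options + images + [str(output_dir)]
```

The returned list could be handed to `subprocess.check_call`. Raising early on an empty list makes the missing-input case visible instead of letting the workflow report "Done with postprocessing!".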

I suspect the problem lies in the project folder structure that Spreads expects to find (created during the capture workflow that I am skipping). Any ideas on how I could fix this would be greatly appreciated!

adongy commented 9 years ago

There was a recent patch to use spreads with older workflows; can you try it? I can't find it again, but it was fairly recent.

afsartori commented 9 years ago

Thank you for your suggestion. I ended up adapting parts of spreads' source code to achieve what I need via a separate Python script.

alclary commented 9 years ago

I am also trying to use spreads strictly for its post-processing tool chain. Like the issue described above, I cannot find a way to start post-processing. I have manually created a /data/raw directory for the unprocessed .tif files. How does the capture process signal spreads to start post-processing? Is there a way to spoof this signal so that I can start post-processing without actually using spreads' capture step?

I would really like to see this as a supported feature. I am not a developer, but it looks to me like a fairly simple thing to implement. Could there be an option to upload images to a workflow's raw folder and then start from the post-processing step, skipping the capture step?

EDIT: I just realized I can initiate the processing stage via the API, but it still does not process anything. What is the minimum amount of metadata necessary for post-processing to proceed correctly?
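
The thread does not establish what metadata spreads needs, but the logs above do name the files and directories a workflow folder ends up with (`bag-info.txt`, `config.yml`, `pagemeta.json`, `data/raw`, `data/done`). A small sketch for comparing a hand-built folder against a captured one follows. `describe_workflow` is a hypothetical helper, and the name lists are taken only from the log output in this issue, so they may be incomplete.

```python
from pathlib import Path

# Names observed in the log output of this issue; possibly incomplete.
EXPECTED_FILES = ("bag-info.txt", "config.yml", "pagemeta.json")
EXPECTED_DIRS = ("data/raw", "data/done")


def describe_workflow(workflow_dir):
    """Report which known workflow entries exist (hypothetical helper).

    Files map to True/False. Directories map to the number of regular
    files they contain, or None when the directory is missing.
    """
    root = Path(workflow_dir)
    report = {}
    for name in EXPECTED_FILES:
        report[name] = (root / name).is_file()
    for name in EXPECTED_DIRS:
        directory = root / name
        if directory.is_dir():
            report[name] = sum(1 for p in directory.iterdir() if p.is_file())
        else:
            report[name] = None
    return report
```

Running this on a workflow created through a normal capture and on the manually assembled one would show which entries differ, which may narrow down what post-processing is missing.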