agl / jbig2enc

JBIG2 Encoder

a patch to improve pdf creation #10

Open akryukov opened 14 years ago

akryukov commented 14 years ago

Hi,

I propose a patch which changes jbig2 behavior in two respects. First, the files generated in the "-p" mode now retain their original names, and only the extension is changed (I use ".jbig2", but anything else would be OK). A numerical suffix is added in case of name clashes (or for images which come from multipage TIFF files). For this reason the 'basename' parameter is gone. The reason for this change is that source images may have some accompanying files (such as background images previously separated with a scan processing application). In such cases the file names contain useful information which should not be lost during processing/conversion.
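The naming rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the patch's actual C++ code: the function name `output_names` and the `-N` suffix format are assumptions, since the post does not say how the numerical suffix is formatted.

```python
import os


def output_names(inputs, ext=".jbig2"):
    """Map each input image path to a unique output name with a new extension."""
    used = set()
    result = []
    for path in inputs:
        # Keep the original file name, drop the directory and old extension.
        stem = os.path.splitext(os.path.basename(path))[0]
        candidate = stem + ext
        n = 1
        # On a name clash, append a numeric suffix (the format is assumed).
        while candidate in used:
            candidate = "%s-%d%s" % (stem, n, ext)
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

With this sketch, two inputs that share a stem (for example `scans/p1.tif` and `other/p1.png`) end up with distinct output names, while the stem itself stays recognizable.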

The second change makes it possible to generate more than one symbol dictionary, so that the loading speed for large PDF files can be increased. There is now a new option (-P, --pages-per-dict), which specifies how many pages should be processed in the same pass. The default value for this parameter is 15.
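The effect of a pages-per-dictionary limit can be pictured as plain batching: pages are split into consecutive groups, and each group gets its own symbol dictionary. The sketch below is a hypothetical illustration of that grouping (the helper name `batches` is not from the patch); only the default of 15 comes from the post.

```python
def batches(pages, pages_per_dict=15):
    """Yield consecutive groups of pages; each group would share one dictionary."""
    if pages_per_dict < 1:
        raise ValueError("pages_per_dict must be at least 1")
    for start in range(0, len(pages), pages_per_dict):
        yield pages[start:start + pages_per_dict]
```

A viewer then only has to decode the dictionary for the group containing the page being displayed, rather than one dictionary covering the whole document.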

I also propose a modified version of pdf.py, implementing support for background images, which can be combined with the foreground mask in the same PDF file. Several graphical formats (PNG, TIFF, JPEG) are supported. It is possible either to use graphics stripped by jbig2 at the previous stage, or to prepare images separately in a different application, provided that the file names follow the same convention.
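One way to read "the file names follow the same convention" is that a background image shares its stem with the foreground mask. The sketch below pairs files on that assumption; the matching rule, the helper name `find_background`, and the extension list are all hypothetical, since the modified pdf.py itself is not shown in this post.

```python
import os

# Extensions for the formats named in the post (PNG, TIFF, JPEG).
BACKGROUND_EXTS = (".png", ".tif", ".tiff", ".jpg", ".jpeg")


def find_background(mask_path, candidates):
    """Return the candidate whose stem matches the mask's stem, or None."""
    stem = os.path.splitext(os.path.basename(mask_path))[0]
    for cand in candidates:
        cstem, cext = os.path.splitext(os.path.basename(cand))
        if cstem == stem and cext.lower() in BACKGROUND_EXTS:
            return cand
    return None
```

Under this reading, a page without a matching background would simply be emitted as a bitonal page.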

BTW it might be reasonable to rename pdf.py to something more meaningful, so that the script could be safely installed somewhere into the PATH.

The files can be downloaded here: http://www.thessalonica.org.ru/downloads/jbig2.patch.gz http://www.thessalonica.org.ru/downloads/pdf.py.gz

DingoDog commented 14 years ago

I downloaded your patch and tried to apply it, but got these errors:

patch -p0 < jbig2.patch

can't find file to patch at input line 4
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
|diff -ur agl-jbig2enc-git.orig//jbig2.cc agl-jbig2enc-git//jbig2.cc
|--- agl-jbig2enc-git.orig//jbig2.cc 2009-11-05 11:27:45.000000000 +0300
|+++ agl-jbig2enc-git//jbig2.cc 2009-11-07 00:31:39.000000000 +0300
File to patch: jbig2.cc
patching file jbig2.cc
Hunk #1 FAILED at 39.
Hunk #2 FAILED at 191.
Hunk #3 FAILED at 304.
Hunk #4 FAILED at 354.
Hunk #5 FAILED at 393.
Hunk #6 FAILED at 431.
Hunk #7 FAILED at 571.
7 out of 7 hunks FAILED -- saving rejects to file jbig2.cc.rej

akryukov commented 14 years ago

That's my fault: the patch was prepared according to my directory tree (i.e. it assumed the unpatched sources had been downloaded into a directory called agl-jbig2enc-git). I have now uploaded a corrected version of the patch at the same location. This version should be placed directly into the directory with the jbig2enc sources before you execute

patch -p0 < jbig2.patch
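For readers unfamiliar with the -p option: patch strips the given number of leading path components from the file names in the diff header before looking for the file, and treats a run of adjacent slashes as one. The sketch below approximates that rule in Python; `strip_components` is a hypothetical helper written for illustration, not part of patch or jbig2enc.

```python
import re


def strip_components(path, n):
    """Approximate how `patch -pN` shortens a file name from a diff header."""
    # A run of adjacent slashes counts as a single slash.
    parts = re.sub(r"/+", "/", path).split("/")
    # Return None when there are not enough components left to strip.
    return "/".join(parts[n:]) if n < len(parts) else None
```

Applied to the header of the first patch, `strip_components("agl-jbig2enc-git//jbig2.cc", 0)` keeps the directory prefix, which does not exist inside the source directory, while stripping one component leaves just `jbig2.cc`.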

DingoDog commented 14 years ago

Thanks (also for your fonts, especially "Old Standard", which I use).

I applied the patch; this is the output:

patch -p0<jbig2.patch

patching file jbig2.cc
Hunk #1 succeeded at 37 (offset -2 lines).
Hunk #3 succeeded at 302 (offset -2 lines).
Hunk #5 succeeded at 391 (offset -2 lines).
Hunk #6 FAILED at 429.
Hunk #7 succeeded at 572 (offset -1 lines).
1 out of 7 hunks FAILED -- saving rejects to file jbig2.cc.rej

After patching I tried to build, but something is wrong:

make

g++ -c jbig2enc.cc -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
g++ -c jbig2arith.cc -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
g++ -c jbig2sym.cc -DUSEEXT -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
ar -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o
a - jbig2enc.o
a - jbig2arith.o
a - jbig2sym.o
g++ -o jbig2 jbig2.cc -L. -ljbig2enc ../leptonlib-1.58/src/liblept.a -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3 -lpng -ljpeg -ltiff -lm
jbig2.cc: In function 'int main(int, char**)':
jbig2.cc:501: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:538: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:553: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:564: error: 'pages_to_compress' was not declared in this scope
jbig2.cc:567: error: 'cnt' was not declared in this scope
jbig2.cc:578: error: 'cnt' was not declared in this scope
jbig2.cc: At global scope:
jbig2.cc:207: warning: 'char* replacesuffix(char*, const char*)' defined but not used
jbig2.cc:220: warning: 'char* get_page_or_dictname(char**, int, const char*, int)' defined but not used
jbig2.cc:273: warning: 'int is_tiffformat(int)' defined but not used
make: *** [jbig2] Error 1

akryukov commented 14 years ago

DingoDog

It looks like you are attempting to patch the wrong version. You should download the most recent sources from git:

git clone git://git://github.com/agl/jbig2enc.gitgithub.com/agl/jbig2enc.git

DingoDog commented 14 years ago

First of all, many thanks for your answer.

Yes, I tried to apply the patch to jbig2 0.27, downloadable at:

http://github.com/agl/jbig2enc/tarball/0.27

Now I used Git, but it seems to fail:

git clone git://git://github.com/agl/jbig2enc.gitgithub.com/agl/jbig2enc.git

Initialized empty Git repository in /root/NewDir/jbig2enc/.git/
fatal: Unable to look up git (port ) (Servname not supported for ai_socktype)
fetch-pack from 'git://git://github.com/agl/jbig2enc.gitgithub.com/agl/jbig2enc.git' failed.

This command worked instead:

git clone git://github.com/agl/jbig2enc.git src

Initialized empty Git repository in /root/NewDir/src/.git/
remote: Counting objects: 118, done.
remote: Compressing objects: 100% (112/112), done.
Indexing 118 objects...
remote: Total 118 (delta 75), reused 0 (delta 0)
100% (118/118) done
Resolving 75 deltas...
100% (75/75) done

and applying the patch was successful.

I then downloaded the Leptonica libs 1.63 and built jbig2enc (not yet tried). I can't wait to try it! Meanwhile, thanks again for your patch and your answers.

EDIT:

Tried it and it is working, except that when I use your modified pdf.py it says:

File "/root/my-applications/bin/thessalonica-pdf.py", line 27, in <module>
    from PIL import Image

So I think I don't have the Python Imaging Library (PIL), is that right? I'm currently looking for it but have not found it yet.
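The traceback above is cut off at the import line, so a missing PIL is the likely reading but not a certain one. A quick way to check whether a module can be found, without running the whole script, is sketched below. Note that it uses Python 3's importlib, whereas the scripts in this thread are Python 2; `has_module` is a hypothetical helper name. On current systems PIL is provided by the Pillow package.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Report whether the imaging library needed by the modified pdf.py is present.
    print("PIL available:", has_module("PIL"))
```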

EDIT:

I built PIL from source, then first ran

jbig2

and then your modified pdf.py:

pdf.py *.jbig2 out>test.pdf

but the resulting PDF has b/w images. I thought the pictures would be in color; maybe I did not understand the meaning of your sentence:

"which can be combined with the foreground mask in the same pdf file"

How can this be done (mixing the foreground mask with b/w text)? Excuse my ignorance.

mistydemeo commented 12 years ago

@akryukov, I recognize this is a very old issue, but wanted to mention I'd consider pulling this into mistydemeo/jbig2enc.

Since you've implemented both the symbol page limiting functionality and the image/text layer functionality in PDFBeads, do you think there's anything significant here that is still worth including directly in jbig2enc? I think the layer functionality is out of scope for pdf.py, since that's really just a simple demo utility - I would rather avoid adding new dependencies to it.

The symbol page feature seems useful, however. Probably still within scope of jbig2enc. If you think it's still relevant, would you mind rebasing your patch on the current master at my fork and submitting a pull request there?

Thanks!

zdenop commented 12 years ago

For http://www.thessalonica.org.ru/downloads/pdf.py.gz I got an error message (File Not Found!). Can you (or somebody else who has a copy) post this file once again?

akryukov commented 12 years ago

On Fri, 29 Jun 2012 06:46:32 -0700 zdenop wrote:

> For http://www.thessalonica.org.ru/downloads/pdf.py.gz I got an error message (File Not Found!). Can you (or somebody else who has a copy) post this file once again?

It is obsolete and no longer needed: try pdfbeads (which works even with unpatched jbig2enc) instead.

Regards, Alexey Kryukov

Moscow State University Faculty of History

DingoDog commented 12 years ago

download from here:

http://ge.tt/7GmUTpJ/v/0

zdenop commented 12 years ago

Thanks. I am aware of pdfbeads and glad to have it. I just want to evaluate your proposed functionality and probably merge it into my fork of jbig2enc...

BTW: I am not sure it is a good idea to use the .jbig2 or .jb2 extension for the current jbig2 output. I was not able to read these files with stduviewer (which should be able to open and read JBIG2 files). I plan to do more tests on this.
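One possible explanation, offered here as an assumption rather than a diagnosis: jbig2's -p mode produces data meant for embedding in a PDF, and such streams do not begin with the file header that standalone JBIG2 files carry. A standalone file starts with the 8-byte ID string 0x97 0x4A 0x42 0x32 0x0D 0x0A 0x1A 0x0A, so a viewer that looks for it would reject the embedded form. The check can be sketched as follows; `has_jbig2_file_header` is a hypothetical helper name.

```python
# ID string that opens a standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"


def has_jbig2_file_header(data):
    """Return True if the bytes begin with the standalone JBIG2 file ID."""
    return data[:8] == JBIG2_FILE_ID


def file_has_jbig2_header(path):
    """Read the first 8 bytes of a file and test them for the file ID."""
    with open(path, "rb") as f:
        return has_jbig2_file_header(f.read(8))
```

If the check fails on the -p output, the extension question comes down to whether a name usually associated with standalone files should be used for header-less streams.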

akryukov commented 12 years ago

On Fri, 29 Jun 2012 11:48:44 -0700 zdenop wrote:

> BTW: I am not sure it is a good idea to use the .jbig2 or .jb2 extension for the current jbig2 output. I was not able to read these files with stduviewer (which should be able to open and read JBIG2 files). I plan to do more tests on this.

Of course you are absolutely right here, but... can you propose another meaningful extension for those files?

Regards, Alexey Kryukov

Moscow State University Faculty of History

galex751 commented 6 years ago

Hi Mr Kryukov, I'd like to test your patch with the -P parameter, but I'm not able to download it from http://www.thessalonica.org.ru/downloads/jbig2.patch.gz. Could you post the sources somewhere so that they can be downloaded?

Many Thanks Alessandro

yb85 commented 4 years ago

Dear @akryukov, I am very interested in your patch, as I encounter serious slowdowns on large documents (>100 pages). Would it be possible to post it online? Thanks, Yann

DingoDog commented 4 years ago

> Dear @akryukov, I am very interested in your patch, as I encounter serious slowdowns on large documents (>100 pages). Would it be possible to post it online? Thanks, Yann

Sorry for the delay. I uploaded the patch here:

http://ge.tt/8BllCy23

DingoDog commented 4 years ago

> Hi Mr Kryukov, I'd like to test your patch with the -P parameter, but I'm not able to download it from http://www.thessalonica.org.ru/downloads/jbig2.patch.gz. Could you post the sources somewhere so that they can be downloaded?
>
> Many Thanks Alessandro

I sent a full pack with the patch and other goodies to the mail address you gave me on the diybookscanner forum.

useretail commented 3 years ago

could you guys re-upload the patch please?

DingoDog commented 3 years ago

I re-uploaded the patch to my site:

http://dokupuppylinux.info/media/jbig2.patch.zip

useretail commented 3 years ago

mirror: https://pastebin.com/raw/WT4TwUxZ

jaumegs commented 1 year ago

@DingoDog

Is it possible for you to re-upload the latest version of "pdf.py.gz" from before you obsoleted it in favor of PDFBeads?

I'm interested in the other goodies from the diybookscanner forum too...

Thank you.

DingoDog commented 1 year ago

> @DingoDog
>
> Is it possible for you to re-upload the latest version of "pdf.py.gz" from before you obsoleted it in favor of PDFBeads?
>
> I'm interested in the other goodies from the diybookscanner forum too...
>
> Thank you.

Sure. Here is the code of the modified pdf.py:


import sys
import re
import struct
import glob
import os

# This is a very simple script to make a PDF file out of the output of a
# multipage symbol compression.
# Run ./jbig2 -s -p image1.jpeg image1.jpeg ...
# python pdf.py output > out.pdf
#
# Note: this is Python 2 code (print statement, file() builtin).

class Ref:
  def __init__(self, x):
    self.x = x
  def __str__(self):
    return "%d 0 R" % self.x

class Dict:
  def __init__(self, values = {}):
    self.d = {}
    self.d.update(values)

  def __str__(self):
    s = ['<< ']
    for (x, y) in self.d.items():
      s.append('/%s ' % x)
      s.append(str(y))
      s.append("\n")
    s.append(">>\n")

    return ''.join(s)

global_next_id = 1

class Obj:
  next_id = 1
  def __init__(self, d = {}, stream = None):
    global global_next_id

    if stream is not None:
      d['Length'] = str(len(stream))
    self.d = Dict(d)
    self.stream = stream
    self.id = global_next_id
    global_next_id += 1

  def __str__(self):
    s = []
    s.append(str(self.d))
    if self.stream is not None:
      s.append('stream\n')
      s.append(self.stream)
      s.append('\nendstream\n')
    s.append('endobj\n')

    return ''.join(s)

class Doc:
  def __init__(self):
    self.objs = []
    self.pages = []

  def add_object(self, o):
    self.objs.append(o)
    return o

  def add_page(self, o):
    self.pages.append(o)
    return self.add_object(o)

  def __str__(self):
    a = []
    j = [0]  # running byte offset, kept in a list so add() can update it
    offsets = []

    def add(x):
      a.append(x)
      j[0] += len(x) + 1
    add('%PDF-1.4')
    for o in self.objs:
      offsets.append(j[0])
      add('%d 0 obj' % o.id)
      add(str(o))
    xrefstart = j[0]
    a.append('xref')
    a.append('0 %d' % (len(offsets) + 1))
    a.append('0000000000 65535 f ')
    for o in offsets:
      a.append('%010d 00000 n ' % o)
    a.append('')
    a.append('trailer')
    a.append('<< /Size %d\n/Root 1 0 R >>' % (len(offsets) + 1))
    a.append('startxref')
    a.append(str(xrefstart))
    a.append('%%EOF')

    # sys.stderr.write(str(offsets) + "\n")

    return '\n'.join(a)

def ref(x):
  return '%d 0 R' % x

def main(symboltable='symboltable', pagefiles=glob.glob('page-*')):
  doc = Doc()
  doc.add_object(Obj({'Type' : '/Catalog', 'Outlines' : ref(2), 'Pages' : ref(3)}))
  doc.add_object(Obj({'Type' : '/Outlines', 'Count': '0'}))
  pages = Obj({'Type' : '/Pages'})
  doc.add_object(pages)
  symd = doc.add_object(Obj({}, file(symboltable, 'r').read()))
  page_objs = []

  for p in pagefiles:
    try:
      contents = file(p).read()
    except IOError:
      sys.stderr.write("error reading page file %s\n"% p)
      continue
    # Width and height are read from bytes 11..18 of the page data.
    (width, height) = struct.unpack('>II', contents[11:19])
    xobj = Obj({'Type': '/XObject', 'Subtype': '/Image',
                'Width': str(width), 'Height': str(height),
                'ColorSpace': '/DeviceGray', 'BitsPerComponent': '1',
                'Filter': '/JBIG2Decode',
                'DecodeParms': ' << /JBIG2Globals %d 0 R >>' % symd.id}, contents)
    contents = Obj({}, 'q %d 0 0 %d 0 0 cm /Im1 Do Q' % (width, height))
    resources = Obj({'ProcSet': '[/PDF /ImageB]',
                     'XObject': '<< /Im1 %d 0 R >>' % xobj.id})
    page = Obj({'Type': '/Page', 'Parent': '3 0 R',
                'MediaBox': '[ 0 0 %d %d ]' % (width, height),
                'Contents': ref(contents.id),
                'Resources': ref(resources.id)})
    [doc.add_object(x) for x in [xobj, contents, resources, page]]
    page_objs.append(page)

  # Fill in the page tree once all pages have been added.
  pages.d.d['Count'] = str(len(page_objs))
  pages.d.d['Kids'] = '[' + ' '.join([ref(x.id) for x in page_objs]) + ']'

  print str(doc)

def usage(script, msg):
  if msg:
    sys.stderr.write("%s: %s\n"% (script, msg))
  sys.stderr.write("Usage: %s [file_basename] > out.pdf\n"% script)
  sys.exit(1)

if __name__ == '__main__':

  if len(sys.argv) == 2:
    sym = sys.argv[1] + '.sym'
    pages = glob.glob(sys.argv[1] + '.[0-9]*')
  elif len(sys.argv) == 1:
    sym = 'symboltable'
    pages = glob.glob('page-*')
  else:
    usage(sys.argv[0], "wrong number of args!")

  if not os.path.exists(sym):
    usage(sys.argv[0], "symbol table %s not found!"% sym)
  elif len(pages) == 0:
    usage(sys.argv[0], "no pages found!")

  pages.sort()  # glob order is arbitrary; keep pages in file-name order
  main(sym, pages)
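The least obvious step in the script is how it gets the page size: it unpacks two big-endian 32-bit integers from bytes 11 to 18 of each page file. The sketch below isolates that step in Python 3. It assumes, as the script does, that the page data begins with a page information segment whose header is 11 bytes long (4-byte segment number, 1 flag byte, 1 referred-to byte, 1 page association byte, 4-byte data length); the name `page_size` and the length check are additions for illustration.

```python
import struct


def page_size(page_data):
    """Read (width, height) from the start of a page stream, as pdf.py does."""
    if len(page_data) < 19:
        raise ValueError("page data too short to hold width and height")
    # Bytes 0-10: assumed segment header; bytes 11-18: width, height (uint32 BE).
    return struct.unpack(">II", page_data[11:19])


if __name__ == "__main__":
    # Build a synthetic 11-byte header followed by a width and a height.
    header = struct.pack(">IBBBI", 0, 0x30, 0, 1, 19)
    sample = header + struct.pack(">II", 2480, 3508) + b"\x00" * 11
    print(page_size(sample))
```

If a page file were laid out differently (for example with a longer segment header), the fixed offset would return wrong dimensions, which is one reason the script is best treated as a simple demo.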

Mark-Joy commented 9 months ago

@DingoDog Could you please re-upload jbig2.patch.gz?