added: extract pdf

GoogleCodeExporter commented 8 years ago

Extract PDF files.
Uses pyPdf as a dependency.
Integrated this into the preprocessing module.

Original issue reported on code.google.com by hjebb...@gmail.com on 8 Nov 2011 at 12:52

GoogleCodeExporter commented 8 years ago

I have actually found that pyPdf is not great for this; a few more pdf files I 
tried were not usable. I am now experimenting with pdfminer instead, which seems 
more promising. I will provide an update on this soon, hopefully before the 2.1.0 
final release.

Original comment by mjg1964 on 13 Dec 2011 at 3:47

GoogleCodeExporter commented 8 years ago

You use ReportLab for outgoing pdf files.
It seems to be the one with the most features, and the best maintained.
You have probably already checked it, but what are the reasons not to use that?

henk-jan

Original comment by hjebb...@gmail.com on 13 Dec 2011 at 12:01

GoogleCodeExporter commented 8 years ago

Reportlab is for generating pdf. They also have PageCatcher, but it seems to be 
VERY low-profile.
henk-jan

Original comment by hjebb...@gmail.com on 13 Dec 2011 at 12:21

GoogleCodeExporter commented 8 years ago

I have looked at a few packages now, and pdfminer seems to be the best. It can 
return each text element with its x,y coordinates; I am using this to generate 
a csv. It is still limited by the pdf content and how it was generated and will 
usually require further parsing in the mapping script. If the pdf is an "image" 
then you're out of luck. This should be considered a last resort if you can't 
get the file in some other format.

http://phaseit.net/claird/comp.text.pdf/PDF_converters.html
http://www.itworld.com/software/102862/friends-dont-let-friends-extract-pdf-content

I will clean it up a bit and do some more testing today, then post it here.

Kind Regards,
Mike

Original comment by mjg1964 on 13 Dec 2011 at 9:06

GoogleCodeExporter commented 8 years ago

Ok here it is. This has a completely new extractpdf function that uses 
pdfminer. The x,y coordinates of text are used to sort them into rows & columns 
in csv format. Page & line numbers are first 2 fields. From there you can use a 
generic csv grammar and mapping. Also attached a sample pdf order and the 
corresponding extracted csv.
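
A minimal sketch of the row and column grouping described above, for readers without the attachment. It is not the code from preprocess.py. It assumes the text elements are already available as (page, x, y, text) tuples (with pdfminer these would typically come from the bounding boxes of the layout objects), that y grows upward as in PDF coordinates, and that line numbers restart on each page. The name `elements_to_csv` and the `y_tolerance` parameter are hypothetical.

```python
import csv
import io


def elements_to_csv(elements, y_tolerance=2.0):
    """Group (page, x, y, text) tuples into csv rows.

    Assumptions (not taken from the original preprocess.py):
    - y grows upward, so a larger y is higher on the page;
    - elements whose y differs by at most y_tolerance share a row;
    - line numbers restart at 1 on every page.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    # Order by page, then top to bottom, then left to right.
    ordered = sorted(elements, key=lambda e: (e[0], -e[2], e[1]))

    rows = []  # each entry: [page, y of first element in row, [(x, text), ...]]
    for page, x, y, text in ordered:
        same_row = (
            rows
            and rows[-1][0] == page
            and abs(rows[-1][1] - y) <= y_tolerance
        )
        if same_row:
            rows[-1][2].append((x, text))
        else:
            rows.append([page, y, [(x, text)]])

    current_page = None
    line_no = 0
    for page, _, cells in rows:
        if page != current_page:
            current_page, line_no = page, 0
        line_no += 1
        # Page and line number come first, then the cells from left to right.
        writer.writerow([page, line_no] + [text for _, text in sorted(cells)])
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        (1, 50.0, 700.0, "Order"),
        (1, 200.0, 700.5, "12345"),
        (1, 50.0, 680.0, "Item"),
        (2, 50.0, 700.0, "Total"),
    ]
    print(elements_to_csv(sample), end="")
```

The tolerance is there because text on the same visual line often has slightly different y values; a suitable value depends on the pdf and has to be found by trial.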

I have tried a lot of different input files, only found a few that couldn't be 
cleanly extracted.

Kind Regards,
Mike

Original comment by mjg1964 on 14 Dec 2011 at 10:25

Attachments:

GoogleCodeExporter commented 8 years ago

I have encoded the csv output as utf-8, as recommended in 
http://docs.python.org/library/csv.html#examples

The csv module doesn't directly support reading and writing Unicode, but it 
is 8-bit-clean save for some problems with ASCII NUL characters. So you can 
write functions or classes that handle the encoding and decoding for you as 
long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
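
The quoted passage describes the Python 2 csv module. For comparison, here is a minimal sketch of writing utf-8 encoded csv in Python 3, where the csv module works on text and the encoding is chosen when the stream is opened. This is not the code from preprocess.py; the function name `rows_to_utf8_csv` is hypothetical.

```python
import csv
import io


def rows_to_utf8_csv(rows):
    """Return the rows as utf-8 encoded csv bytes (Python 3 sketch)."""
    buf = io.BytesIO()
    # newline="" leaves line endings to the csv module, as its docs recommend.
    text = io.TextIOWrapper(buf, encoding="utf-8", newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    data = buf.getvalue()
    text.detach()  # keep the wrapper from closing buf when it is discarded
    return data


if __name__ == "__main__":
    print(rows_to_utf8_csv([["1", "1", "café"]]))
```

When writing to a file instead of memory, `open(path, "w", encoding="utf-8", newline="")` plays the same role as the wrapper here.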

Original comment by mjg1964 on 19 Dec 2011 at 6:48

Attachments:

preprocess.py

GoogleCodeExporter commented 8 years ago

Hi Mike,

The charset of the csv that is generated is not the problem.
What is not clear to me is which charset is used for reading and writing the 
'raw' format?

Original comment by hjebb...@gmail.com on 19 Dec 2011 at 12:24

GoogleCodeExporter commented 8 years ago

Hi henk-jan,
I'm not sure where the problem would be? (admittedly I don't fully understand 
the use of charsets). If you are using extractexcel or extractpdf it should be 
ok, these don't need to be "raw" but can have a csv grammar. If you are just 
taking raw input straight to/from the mapping script (no grammar), maybe some 
decoding or encoding would need to be done in the mapping script itself. Do you 
have an example that would require this?

Original comment by mjg1964 on 20 Dec 2011 at 5:15

GoogleCodeExporter commented 8 years ago

Added an error for PDF with no text, maybe just image(s).
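
A minimal sketch of such a check, assuming the extracted text elements are (page, x, y, text) tuples. The names `NoTextInPdfError` and `require_text` are hypothetical; the attached preprocess.py may report the error differently.

```python
class NoTextInPdfError(Exception):
    """Hypothetical error type for a pdf without extractable text."""


def require_text(elements):
    """Raise if none of the (page, x, y, text) tuples carries any text."""
    if not any(text.strip() for _, _, _, text in elements):
        raise NoTextInPdfError(
            "PDF contains no extractable text; it may consist only of images."
        )
    return elements
```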

Original comment by mjg1964 on 20 Dec 2011 at 5:32

Attachments:

preprocess.py

GoogleCodeExporter commented 8 years ago

'Charsets' (I just keep calling them that) are a very specific problem.
Check some examples:
- ascii: every char is represented by one byte (nothing above char 127).
- iso-8859-1: every char is represented by one byte; above 127 are the special 
characters (éãï etc.) suited for West European languages.
- utf-8: ascii-compatible: the first 128 chars are like ascii (every one of 
these chars is one byte); chars above this are represented by 2, 3, or 4 bytes.
- utf-16: every char is represented by 2 or 4 bytes (some Microsoft systems 
output xml in utf-16; it is really used!).

The problem is, e.g. for outgoing: how are you saving this? As text in some 
charset, or as a 'bytestream'?
(You cannot mix these.)
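
The byte counts in the list above can be checked with a short Python 3 sketch. The helper name `encoded_length` is hypothetical, and "utf-16-le" is used so that the byte order mark is not counted.

```python
def encoded_length(char, encoding):
    """Return how many bytes char needs in encoding, or None if it cannot be encoded."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    encodings = ("ascii", "iso-8859-1", "utf-8", "utf-16-le")
    # Plain letter, accented letter, euro sign, and an emoji outside the 16-bit range.
    for char in ["A", "é", "€", "\U0001F600"]:
        sizes = {enc: encoded_length(char, enc) for enc in encodings}
        print(ascii(char), sizes)
```

On text versus bytestream: in Python 3 text is `str` and a bytestream is `bytes`, and joining the two directly raises a TypeError, so the charset has to be chosen explicitly whenever text is read or written.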

Original comment by hjebb...@gmail.com on 20 Dec 2011 at 9:47

GoogleCodeExporter commented 8 years ago

Hi Mike,
I committed this with the latest changes you made.
Is it possible for you to make a plugin for this?

henk-jan

Original comment by hjebb...@gmail.com on 27 Dec 2011 at 5:09

GoogleCodeExporter commented 8 years ago

Yes I will try to make a plugin. Will probably be a pdf order to idoc 
conversion, as I need to do this anyway for a project at work.

Original comment by mjg1964 on 28 Dec 2011 at 12:00

GoogleCodeExporter commented 8 years ago

I have made a demo plugin for this.

Kind Regards,
Mike

Original comment by mjg1964 on 30 Dec 2011 at 10:18

Changed title: added: extract pdf

GoogleCodeExporter commented 8 years ago

Updated with channels renamed to match plugin name.

Original comment by mjg1964 on 31 Dec 2011 at 7:35

Attachments:

demo_pdf2idoc.zip

GoogleCodeExporter commented 8 years ago

Original comment by hjebb...@gmail.com on 22 Jun 2012 at 3:30

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Original comment by hjebb...@gmail.com on 10 Sep 2013 at 12:44

Changed state: Done

dmanty45 / bots

added: extract pdf #100