documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Ignore non-ascii chars in extracted PDF info. #32

Closed efroese closed 11 years ago

efroese commented 12 years ago

I was seeing a few of these errors.

/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `match'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `match'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `extract'
(eval):3:in `extract_length'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:34:in `convert'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `block (3 levels) in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `block (2 levels) in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `each_with_index'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `block in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:16:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:16:in `extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:58:in `extract_images'

It turns out the problem was that the metadata returned from pdfinfo contained the registered trademark symbol (®) in some of the fields.

For example:

Creator: Microsoft® Office Word 2007
Producer: Microsoft® Office Word 2007
CreationDate: Mon Jul 18 12:52:52 2011
ModDate: Mon Jul 18 12:52:52 2011
Tagged: yes
Pages: 4
Encrypted: no
Page size: 612 x 792 pts (letter)
File size: 114279 bytes
Optimized: no
PDF version: 1.5

After applying this patch, which removes non-ASCII characters from the pdfinfo output, I was able to use docsplit.

Tested on Linux with ruby 1.9.1-p431 and 1.9.2-p0
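
The patch itself is not reproduced in this thread. As a rough sketch of the idea only, and assuming the pdfinfo output arrives as a UTF-8 string, non-ASCII characters could be dropped before the matchers run, roughly like this. `strip_non_ascii` is a hypothetical helper name used for illustration; it is not part of docsplit.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper (not docsplit's actual code): transcode to ASCII and
# drop any character that has no ASCII equivalent, as well as any byte
# sequence that is invalid in the source encoding.
def strip_non_ascii(text)
  text.encode('ASCII', :invalid => :replace, :undef => :replace, :replace => '')
end

# Sample values taken from the pdfinfo output quoted above.
info  = "Creator: Microsoft\u00AE Office Word 2007\nPages: 4\n"
clean = strip_non_ascii(info)

# The same style of matcher that info_extractor.rb uses.
creator = clean[/^Creator:\s+([^\n]+)/, 1]
pages   = clean[/^Pages:\s+([^\n]+)/, 1]

puts creator
puts pages
```

The trade-off of this approach is that the ® is lost from the extracted creator and producer values.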

stuartf commented 12 years ago

I was able to make this work without ignoring the non-ascii characters with this patch:

diff --git a/lib/docsplit/info_extractor.rb b/lib/docsplit/info_extractor.rb
index 3d50d53..d878496 100644
--- a/lib/docsplit/info_extractor.rb
+++ b/lib/docsplit/info_extractor.rb
@@ -5,14 +5,14 @@ module Docsplit

     # Regex matchers for different bits of information.
     MATCHERS = {
-      :author   => /^Author:\s+([^\n]+)/,
-      :date     => /^CreationDate:\s+([^\n]+)/,
-      :creator  => /^Creator:\s+([^\n]+)/,
-      :keywords => /^Keywords:\s+([^\n]+)/,
-      :producer => /^Producer:\s+([^\n]+)/,
-      :subject  => /^Subject:\s+([^\n]+)/,
-      :title    => /^Title:\s+([^\n]+)/,
-      :length   => /^Pages:\s+([^\n]+)/,
+      :author   => Regexp.new("^Author:\s+([^\n]+)".encode('UTF-8')),
+      :date     => Regexp.new("^CreationDate:\s+([^\n]+)".encode('UTF-8')),
+      :creator  => Regexp.new("^Creator:\s+([^\n]+)".encode('UTF-8')),
+      :keywords => Regexp.new("^Keywords:\s+([^\n]+)".encode('UTF-8')),
+      :producer => Regexp.new("^Producer:\s+([^\n]+)".encode('UTF-8')),
+      :subject  => Regexp.new("^Subject:\s+([^\n]+)".encode('UTF-8')),
+      :title    => Regexp.new("^Title:\s+([^\n]+)".encode('UTF-8')),
+      :length   => Regexp.new("^Pages:\s+([^\n]+)".encode('UTF-8')),
     }

     # Pull out a single datum from a pdf.
@@ -29,4 +29,4 @@ module Docsplit

   end

-end
\ No newline at end of file
+end

I think this only works on Ruby 1.9 though.
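
One detail that may be worth checking in a patch of this shape: `String#encode` was introduced in Ruby 1.9, and inside a double-quoted Ruby string `\s` is a literal space and `\n` a literal newline, so `Regexp.new("...")` does not build quite the same pattern as the original `/.../` literal. The sketch below only illustrates that difference and shows a matcher keeping the ® when the input is a valid UTF-8 string; the sample values come from the pdfinfo output quoted earlier, and the variable names are made up for illustration.

```ruby
# -*- coding: utf-8 -*-

# Double-quoted: "\s" becomes a space and "\n" a newline before Regexp.new sees them.
from_double = Regexp.new("^Pages:\s+([^\n]+)")
# Single-quoted: the backslash sequences reach the regex engine unchanged.
from_single = Regexp.new('^Pages:\s+([^\n]+)')

puts from_double.source.inspect
puts from_single.source.inspect

# With space-separated output both patterns capture the page count,
# but only the single-quoted version also accepts other whitespace such as a tab.
tabbed = "Pages:\t4\n"
puts from_double.match(tabbed).inspect
puts from_single.match(tabbed)[1]

# On a valid UTF-8 string, a plain regex literal keeps the non-ASCII character.
info    = "Creator: Microsoft\u00AE Office Word 2007\nPages: 4\n"
creator = info[/^Creator:\s+([^\n]+)/, 1]
puts creator
```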

stuartf commented 12 years ago

I've submitted the above workaround as a pull request at https://github.com/documentcloud/docsplit/pull/35

amalagaura commented 11 years ago

This pull request and the suggestions above did not work for me. I have opened a pull request with code that worked for me: https://github.com/documentcloud/docsplit/pull/65

knowtheory commented 11 years ago

This should be fixed as of 92c1e18084cdbd2ecd08d0da0fd4df2e18efff0d and 80f4dc0775fe707ded4b5a14a8f84385f7cd3e34