Closed by efroese 11 years ago
I was able to make this work without ignoring the non-ascii characters with this patch:
diff --git a/lib/docsplit/info_extractor.rb b/lib/docsplit/info_extractor.rb
index 3d50d53..d878496 100644
--- a/lib/docsplit/info_extractor.rb
+++ b/lib/docsplit/info_extractor.rb
@@ -5,14 +5,14 @@ module Docsplit
# Regex matchers for different bits of information.
MATCHERS = {
- :author => /^Author:\s+([^\n]+)/,
- :date => /^CreationDate:\s+([^\n]+)/,
- :creator => /^Creator:\s+([^\n]+)/,
- :keywords => /^Keywords:\s+([^\n]+)/,
- :producer => /^Producer:\s+([^\n]+)/,
- :subject => /^Subject:\s+([^\n]+)/,
- :title => /^Title:\s+([^\n]+)/,
- :length => /^Pages:\s+([^\n]+)/,
+ :author => Regexp.new("^Author:\s+([^\n]+)".encode('UTF-8')),
+ :date => Regexp.new("^CreationDate:\s+([^\n]+)".encode('UTF-8')),
+ :creator => Regexp.new("^Creator:\s+([^\n]+)".encode('UTF-8')),
+ :keywords => Regexp.new("^Keywords:\s+([^\n]+)".encode('UTF-8')),
+ :producer => Regexp.new("^Producer:\s+([^\n]+)".encode('UTF-8')),
+ :subject => Regexp.new("^Subject:\s+([^\n]+)".encode('UTF-8')),
+ :title => Regexp.new("^Title:\s+([^\n]+)".encode('UTF-8')),
+ :length => Regexp.new("^Pages:\s+([^\n]+)".encode('UTF-8')),
}
# Pull out a single datum from a pdf.
@@ -29,4 +29,4 @@ module Docsplit
end
-end
\ No newline at end of file
+end
I think this only works on Ruby 1.9 though.
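One detail of the patch above is worth a quick check. Inside a double-quoted Ruby string, "\s" is an escape for a literal space and "\n" is a real newline, so the pattern handed to Regexp.new is not character-for-character the same as the original regex literal. The sketch below compares the two forms; the variable names and the sample inputs are made up for illustration, and the tab-separated input is only a hypothetical case.

```ruby
# Regex literal, as in the original MATCHERS hash.
literal = /^Pages:\s+([^\n]+)/

# String-built regex, as in the patch. In a double-quoted string "\s" is
# already a plain space by the time Regexp.new sees it.
from_string = Regexp.new("^Pages:\s+([^\n]+)".encode('UTF-8'))

# The string-built pattern no longer contains a backslash-s sequence.
puts from_string.source.include?('\s')  # false

# Hypothetical inputs: one padded with spaces, one separated by a tab.
padded = "Pages:          4\n"
tabbed = "Pages:\t4\n"

p padded[literal, 1]      # "4"
p padded[from_string, 1]  # "4"
p tabbed[literal, 1]      # "4"
p tabbed[from_string, 1]  # nil, because a tab is not a literal space
```

For fields separated by spaces the two forms behave the same, which is consistent with the patch working in practice. Writing the pattern in a single-quoted string would keep \s as the regex whitespace class.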
I've submitted the above workaround as a pull request at https://github.com/documentcloud/docsplit/pull/35
Neither that pull request nor the suggestions above worked for me. I have opened a pull request with the code that did work for me: https://github.com/documentcloud/docsplit/pull/65
This should be fixed as of commits 92c1e18084cdbd2ecd08d0da0fd4df2e18efff0d and 80f4dc0775fe707ded4b5a14a8f84385f7cd3e34
I was seeing a few of these errors:
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `match'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `match'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/info_extractor.rb:24:in `extract'
(eval):3:in `extract_length'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:34:in `convert'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `block (3 levels) in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:19:in `block (2 levels) in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `each_with_index'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:18:in `block in extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:16:in `each'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/image_extractor.rb:16:in `extract'
/opt/local/lib64/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:58:in `extract_images'
It turns out the problem was that the metadata returned from pdfinfo contained the registered trademark symbol (®) in some of the fields.
For example:
Creator: Microsoft® Office Word 2007
Producer: Microsoft® Office Word 2007
CreationDate: Mon Jul 18 12:52:52 2011
ModDate: Mon Jul 18 12:52:52 2011
Tagged: yes
Pages: 4
Encrypted: no
Page size: 612 x 792 pts (letter)
File size: 114279 bytes
Optimized: no
PDF version: 1.5
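To illustrate why a field like Creator above can cause trouble, here is a small sketch. It assumes one plausible scenario that this thread does not confirm: that Ruby tagged the pdfinfo output as US-ASCII (for example because of the locale) while the bytes were actually UTF-8. The sample line is modeled on the output above and the variable names are made up.

```ruby
# Sample line modeled on the pdfinfo output above.
# "\xC2\xAE" is the UTF-8 byte sequence for the registered sign.
line = "Creator: Microsoft\xC2\xAE Office Word 2007\n"

# Assumption: the command output was tagged as US-ASCII.
as_ascii = line.dup.force_encoding(Encoding::US_ASCII)

# The same bytes tagged as UTF-8.
as_utf8 = line.dup.force_encoding(Encoding::UTF_8)

puts as_ascii.valid_encoding?  # false: bytes above 0x7F are not ASCII
puts as_utf8.valid_encoding?   # true

# With a valid encoding tag, the original matcher works unchanged.
puts as_utf8[/^Creator:\s+([^\n]+)/, 1]
```

Checking valid_encoding? before matching is a cheap way to tell whether a string's encoding tag agrees with its bytes.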
After applying this patch to remove non-ascii characters from the pdfinfo output I was able to use docsplit.
Tested on Linux with ruby 1.9.1-p431 and 1.9.2-p0
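The patch mentioned above is not reproduced in this thread, so the following is only a rough sketch of the general idea of removing non-ASCII bytes before matching. The helper name strip_non_ascii and the sample input are hypothetical and are not taken from docsplit.

```ruby
# Hypothetical helper (not part of docsplit): drop every byte outside the
# 7-bit ASCII range so the result is valid US-ASCII.
def strip_non_ascii(text)
  bytes = text.dup.force_encoding(Encoding::BINARY)
  bytes.gsub(/[^\x00-\x7F]/n, '').force_encoding(Encoding::US_ASCII)
end

# Sample input modeled on the pdfinfo output above, with the registered
# sign given as its UTF-8 bytes. The US-ASCII tag is an assumption.
raw = "Creator: Microsoft\xC2\xAE Office Word 2007\nPages: 4\n"
raw = raw.dup.force_encoding(Encoding::US_ASCII)

clean = strip_non_ascii(raw)

puts clean.valid_encoding?             # true
puts clean[/^Creator:\s+([^\n]+)/, 1]  # Microsoft Office Word 2007
puts clean[/^Pages:\s+([^\n]+)/, 1]    # 4
```

The trade-off is that the symbol is lost from the extracted metadata, which is what the Regexp.new patch earlier in the thread was trying to avoid.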