abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Encoding::UndefinedConversionError in v1.23 when used in rails #11

Closed gsar closed 4 years ago

gsar commented 4 years ago

Processing PDFs fails like this now, in a project that uses rails:

[5] pry(main)> Henkei.new('https://example.com/a-file.pdf').metadata
#<Thread:0x00007fc7b42c4768@/Users/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        1: from /Users/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `block (2 levels) in capture2'
/Users/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `read': closed stream (IOError)
Exception in thread "main" org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:152)
Encoding::UndefinedConversionError: "\xE2" from ASCII-8BIT to UTF-8
from /Users/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:353:in `write'

If I use v1.22, everything is fine. Any ideas?

abrom commented 4 years ago

Thanks @gsar. Yes, I'm aware of the issue and have been working on a fix but hadn't found a test file that could replicate it:

https://github.com/abrom/henkei/tree/fix-source-encoding-bug

The issue was introduced after the 1.22 release, by a change that attempted to fix other file-parsing issues:

https://github.com/abrom/henkei/commit/66807cc90fcadd5b779d32df800885ecfa10a033

abrom commented 4 years ago

Moving forward, are you able to try out the above branch and see if that fixes your issue?

Also, if it does, are you able to provide me with a file that demonstrates the issue (ideally one that I could include in the spec suite to ensure we don't get regressions going forward)?

gsar commented 4 years ago

@abrom i will check your pending fix soon. looking at the diff, i have a couple of thoughts:

failures in fix-source-encoding-bug branch when running bundle exec rspec spec:

Henkei
  initialized with a given URI                                                                                                             
    #metadata reads metadata                                                                                                               
    #text reads text                                                                                                                       
  .new                                                                                                                                     
    accepts a URI                                                                                                                          
    accepts a path with spaces                                                                                                             
    refuses other objects                                                                                                                  
    accepts a stream or object that can be read                                                                                            
    accepts a root path                                                                                                                    
    refuses a path to a missing file                                                                                                       
    accepts a relative path                                                                                                                
    requires parameters                                                                                                                    
  initialized with a ASCII encoded Tempfile                                                                                                
    #text reads text                                                                                                                       
    #metadata reads metadata                                                                                                               
  initialized with a given path                                                                                                            
    #metadata reads metadata                                                                                                               
    #text reads text                                                                                                                       
    when passing in the `pipe-error.png` test file                                                                                         
      #mimetype returns `image/png`                                                                                                        
      #html returns an empty body (FAILED - 1)                                                                                             
      #text returns an empty result (FAILED - 2)                                                                                           
  working as server mode                                                                                                                   
Successfully started tika-app's server on port: 9293                                                                                       
WARNING: The server option in tika-app is deprecated and will be removed                                                                   
by Tika 2.0 if not shortly after Tika 1.14.                                                                                                
Please migrate to the JAX-RS tika-server package.                                                                                          
See https://wiki.apache.org/tika/TikaJAXRS for usage.                                                                                      
    #starts and kills server                                                                                                               
Successfully started tika-app's server on port: 9293                                                                                       
WARNING: The server option in tika-app is deprecated and will be removed                                                                   
by Tika 2.0 if not shortly after Tika 1.14.                                                                                                
Please migrate to the JAX-RS tika-server package.                                                                                          
See https://wiki.apache.org/tika/TikaJAXRS for usage.                                                                                      
    #runs samples through server mode                                                                                                      
  initialized with a given stream                                                                                                          
    #text reads text                                                                                                                       
    #metadata reads metadata                                                                                                               
  .java                                                                                                                                    
    with no specified JAVA_HOME                                                                                                            
    with a specified JAVA_HOME                                                                                                             
  .creation_date                                                                                                                           
    should return Time                                                                                                                     
  .read                                                                                                                                    
    reads text                                                                                                                             
    reads metadata                                                                                                                         
    reads mimetype                                                                                                                         
    reads metadata values with colons as strings                                                                                           
    when passing in the `pipe-error.png` test file                                                                                         
      returns an empty result (FAILED - 3)                                                                                                 
    when passing in the `actually-a-doc.jpg` test file
      reads mimetype
      returns document content

Failures:                                                                                                                                  

  1) Henkei initialized with a given path when passing in the `pipe-error.png` test file #html returns an empty body                       
     Failure/Error: expect(henkei.html).to include '<body/>'                                                                               

       expected "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"Transparency Alpha\" content=\"n...&gt; &gt;\n\nwm\nRichmond\n\n&deg;\n(a9)\nSteveston\n&gt; Placed\n\nSeafgir\n</div>\n</body></html>" to include "<body/>"                                     
       Diff:                                                                                                                               
       @@ -1,2 +1,80 @@                                                                                                                    
       -<body/>                                                                                                                            
       +<html xmlns="http://www.w3.org/1999/xhtml">                                                                                        
       +<head>                                                                                                                             
       +<meta name="Transparency Alpha" content="none"/>                                                                                   
       +<meta name="tiff:ImageLength" content="542"/>                                                                                      
       +<meta name="Compression CompressionTypeName" content="deflate"/>                                                                   
       +<meta name="Data BitsPerSample" content="8 8 8"/>                                                                                  
       +<meta name="Data PlanarConfiguration" content="PixelInterleaved"/>                                                                 
       +<meta name="Dimension VerticalPixelSize" content="0.17639795"/>                                                                    
       +<meta name="IHDR" content="width=792, height=542, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none"/>                                                                                                                        
       +<meta name="iCCP" content="profileName=ICC Profile, compressionMethod=deflate"/>                                                   
       +<meta name="Chroma ColorSpaceType" content="RGB"/>                                                                                 
       +<meta name="tiff:BitsPerSample" content="8 8 8"/>                                                                                  
       +<meta name="Content-Type" content="image/png"/>                                                                                    
       +<meta name="height" content="542"/>                                                                                                
       +<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>                                                          
       +<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser"/>                                                 
       +<meta name="X-Parsed-By" content="org.apache.tika.parser.image.ImageParser"/>                                                      
       +<meta name="pHYs" content="pixelsPerUnitXAxis=5669, pixelsPerUnitYAxis=5669, unitSpecifier=meter"/>                                
       +<meta name="Text TextEntry" content="keyword=XML:com.adobe.xmp, value=<x:xmpmeta xmlns:x=&quot;adobe:ns:meta/&quot; x:xmptk=&quot;XMP Core 5.4.0&quot;>                                                                                                                       
       +   <rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;>                                                     
       +      <rdf:Description rdf:about=&quot;&quot;                                                                                      
       +            xmlns:exif=&quot;http://ns.adobe.com/exif/1.0/&quot;>                                                                  
       +         <exif:PixelXDimension>792</exif:PixelXDimension>                                                                          
       +         <exif:PixelYDimension>542</exif:PixelYDimension>                                                                          
       +      </rdf:Description>                                                                                                           
       +   </rdf:RDF>                                                                                                                      
       +</x:xmpmeta>, language=, compression=none"/>                                                                                       
       +<meta name="Dimension PixelAspectRatio" content="1.0"/>                                                                            
       +<meta name="iTXt iTXtEntry" content="keyword=XML:com.adobe.xmp, compressionFlag=false, compressionMethod=0, languageTag=, translatedKeyword=, text=<x:xmpmeta xmlns:x=&quot;adobe:ns:meta/&quot; x:xmptk=&quot;XMP Core 5.4.0&quot;>                                          
       +   <rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;>                                                     
       +      <rdf:Description rdf:about=&quot;&quot;                                                                                      
       +            xmlns:exif=&quot;http://ns.adobe.com/exif/1.0/&quot;>                                                                  
       +         <exif:PixelXDimension>792</exif:PixelXDimension>                                                                          
       +         <exif:PixelYDimension>542</exif:PixelYDimension>                                                                          
       +      </rdf:Description>                                                                                                           
       +   </rdf:RDF>                                                                                                                      
       +</x:xmpmeta>"/>                                                                                                                    
       +<meta name="Compression NumProgressiveScans" content="1"/>                                                                         
       +<meta name="Dimension HorizontalPixelSize" content="0.17639795"/>
       +<meta name="Chroma BlackIsZero" content="true"/>  
       +<meta name="Compression Lossless" content="true"/>
       +<meta name="width" content="792"/>
       +<meta name="Dimension ImageOrientation" content="Normal"/>
       +<meta name="tiff:ImageWidth" content="792"/>
       +<meta name="Chroma NumChannels" content="3"/>
       +<meta name="Data SampleFormat" content="UnsignedIntegral"/>
       +<title></title>
       +</head>
       +<body><div class="ocr"> 
       + 
       +  
       + 
       +   
       +  
       +   
       +
       +PING &amp;&cent;
       +
       +Vanco
       +7 gorth a
       +surrard Inlet re Vancouve yo Anmore
       +(oO Coquitlam
       +Dowesown os q
       +
       +&copy; Wan
       +West Side
       +Sealielone&gt; &gt;
       +
       +wm
       +Richmond
       +
       +&deg;
       +(a9)
       +Steveston
       +&gt; Placed
       +
       +Seafgir
       +</div>
       +</body></html>

     # ./spec/henkei_spec.rb:169:in `block (4 levels) in <top (required)>'

  2) Henkei initialized with a given path when passing in the `pipe-error.png` test file #text returns an empty result
     Failure/Error: expect(henkei.text).to eq ''

       expected: ""
            got: " \n \n  \n \n   \n  \n   \n\nPING &\xC2\xA2\n\nVanco\n7 gorth a\nsurrard Inlet re Vancouve yo Anmore...Wan\nWest Side\nSealielone> >\n\nwm\nRichmond\n\n\xC2\xB0\n(a9)\nSteveston\n> Placed\n\nSeafgir\n\n"

       (compared using ==)

       Diff:
       @@ -1 +1,30 @@
       + 
       + 
       +  
       + 
       +   
       +  
       +   
       +
       +PING &??
       +
       +Vanco
       +7 gorth a
       +surrard Inlet re Vancouve yo Anmore
       +(oO Coquitlam
       +Dowesown os q
       +
       +?? Wan
       +West Side
       +Sealielone> >
       +
       +wm
       +Richmond
       +
       +??
       +(a9)
       +Steveston
       +> Placed
       +
       +Seafgir

     # ./spec/henkei_spec.rb:165:in `block (4 levels) in <top (required)>'

  3) Henkei.read when passing in the `pipe-error.png` test file returns an empty result
     Failure/Error: expect(text).to eq ''

       expected: ""
            got: " \n \n  \n \n   \n  \n   \n\nPING &\xC2\xA2\n\nVanco\n7 gorth a\nsurrard Inlet re Vancouve yo Anmore...Wan\nWest Side\nSealielone> >\n\nwm\nRichmond\n\n\xC2\xB0\n(a9)\nSteveston\n> Placed\n\nSeafgir\n\n"

       (compared using ==)

       Diff:
       @@ -1 +1,30 @@
       + 
       + 
       +  
       + 
       +   
       +  
       +   
       +
       +PING &??
       +
       +Vanco
       +7 gorth a
       +surrard Inlet re Vancouve yo Anmore
       +(oO Coquitlam
       +Dowesown os q
       +
       +?? Wan
       +West Side
       +Sealielone> >
       +
       +wm
       +Richmond
       +
       +??
       +(a9)
       +Steveston
       +> Placed
       +
       +Seafgir

     # ./spec/henkei_spec.rb:51:in `block (4 levels) in <top (required)>'

Finished in 28.7 seconds (files took 0.22328 seconds to load)
31 examples, 3 failures

Failed examples:

rspec ./spec/henkei_spec.rb:168 # Henkei initialized with a given path when passing in the `pipe-error.png` test file #html returns an empty body
rspec ./spec/henkei_spec.rb:164 # Henkei initialized with a given path when passing in the `pipe-error.png` test file #text returns an empty result
rspec ./spec/henkei_spec.rb:48 # Henkei.read when passing in the `pipe-error.png` test file returns an empty result

gsar commented 4 years ago

@abrom forgot to mention, all pdf urls i tried failed for me, so the reproduction seems trivial. might be worth adding one to the samples as there aren't any pdf files in the tests.

Edit: i have determined that the failure only happens if rails is loaded and when using a url or io object. downloading the same file to local disk and passing the file path doesn't fail. example:

irb(main):001:0> require 'henkei'                                                                                                          
=> true                                                                                                                                    
irb(main):002:0> Henkei.new('http://africau.edu/images/default/sample.pdf').metadata                                                       
=> {"Content-Type"=>"application/pdf", "Creation-Date"=>"2006-03-01T07:28:26Z", "X-Parsed-By"=>["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"], "access_permission:assemble_document"=>"true", "access_permission:can_modify"=>"true", "access_permission:can_print"=>"true", "access_permission:can_print_degraded"=>"true", "access_permission:extract_content"=>"true", "access_permission:extract_for_accessibility"=>"true", "access_permission:fill_in_form"=>"true", "access_permission:modify_annotations"=>"true", "dc:format"=>"application/pdf; version=1.3", "dcterms:created"=>"2006-03-01T07:28:26Z", "meta:creation-date"=>"2006-03-01T07:28:26Z", "pdf:PDFVersion"=>"1.3", "pdf:charsPerPage"=>["569", "367"], "pdf:docinfo:created"=>"2006-03-01T07:28:26Z", "pdf:docinfo:creator_tool"=>"Rave (http://www.nevrona.com/rave)", "pdf:docinfo:producer"=>"Nevrona Designs", "pdf:encrypted"=>"false", "pdf:hasXFA"=>"false", "pdf:hasXMP"=>"false", "pdf:unmappedUnicodeCharsPerPage"=>["0", "0"], "xmp:CreatorTool"=>"Rave (http://www.nevrona.com/rave)", "xmpTPg:NPages"=>"2"}                       
irb(main):003:0> require 'open-uri'                                                                                                        
=> true                                                                                                                                    
irb(main):004:0> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).metadata                                          
=> {"Content-Type"=>"application/pdf", "Creation-Date"=>"2006-03-01T07:28:26Z", "X-Parsed-By"=>["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"], "access_permission:assemble_document"=>"true", "access_permission:can_modify"=>"true", "access_permission:can_print"=>"true", "access_permission:can_print_degraded"=>"true", "access_permission:extract_content"=>"true", "access_permission:extract_for_accessibility"=>"true", "access_permission:fill_in_form"=>"true", "access_permission:modify_annotations"=>"true", "dc:format"=>"application/pdf; version=1.3", "dcterms:created"=>"2006-03-01T07:28:26Z", "meta:creation-date"=>"2006-03-01T07:28:26Z", "pdf:PDFVersion"=>"1.3", "pdf:charsPerPage"=>["569", "367"], "pdf:docinfo:created"=>"2006-03-01T07:28:26Z", "pdf:docinfo:creator_tool"=>"Rave (http://www.nevrona.com/rave)", "pdf:docinfo:producer"=>"Nevrona Designs", "pdf:encrypted"=>"false", "pdf:hasXFA"=>"false", "pdf:hasXMP"=>"false", "pdf:unmappedUnicodeCharsPerPage"=>["0", "0"], "xmp:CreatorTool"=>"Rave (http://www.nevrona.com/rave)", "xmpTPg:NPages"=>"2"}                       
irb(main):005:0> require 'rails'                                                                                                           
=> true                                                                                                                                    
irb(main):006:0> Henkei.new('http://africau.edu/images/default/sample.pdf').metadata                                                       
#<Thread:0x00005617db1c9dc0@/home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347 run> terminated with exception (report_on_exception is true):                                                                                                                                
Traceback (most recent call last):                                                                                                         
        1: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `block (2 levels) in capture2'                             
/home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `read': closed stream (IOError)                                            
Exception in thread "main" org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes                                
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)                                                        
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)                                                                
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)                                                                           
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:152)                                                                              
Traceback (most recent call last):                                                                                                         
       16: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli.rb:463:in `exec'                                               
       15: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:28:in `run'                                            
       14: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:74:in `kernel_load'                                    
       13: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:74:in `load'                       
       12: from /home/gsar/.rbenv/versions/2.6.5/bin/irb:23:in `<top (required)>'                                                          
       11: from /home/gsar/.rbenv/versions/2.6.5/bin/irb:23:in `load'                                                                      
       10: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'                       
        9: from (irb):6                                                                                                                    
        8: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:103:in `metadata'                    
        7: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:33:in `read'        
        6: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:229:in `client_read'
        5: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:342:in `capture2'                                
        4: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:159:in `popen2'           
        3: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:219:in `popen_run'        
        2: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:353:in `block in capture2'
        1: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:353:in `write'            
Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)
irb(main):007:0> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).metadata
#<Thread:0x00005617db454248@/home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        1: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `block (2 levels) in capture2'
/home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:347:in `read': closed stream (IOError)
Exception in thread "main" org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:152)
Traceback (most recent call last):
       16: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:28:in `run'
       15: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:74:in `kernel_load'
       14: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/bundler/cli/exec.rb:74:in `load'
       13: from /home/gsar/.rbenv/versions/2.6.5/bin/irb:23:in `<top (required)>'
       12: from /home/gsar/.rbenv/versions/2.6.5/bin/irb:23:in `load' 
       11: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'
       10: from (irb):6
        9: from (irb):7:in `rescue in irb_binding'
        8: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:103:in `metadata'
        7: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:33:in `read'
        6: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/gems/2.6.0/gems/henkei-1.23.0/lib/henkei.rb:229:in `client_read'
        5: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:342:in `capture2'
        4: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:159:in `popen2'
        3: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:219:in `popen_run'
        2: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:353:in `block in capture2'
        1: from /home/gsar/.rbenv/versions/2.6.5/lib/ruby/2.6.0/open3.rb:353:in `write'
Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)
irb(main):008:0> Henkei.new(Kernel.open('./sample.pdf')).metadata     
=> {"Content-Type"=>"application/pdf", "Creation-Date"=>"2006-03-01T07:28:26Z", "X-Parsed-By"=>["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"], "access_permission:assemble_document"=>"true", "access_permission:can_modify"=>"true", "access_permission:can_print"=>"true", "access_permission:can_print_degraded"=>"true", "access_permission:extract_content"=>"true", "access_permission:extract_for_accessibility"=>"true", "access_permission:fill_in_form"=>"true", "access_permission:modify_annotations"=>"true", "dc:format"=>"application/pdf; version=1.3", "dcterms:created"=>"2006-03-01T07:28:26Z", "meta:creation-date"=>"2006-03-01T07:28:26Z", "pdf:PDFVersion"=>"1.3", "pdf:charsPerPage"=>["569", "367"], "pdf:docinfo:created"=>"2006-03-01T07:28:26Z", "pdf:docinfo:creator_tool"=>"Rave (http://www.nevrona.com/rave)", "pdf:docinfo:producer"=>"Nevrona Designs", "pdf:encrypted"=>"false", "pdf:hasXFA"=>"false", "pdf:hasXMP"=>"false", "pdf:unmappedUnicodeCharsPerPage"=>["0", "0"], "xmp:CreatorTool"=>"Rave (http://www.nevrona.com/rave)", "xmpTPg:NPages"=>"2"}
irb(main):009:0>

gsar commented 4 years ago

@abrom passing binmode: true to the capture2 call fixes it for me, as does using your branch with the force_encoding('UTF-8'). i think setting binmode is the safer option, fwiw.
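
For readers unfamiliar with the option: Open3.capture2 accepts binmode: true, which switches the child's stdin and stdout pipes to binary mode so that Ruby does not attempt to transcode the bytes it writes or reads. The sketch below is not henkei's actual code. The echo command is a hypothetical stand-in for the Tika invocation, and the payload bytes are made up for illustration. The thread does not confirm the root cause; one plausible trigger is Rails setting Encoding.default_internal to UTF-8 when it loads.

```ruby
require 'open3'
require 'rbconfig'

# Made-up binary payload: a PDF-like header followed by high bytes that are
# not valid UTF-8, similar to the "\xE2" byte in the error above.
payload = "%PDF-1.3\n\xE2\xE3\xCF\xD3\n".b

# Hypothetical stand-in for the Tika command: a child Ruby process that
# copies its stdin to its stdout unchanged.
echo_command = [
  RbConfig.ruby, '-e',
  'STDIN.binmode; STDOUT.binmode; STDOUT.write(STDIN.read)'
]

# binmode: true puts both pipes into binary mode, so the payload is written
# and the result is read back as raw bytes.
output, status = Open3.capture2(*echo_command, stdin_data: payload, binmode: true)

puts "child succeeded: #{status.success?}"
puts "bytes preserved: #{output.bytes == payload.bytes}"
```

Compared with calling force_encoding('UTF-8') on the data, binary mode makes no claim about what the bytes mean, which is probably why it reads as the safer choice for arbitrary file contents.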

abrom commented 4 years ago

Nice one @gsar! I hadn't been able to replicate it myself, but loading Rails breaks it for me too.

Yes, agreed: configuring binmode looks to be the better option.

In terms of testing this, I'll look to add to the build matrix so the code can be tested with/without whatever Rails is doing that messes up the parsing, and will add some tests that include local and remote PDFs.

abrom commented 4 years ago

As for the errors you were seeing in the specs, LANG=en_CA.UTF-8 rspec passes every example for me. On both my machine and Travis CI, Tika is unable to read the contents of the image at all. My guess is that you have some other library installed that Tika is using to OCR the text out. Maybe tesseract?

gsar commented 4 years ago

@abrom yes i do have the tesseract libs installed due to another dependency.

abrom commented 4 years ago

Hmm, might be something to add to the build matrix? You're welcome to add it if you want to; otherwise it's likely safe to just ignore.
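
If the specs were ever made aware of OCR, one possible approach is to detect whether a tesseract executable is present and branch the pipe-error.png expectations on that. This is only a sketch under that assumption: executable_on_path? is a hypothetical helper, not part of henkei, and it only checks PATH, which may not match how Tika locates Tesseract.

```ruby
# Hypothetical helper: returns true when an executable file with the given
# name exists in one of the directories listed in `path`.
def executable_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# A spec could use this flag to expect OCR text instead of an empty result.
ocr_available = executable_on_path?('tesseract')
puts "tesseract on PATH: #{ocr_available}"
```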

abrom commented 4 years ago

Fixed by #12

abrom commented 4 years ago

FYI @gsar this has been released in v1.23.1