abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Error No such file or directory-Errno::ENOENT #14

Open Jasmeet2011 opened 4 years ago

Jasmeet2011 commented 4 years ago

Hi, I am trying to read a Word document but I keep getting the error below. I am using Windows 10, and 'echo %JAVA_HOME%' gives 'C:\Program Files\Java\jdk1.8.0_191'.

C:/Ruby26-x64/lib/ruby/2.6.0/open3.rb:213:in `spawn': No such file or directory - C:\Program Files\Java\jdk1.8.0_191/bin/java -Djava.awt.headless=true -jar C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-app-1.23.jar --config=C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-config.xml -t (Errno::ENOENT)
        from C:/Ruby26-x64/lib/ruby/2.6.0/open3.rb:213:in `popen_run'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open3.rb:159:in `popen2'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open3.rb:342:in `capture2'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/lib/henkei.rb:229:in `client_read'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/lib/henkei.rb:33:in `read'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/lib/henkei.rb:81:in `text'
        from -:3:in

If I go to the command prompt and run 'java -Djava.awt.headless=true -jar C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-app-1.23.jar --config=C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-config.xml -t', there is no error. Can you please guide me?

abrom commented 4 years ago

It sounds like you've already pretty much identified the problem. Henkei is looking in the path specified by the JAVA_HOME ENV var for the Java binary. It's apparently not there. Not much I can do to help you there except suggest you check where you have Java installed.

If it helps, this is how Henkei is building the path to the Java bin:

  def self.java_path
    ENV['JAVA_HOME'] ? ENV['JAVA_HOME'] + '/bin/java' : 'java'
  end
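
The path logic above can be checked outside Henkei with a small diagnostic. This is only a sketch: `java_path_for` and `java_path_report` are hypothetical helpers, not part of Henkei. They rebuild the same string, note whether it contains a space, and list which candidate files exist on disk. The `.exe` candidate is an assumption that the Windows binary is named `java.exe`.

```ruby
# Hypothetical helper (not part of Henkei): mirrors the java_path logic
# quoted above, but takes the environment as an argument so it can be tested.
def java_path_for(env)
  env['JAVA_HOME'] ? env['JAVA_HOME'] + '/bin/java' : 'java'
end

# Hypothetical helper: reports the built path, whether it contains a space,
# and which candidate files actually exist on disk.
def java_path_report(env = ENV)
  path = java_path_for(env)
  # Assumption: on Windows the binary is java.exe, so check that name as well.
  candidates = [path, "#{path}.exe"]
  {
    path: path,
    contains_space: path.include?(' '),
    existing: candidates.select { |candidate| File.file?(candidate) }
  }
end

# Print the report for the current environment.
p java_path_report(ENV)
```

An empty `existing` list would suggest that JAVA_HOME points somewhere other than the actual Java install.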

Jasmeet2011 commented 4 years ago

Thanks for the suggestion. I understand that the issue is related to JAVA_HOME, and I have been trying several ways to isolate the problem, but I am going nowhere. I have done the following.

  1. I have uninstalled and reinstalled Java. However, the problem persists, and Java is also in the PATH. I notice in the error, though, that the location of the java binary being referred to is slightly different. Please notice the forward slashes before bin and java: 'C:\Program Files\Java\jdk1.8.0_191/bin/java'

  2. I have rechecked that the java binary is indeed in this directory: 'C:\Program Files (x86)\Java\jre1.8.0_251\bin'

  3. So I tried changing the forward slashes to backslashes in the Ruby file and landed with the same error:

'C:/Ruby26-x64/lib/ruby/2.6.0/open3.rb:213:in `spawn': No such file or directory - C:\Program Files (x86)\Java\jre1.8.0_251\bin\java ---------- (Errno::ENOENT)'

  4. Checked JAVA_HOME and got the same path as stated above: 'echo %JAVA_HOME%' gives 'C:\Program Files (x86)\Java\jre1.8.0_251'
  5. Then I executed this in the command prompt: 'for %i in (java.exe) do @echo. %~$PATH:i' and got 'C:\Program Files (x86)\Common Files\Oracle\Java\javapath\java.exe', and realised that the path is different.
  6. Corrected it, and the new path appears to be the same as where the java binary is placed: 'for %i in (java.exe) do @echo. %~$PATH:i' now gives 'C:\Program Files (x86)\Java\jre1.8.0_251\bin\java.exe'
  7. Tried to get the JRE info and got this:

'HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\JavaSoft\Java Runtime Environment CurrentVersion REG_SZ 1.8

JRE VERSION: 1.8 HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\JavaSoft\Java Runtime Environment\1.8 JavaHome REG_SZ C:\Program Files (x86)\Java\jre1.8.0_251

JavaHome: C:\Program Files (x86)\Java\jre1.8.0_251'

  8. Hoped that I had corrected the error, but landed in the same place.
  9. Any suggestions to move forward will be appreciated. Thanks in advance.

abrom commented 4 years ago

Not sure I can add much. I'd suggest trying to replicate the error from a Windows shell, i.e. what happens if you execute the command:

C:\Program Files\Java\jdk1.8.0_191/bin/java -Djava.awt.headless=true -jar C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-app-1.23.jar --config=C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-config.xml -t <some file path>

(of course replace <some file path> with the path to a Word doc or text file.)
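
The same kind of reproduction can be attempted from Ruby. The sketch below rests on an assumption the thread does not confirm: that the space in "Program Files" matters when the whole command is passed as one string. `run_with_argv` is a hypothetical helper that passes the executable and each argument separately to `Open3.capture2`, so no shell re-splits them. The current Ruby interpreter stands in for Java so that the sketch runs anywhere.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a program with an explicit argument list, so that
# a path containing spaces is passed through as a single argument.
def run_with_argv(executable, *args)
  output, status = Open3.capture2(executable, *args)
  [output, status.success?]
end

# Stand-in demonstration using the current Ruby interpreter instead of Java:
# the child process prints how many arguments it received.
output, ok = run_with_argv(RbConfig.ruby, '-e', 'print ARGV.length',
                           'C:/Program Files/test.doc')
puts "argument count: #{output} (success: #{ok})"
```

Pointing the same helper at the Java binary, with the arguments from the command above, would show whether the Errno::ENOENT also occurs when nothing is split on spaces.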

abrom commented 4 years ago

Also I noticed you're referencing two different paths. The one in your original post was the JDK but your second post referenced the JRE. Would be worth double checking there isn't an issue there?

Jasmeet2011 commented 4 years ago

Ran this in the command prompt and received nothing except one blank command prompt ' C:\Users\Desktop> C:\Users\Desktop>'

  1. With regard to your second observation, I removed Java completely from the system and reinstalled it. In the earlier installation the Java directory had both the JDK and the JRE, and I was referencing the Java binary in the JDK folder. With the current installation, only the JRE was installed. Please let me know whether both the JDK and the JRE need to be installed and, if so, which path should be referenced. Thanks a lot.

abrom commented 4 years ago

Not sure I follow what you mean regarding what you actually ran? You've only pasted some quote characters.

I'm by no means a Java expert, but you should only need the JRE to run

Jasmeet2011 commented 4 years ago

As you suggested, I executed this statement:

C:\"Program Files (x86)"\Java\jre1.8.0_251\bin\java -Djava.awt.headless=true -jar C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-app-1.23.jar --config=C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-config.xml -t C:/Users/Documents/Ruby/test.doc

and got the following in the command prompt:

'Error: Could not find or load main class .awt.headless=true'

Please suggest.

Jasmeet2011 commented 4 years ago

Also tried executing the command from the directory containing the Java binary:

C:\Program Files (x86)\Java\jre1.8.0_251> java -Djava.awt.headless=true -jar C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-app-1.23.jar --config=C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/henkei-1.23.1/jar/tika-config.xml -t C:/Users/Sun/Documents/Ruby/test.doc

Got the same error:

Error: Could not find or load main class .awt.headless=true

abrom commented 4 years ago

Sorry @Jasmeet2011 I can't help you. This issue seems to be related to your system install of Java and not Henkei.

Jasmeet2011 commented 4 years ago

Hi, I finally managed to install Java in a different path and could start working. I can now read a text file, but docx, xlsx and pdf files throw an error:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@1b9a632
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:152)
Caused by: org.apache.tika.io.TaggedIOException: Truncated ZIP file
        at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
        at org.apache.commons.io.IOUtils.read(IOUtils.java:3077)
        at org.apache.commons.io.IOUtils.read(IOUtils.java:3099)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
        at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:351)
        at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:288)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more
Caused by: org.apache.tika.io.TaggedIOException: Truncated ZIP file
        at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        ... 15 more
Caused by: java.io.IOException: Truncated ZIP file
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:560)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:458)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read1(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        ... 17 more

Can you please suggest where I am going wrong?

abrom commented 4 years ago

From the stack trace I can see all of this is coming from Apache Tika (the library Henkei calls to extract the contents etc).

Searching the web for the error shows up this (long running) issue on the Tika issue tracker.

https://issues.apache.org/jira/browse/TIKA-2407

If you have a read through the stack trace it's also telling you the problem:

Caused by: java.io.IOException: Truncated ZIP file

i.e. a corrupted ZIP file.

DOCX and XLSX formats are simply a bunch of XML files stored in a ZIP file, which would explain why you're seeing errors about ZIP files.

Unfortunately this issue comes down to the files you're feeding into it. If they're corrupted it's unlikely Tika will be able to read them! If you believe the files are not corrupted then I'd suggest you raise an issue with the Tika project and see if they might be able to help you.
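
A quick first check on a suspect file is whether it at least begins like a ZIP container. The sketch below uses a hypothetical helper, `zip_container?`, which compares the first four bytes of a file with the ZIP local file header signature. Passing the check does not prove that the archive is complete or uncorrupted, only that it starts the way a DOCX or XLSX file normally does.

```ruby
# ZIP local file header signature: the bytes "PK" followed by 0x03 0x04.
ZIP_SIGNATURE = "PK\x03\x04".b

# Hypothetical helper: true when the file starts with the ZIP signature.
def zip_container?(path)
  # binread returns the raw bytes with no newline or encoding translation.
  File.binread(path, 4) == ZIP_SIGNATURE
end
```

For example, `zip_container?('test.docx')` returning false would point to a problem with the file itself, rather than with Tika or Henkei.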

Jasmeet2011 commented 4 years ago

ok, thanks

Jasmeet2011 commented 4 years ago

Apparently this error is not a Tika issue, as I managed to extract the content of the document using this command line:

C:\Ruby26-x64\lib\ruby\gems\2.6.0\gems\henkei-1.23.1\jar>java -jar tika-app-1.23.jar -t "C:\Users\Downloads\test.docx"

with the following result:

Jun 17, 2020 10:40:40 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Jun 17, 2020 10:40:40 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version

**This is a test document**

I ran the same document from a Ruby file:

henkei = Henkei.new 'C:/Users/Sun/Downloads/test.docx'
text = henkei.text

and get this error:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@1b9a632
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)

Any thoughts on this?

Jasmeet2011 commented 4 years ago

I tried streaming a file from the web:

henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = henkei.text

and it worked!

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nunc at risus vel erat tempus posuere. Aenean non ante. Suspendisse vehicula dolor sit amet odio. Sed at sem. Nunc fringilla. Etiam ut diam. Nunc diam neque, adipiscing sed, ultrices a, pulvinar vitae, mauris. Suspendisse at elit vitae quam volutpat dapibus. Phasellus consequat magna in tellus. Mauris mauris dolor, dapibus sed, com

But it still fails when reading any file from my system. Can you please tell me where I am going wrong?

abrom commented 3 years ago

FYI I believe the root of this issue has been fixed by #19

Can you please try updating to the latest release and let me know if the problem persists?

cyndilopez commented 3 years ago

Wanted to comment and say that I also get the same error attempting to read a file in Windows 10 using henkei 1.27.1. The same code works perfectly on my mac machine. But on Windows, the following:

require 'henkei'

data = File.read '/Users/me/Documents/original.docx'
office_properties = Henkei.read :metadata, data

results in:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@c667f46
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:210)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:153)
Caused by: org.apache.tika.io.TaggedIOException: Truncated ZIP file
        at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
        at org.apache.commons.io.IOUtils.read(IOUtils.java:1710)
        at org.apache.commons.io.IOUtils.read(IOUtils.java:1685)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:114)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
        at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:450)
        at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:370)
        at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:321)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
        ... 5 more
Caused by: org.apache.tika.io.TaggedIOException: Truncated ZIP file
        at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        ... 16 more
Caused by: java.io.IOException: Truncated ZIP file
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:590)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:488)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:245)
        at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:285)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:344)
        at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
        ... 18 more
C:/ruby/lib/ruby/2.4.0/json/common.rb:156:in `parse': 751: unexpected token at '' (JSON::ParserError)
        from C:/ruby/lib/ruby/2.4.0/json/common.rb:156:in `parse'
        from C:/ruby/lib/ruby/gems/2.4.0/gems/henkei-1.27.1/lib/henkei.rb:59:in `read'
        from C:/Users/pipeline/Documents/test_henkei.rb:3:in `<main>'

abrom commented 3 years ago

Hi @cyndilopez, the error you've posted looks very different to the original error. The original error for this issue related to a Java path issue. Yours appears to be about a corrupted DOCX file. Can you please create a separate issue for this?

cyndilopez commented 3 years ago

Oops, my bad. I thought you were trying to fix the error that popped up for the OP on June 17, 2020. I don't think the docx files are corrupted, because on my Mac the exact same files are used with the same code and it works perfectly fine, but on my virtual machine running Windows 10 it keeps running into the error posted above. I'll open another issue.
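
One macOS versus Windows difference may be worth ruling out here. This is an assumption, not something confirmed in this thread: on Windows, `File.read` opens the file in text mode, which can translate line endings and may stop at a Ctrl-Z (0x1A) byte, so binary data such as a DOCX could reach Tika altered or shortened. `File.binread` returns the bytes unchanged on every platform. The hypothetical helper below compares the two reads against the size on disk.

```ruby
# Hypothetical check: compare the file size on disk with the number of bytes
# returned by a text-mode read and by a binary-mode read.
def read_sizes(path)
  {
    on_disk: File.size(path),
    text_mode: File.read(path).bytesize,
    binary_mode: File.binread(path).bytesize
  }
end

# Hypothetical helper: read a document's raw bytes for handing to a parser.
def read_document_bytes(path)
  File.binread(path)
end
```

If `text_mode` comes out smaller than `on_disk` on the Windows machine, passing the result of `read_document_bytes` to `Henkei.read` in place of the `File.read` data would be a reasonable next experiment.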