amilano closed this issue 4 years ago
Hi @amilano
This appears to be an issue with the Apache Tika library rather than with Henkei. If I call Tika in the most basic fashion from a shell:
java -Djava.awt.headless=true -jar tika-app-1.23.jar -t < pages-test.pages
I get practically the same result.
I've piped the file contents into Tika instead of passing the file name as a parameter because that's how Henkei does it.
Having said that, if I pass the file name as a parameter then I get NO output, which is also interesting, i.e.:
java -Djava.awt.headless=true -jar tika-app-1.23.jar -t pages-test.pages
Either way something appears to be very broken in Tika, as far as parsing Pages files is concerned.
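To make the two invocations above easier to compare when reporting upstream, something like the following shell sketch could be used. It only wraps the same two commands; the function names (`tika_stdin`, `tika_file`, `tika_compare`) and the `TIKA_JAR` variable are made up for illustration, and the jar path is assumed to be the one used above.

```shell
#!/bin/sh
# Sketch only: helper names and TIKA_JAR are illustrative assumptions.
TIKA_JAR="${TIKA_JAR:-tika-app-1.23.jar}"

# Run Tika with the given flag, feeding the file on stdin (the way Henkei does).
tika_stdin() {
  java -Djava.awt.headless=true -jar "$TIKA_JAR" "$1" < "$2"
}

# Run Tika with the given flag, passing the file name as an argument.
tika_file() {
  java -Djava.awt.headless=true -jar "$TIKA_JAR" "$1" "$2"
}

# Print both results one after the other for the same flag and file.
tika_compare() {
  flag="$1"
  file="$2"
  echo "== stdin ($flag) =="
  tika_stdin "$flag" "$file"
  echo "== file argument ($flag) =="
  tika_file "$flag" "$file"
}
```

For example, `tika_compare -t pages-test.pages` would run the text extraction both ways, which gives a compact reproduction to attach to a Tika bug report.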
I would suggest you report this issue over at the Tika issue tracker and see if they have any insights: https://issues.apache.org/jira/projects/TIKA/issues
Sorry, I didn't address your comment RE mimetype. I also get the same results as you.
{"Content-Type":"application/zip","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pkg.PackageParser"]}
Although passing the file name as a parameter does return different results:
{"Content-Length":"74357","Content-Type":"application/vnd.apple.unknown.13","X-Parsed-By":"org.apache.tika.parser.EmptyParser","resourceName":"pages-test.pages"}
Again, this is really out of the hands of Henkei as it's Tika that's parsing the files incorrectly.
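The `application/zip` result from the piped call suggests that the file is a zip container and that, given only a byte stream, Tika reports the container type. That reading is an assumption and is not confirmed anywhere in this thread. A minimal way to check the container part by hand is to look for the `PK` zip signature at the start of the file; the `is_zip` helper below is a hypothetical name used only for this sketch.

```shell
#!/bin/sh
# Sketch only: is_zip is an illustrative helper, not part of Tika or Henkei.
# Succeeds when the file begins with the two-byte "PK" zip signature.
is_zip() {
  sig=$(dd if="$1" bs=1 count=2 2>/dev/null)
  [ "$sig" = "PK" ]
}
```

Running `is_zip pages-test.pages && echo "zip container"` would then show whether the generic zip detection is at least consistent with the file's first bytes.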
Hey @abrom, thank you for your quick response. I'll report it over there.
I'll close this issue since it is out of the scope of the gem.
Thanks again and stay safe!!!
After creating a simple Apple Pages document (only 2 pages long) and trying to extract the text from it, I'm instead getting this output:
The first page of the file actually contains the following text:
First page on PAGES text
and the second page:
Second page on PAGES text
(yeah, I'm not very creative).
From executing the mimetype method I'm getting application/zip instead of application/vnd.apple.pages.
So, at least for Pages documents, it's failing to properly extract and identify the content of these types of files.
Is anyone else experiencing the same issue?