abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Not properly working with iWork Pages documents #13

Closed amilano closed 4 years ago

amilano commented 4 years ago

After creating a simple Apple Pages document (only 2 pages long) and trying to extract the text from it, I get this output instead:

doc_html = henkei_doc.text
 => "\nIndex/Document.iwa\n\n\nIndex/ViewState.iwa\n\n\nIndex/CalculationEngine-1623126.iwa\n\n\nIndex/AnnotationAuthorStorage-1623125.iwa\n\n\nIndex/DocumentStylesheet.iwa\n\n\nIndex/DocumentMetadata.iwa\n\n\nIndex/Metadata.iwa\n\n\nMetadata/Properties.plist\n\n\nMetadata/DocumentIdentifier\n1B793D61-CA32-46AF-AE75-03E077B1CBCE\n\n\n\nMetadata/BuildVersionHistory.plist\n \n \n\t Template: Blank (10.0)\n\t M10.0-6748-2\n\n\n\n\n\npreview.jpg\n\n\npreview-micro.jpg\n\n\npreview-web.jpg\n\n"

The first page of the file actually contains the text "First page on PAGES text" and the second page contains "Second page on PAGES text" (yeah, I'm not very creative).

Calling the mimetype method returns application/zip instead of application/vnd.apple.pages.

So, at least for Pages documents, it fails to properly identify the file type and extract the content.
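As a stopgap, the symptom described above can be detected in the extracted text itself. The following is a minimal sketch, assuming that a zip entry listing containing both `Index/Document.iwa` and `Metadata/Properties.plist` (two of the entry names in the output above) indicates an unparsed iWork bundle. The helper name `iwork_listing?` and the choice of marker entries are hypothetical and not part of Henkei or Tika.

```ruby
# Hypothetical helper (not part of Henkei): guesses whether extracted "text"
# is really just a zip entry listing from an iWork bundle.
# Assumption: both marker entries appear, each on its own line, in that case.
IWORK_MARKER_ENTRIES = ['Index/Document.iwa', 'Metadata/Properties.plist'].freeze

def iwork_listing?(text)
  # Split into lines, trim whitespace, and drop the blank lines.
  lines = text.to_s.lines.map(&:strip).reject(&:empty?)
  IWORK_MARKER_ENTRIES.all? { |entry| lines.include?(entry) }
end

# Shortened version of the output reported in this issue.
sample = "\nIndex/Document.iwa\n\n\nIndex/ViewState.iwa\n\n\n" \
         "Metadata/Properties.plist\n\n\npreview.jpg\n\n"

puts iwork_listing?(sample)                      # prints "true"
puts iwork_listing?('First page on PAGES text')  # prints "false"
```

A check like this only flags the problem; it does not recover the document text.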

Is anyone else experiencing the same issue?

abrom commented 4 years ago

Hi @amilano

This appears to be an issue with the Apache Tika library, rather than an issue with Henkei. If I call Tika in the most basic fashion from a shell:

java -Djava.awt.headless=true -jar tika-app-1.23.jar -t < pages-test.pages

I get practically the same result.

I piped the file contents into Tika, rather than passing the file as a parameter, because that's how Henkei does it.

Having said that, if I pass the file name as a parameter then I get NO output, which is also interesting, i.e.:

java -Djava.awt.headless=true -jar tika-app-1.23.jar -t pages-test.pages
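To compare the two invocations side by side, something like the following Ruby sketch could be used. It assumes `java` is on the PATH and that a Tika app jar is available locally; the helper names `tika_command` and `compare_tika_modes` are hypothetical, and this is not how Henkei itself is implemented.

```ruby
require 'open3'

# Builds the same Tika command line used in this thread.
# `jar` is the path to a local tika-app jar (an assumption about your setup).
def tika_command(jar, flag, path = nil)
  cmd = ['java', '-Djava.awt.headless=true', '-jar', jar, flag]
  cmd << path if path # omit the path when the file is piped on stdin
  cmd
end

# Runs Tika twice: once with the file piped on stdin, once with the
# file name passed as an argument, and returns both outputs.
def compare_tika_modes(jar, path, flag = '-t')
  piped, = Open3.capture2(*tika_command(jar, flag),
                          stdin_data: File.binread(path), binmode: true)
  by_name, = Open3.capture2(*tika_command(jar, flag, path))
  { piped: piped, by_name: by_name }
end

# Example (requires Java and the jar, so it is left commented out):
# results = compare_tika_modes('tika-app-1.23.jar', 'pages-test.pages')
# puts results[:piped].inspect, results[:by_name].inspect
```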

Either way something appears to be very broken in Tika, as far as parsing Pages files is concerned.

I would suggest you report this issue over at the Tika issue tracker and see if they have any insights: https://issues.apache.org/jira/projects/TIKA/issues

abrom commented 4 years ago

Sorry, I didn't address your comment regarding the mimetype. I get the same result as you when piping the file in:

{"Content-Type":"application/zip","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pkg.PackageParser"]}

Although passing the filename as a parameter does return different results:

{"Content-Length":"74357","Content-Type":"application/vnd.apple.unknown.13","X-Parsed-By":"org.apache.tika.parser.EmptyParser","resourceName":"pages-test.pages"}
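The difference between the two metadata results is easier to see when they are parsed and summarised. A minimal sketch using Ruby's standard `json` library, with the two JSON strings copied from the outputs above; the helper names `parsers` and `summary` are hypothetical:

```ruby
require 'json'

# Metadata when the file is piped on stdin.
piped = JSON.parse('{"Content-Type":"application/zip","X-Parsed-By":' \
  '["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pkg.PackageParser"]}')

# Metadata when the file name is passed as a parameter.
by_name = JSON.parse('{"Content-Length":"74357",' \
  '"Content-Type":"application/vnd.apple.unknown.13",' \
  '"X-Parsed-By":"org.apache.tika.parser.EmptyParser",' \
  '"resourceName":"pages-test.pages"}')

# X-Parsed-By is an array in one result and a plain string in the other,
# so Array() is used to normalise it before taking the short class names.
def parsers(metadata)
  Array(metadata['X-Parsed-By']).map { |name| name.split('.').last }
end

def summary(metadata)
  "#{metadata['Content-Type']} via #{parsers(metadata).join(', ')}"
end

puts summary(piped)   # prints "application/zip via DefaultParser, PackageParser"
puts summary(by_name) # prints "application/vnd.apple.unknown.13 via EmptyParser"
```

In the piped case the file was handled as a generic zip package by PackageParser, while in the by-name case EmptyParser was used, which matches the empty text output.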

Again, this is really out of the hands of Henkei as it's Tika that's parsing the files incorrectly.

amilano commented 4 years ago

Hey @abrom, thank you for your quick response. I'll report it over there.

I'll close this issue since it is out of the scope of the gem.

Thanks again and stay safe!