ICIJ / node-tika

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
MIT License

cannot extract text from scanned PDF #14

Open arimai opened 8 years ago

arimai commented 8 years ago

I am trying to extract text from scanned pdf documents. It works fine for most of them except a couple I tested. I am able to extract the metadata correctly but not the text in the pdf. It returns with a blank set of lines for the text part. Are there any specific pdf versions or some other criteria that can cause this issue? Does it have anything to do with the pdf producer which in this case is Haru Free PDF Library 2.0.8?

mattcg commented 8 years ago

Does the PDF contain selectable text or an image of the text?

Could you attach the PDF to this issue or send it to me by email if you'd rather not? My address is mcaruana@icij.org.

arimai commented 8 years ago

The PDF is essentially a scanned medical document and does not contain selectable text (i.e., when you try to select, you can only select an area of an image of the text). Unfortunately I won't be able to share it with you, but I am attaching another sample PDF I found online which behaved the same way: test.pdf. I tested such documents using the Tika server jar files directly as well and they did not give any output, although Tika did support JPEG, PNG and TIFF formats. I am guessing the issue lies with Tika not supporting scanned PDFs? If yes, do you have any workarounds for this situation?

mattcg commented 8 years ago

"does not contain selectable text"

The process of turning image text into digital text is called OCR. There's no magical extraction library that will do this without the support of an OCR engine. In this case, you will have to install Tesseract, which is supported by Tika, and specify the ocrLanguage option in node-tika.
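As a rough illustration of passing the ocrLanguage option, here is a minimal sketch. It assumes node-tika's `tika.text(uri, options, callback)` call and that the module is installed under the npm name `tika`; the helper name `extractWithOcr` and the `blank` flag are made up for this example. The module is passed in as an argument so the helper does not depend on how it is loaded.

```javascript
// Sketch only. `tika` is expected to be the node-tika module,
// e.g. require('tika') (assumed package name).
function extractWithOcr(tika, file, ocrLanguage, callback) {
  // ocrLanguage is the option named in this thread; a Tesseract
  // language code such as 'eng' would be a typical value.
  const options = { ocrLanguage: ocrLanguage };

  tika.text(file, options, function (err, text) {
    if (err) return callback(err);
    // A scanned PDF that was not OCRed tends to come back as whitespace only.
    const blank = !text || text.trim().length === 0;
    callback(null, { text: text, blank: blank });
  });
}
```

Usage would look something like `extractWithOcr(require('tika'), 'test.pdf', 'eng', function (err, result) { ... })`. If `result.blank` is still true with Tesseract installed, the problem is probably elsewhere in the pipeline rather than in the option itself.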

arimai commented 8 years ago

I am using the 'magical' OCR engine tesseract. And have also specified the ocrLanguage option in node-tika. Are you sure you are able to get any results with the pdf I just attached here? Sorry if I was not clear enough earlier that I was using tesseract.

mattcg commented 8 years ago

It seems OCRing wasn't set up correctly. I've made a fix. Will you check the version in master?

arimai commented 8 years ago

I checked the version in master for two scanned pdfs. One didn't give me any result and the second gives the following -

Jun 30, 2016 9:43:15 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
SEVERE: No ImageWriter found for 'tif' format
Jun 30, 2016 9:43:15 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG 

I also tried re-installing Tesseract and Leptonica with libtiff, but that didn't solve the problem. It also didn't look like that could be the issue, since I tested with a sample TIFF file and it's able to parse it with no problem.

This issue references the same problem that I am facing with the first pdf but I saw that your code already does what their solution is. Just included it in case it helps in any way.
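One way to read the log above: the writer that is missing belongs to Java's ImageIO, not to Tesseract or libtiff, and the 'tif' format is simply absent from the list of supported formats. The following sketch makes that check explicit. The function name `diagnoseImageWriter` and the shape of its result are invented for this example; only the two log message patterns come from the output shown above.

```javascript
// Hypothetical helper: scans log text like the output above and reports
// which image format had no writer and which formats were available.
function diagnoseImageWriter(log) {
  const missing = /No ImageWriter found for '([^']+)' format/.exec(log);
  if (!missing) return null; // no such error in this log

  const supported = /Supported formats:\s*(.+)/.exec(log);
  const formats = supported
    ? supported[1].trim().split(/\s+/).map(function (f) { return f.toLowerCase(); })
    : [];
  // The list repeats each format in upper and lower case, so de-duplicate.
  const unique = Array.from(new Set(formats));

  return {
    missingFormat: missing[1],
    supportedFormats: unique,
    writerAvailable: unique.includes(missing[1].toLowerCase())
  };
}
```

For the log in this comment, the missing format is `tif` and no TIFF writer shows up among the supported formats, which would explain why reinstalling Tesseract made no difference.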

mattcg commented 8 years ago

Thanks for testing. That issue seems to be caused by the fact that Tika dropped support for extracting TIFF images from PDFs in 1.13. From the change log:

Release 1.13 - 05/08/2016 ...

  • Tiff files are no longer extracted by default. See https://pdfbox.apache.org/2.0/dependencies.html#optional-components for optional components to process Tiff files.

I've pushed a fix for this to master and successfully extracted text from the PDF you attached (thanks for that). Will you confirm the test?

arimai commented 8 years ago

Thanks a lot :) It works now for the PDF I attached. The first PDF still does not give any output, though, but I figured it's because of their upgrade to PDFBox 2.0.

Release 1.13 - 05/08/2016 ...

  • Some truncated/corrupted files that had some content extracted with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912).

This is the open issue regarding the same. Here Tim Allison talks about a repo he created to shade 1.8.x and use that as a backoff parser.
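The general idea of a backoff parser can be sketched in a few lines of JavaScript. This is not the repo mentioned above, only an illustration of the pattern: try one extractor, and fall back to a second one if the first fails or returns nothing usable. Both extractors here are hypothetical functions with the signature `(file, callback(err, text))`.

```javascript
// Sketch of a backoff strategy. `primary` and `fallback` are hypothetical
// extractor functions; the third callback argument says which one was used.
function extractWithBackoff(primary, fallback, file, callback) {
  primary(file, function (err, text) {
    const usable = !err && typeof text === 'string' && text.trim().length > 0;
    if (usable) return callback(null, text, 'primary');

    // Primary failed or produced only whitespace: try the fallback.
    fallback(file, function (fallbackErr, fallbackText) {
      if (fallbackErr) return callback(fallbackErr);
      callback(null, fallbackText, 'fallback');
    });
  });
}
```

The fallback only runs when needed, so files that parse correctly with the primary extractor pay no extra cost.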