Text extraction feature

huming2207 commented 2 years ago

Hi there,

Thanks for your great library!

I'm planning to build an Android, iOS, and desktop app for managing electronic component datasheets and reference manuals from LCSC or Digikey. Here are some PDF datasheet examples from the chip manufacturer:

Basically, I would like to extract the text from these documents so that the user can run full-text, cross-document searches and then see the keywords highlighted on a particular page. These PDF documents are usually written in English or Chinese, and sometimes in Japanese (most chip manufacturers are from countries or regions that speak these languages).

I'm wondering whether there is any simple way to extract text from a PDF. One workaround I can think of is rendering each page to a PdfPageImage, calling createImageIfNotAvailable() to get the Flutter Image object, and then running it through some OCR library. But of course that's very inefficient, and it can be inaccurate in some scenarios.

Meanwhile, PDF.js seems to offer this functionality through page.getTextContent(), and it seems that with some workarounds it's possible to extract the text of a whole PDF. Do you have any hints on implementing something like this? I may give it a try on my own during the New Year holiday and submit a pull request if I can make it work.

Thanks & Regards, Jackson Hu

espresso3389 commented 2 years ago

Currently, the library does not provide such a feature at all.

For Web and iOS, it is an easy task using the platform-provided libraries. For Android, the library currently uses the PdfRenderer provided by Android.

I've been planning to use pdfium directly on Android to expose more PDF features (sobar_pdf is the project I'm working on for that purpose). But it's still half-done because I've been busy with other tasks...

huming2207 commented 2 years ago

Hi @espresso3389,

Thanks for your reply. I had a closer look and indeed it's much harder than I thought. I guess I should try using PDF.js in a WebView for now. Alternatively, I'm wondering whether I can somehow run PDF.js in the background and do the text extraction there.
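A minimal sketch of the Dart side of that idea. It assumes (this is not confirmed anywhere in the thread) that the JavaScript running in the WebView serializes the `items` array returned by `page.getTextContent()` with `JSON.stringify` and passes the string to Flutter. The function name `pageTextFromItemsJson` is hypothetical, and the `hasEOL` field is only present in newer PDF.js versions, so it is treated as optional.

```dart
import 'dart:convert';

/// Joins the `str` fields of PDF.js text items into one page string.
/// `itemsJson` is assumed to be JSON.stringify(textContent.items).
String pageTextFromItemsJson(String itemsJson) {
  final items = jsonDecode(itemsJson) as List<dynamic>;
  final buffer = StringBuffer();
  for (final item in items) {
    final map = item as Map<String, dynamic>;
    // Items without a `str` field contribute nothing.
    buffer.write((map['str'] as String?) ?? '');
    // `hasEOL` may be missing; only add a newline when it is true.
    if (map['hasEOL'] == true) buffer.write('\n');
  }
  return buffer.toString();
}

void main() {
  const json = '[{"str":"Hello","hasEOL":false},'
      '{"str":" world","hasEOL":true},{"str":"next"}]';
  print(pageTextFromItemsJson(json));
}
```

The resulting per-page strings are what a full-text search would run over; the positions of the items still have to be kept separately for highlighting.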

Meanwhile, I think the text extraction is not the hard part. I think the hardest part is the highlighting, i.e. I need to know where each keyword occurrence sits on the page and then draw something on top of it. I'm also thinking of implementing a text extraction library just for that, with PDFBox on Android and PDFKit on iOS, or maybe I should even scrap the idea of using Flutter and start with native apps.
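The "which keyword, and where in the text" half of this can be sketched in plain Dart once per-page text is available from any extractor. The `KeywordMatch` class and `findKeyword` function are hypothetical names, and the search is a simple case-insensitive substring scan, not a real search index. Mapping a match offset back to page coordinates still needs position data from the extractor.

```dart
/// One occurrence of the keyword: page number and offset into the page text.
class KeywordMatch {
  final int page;
  final int offset;
  const KeywordMatch(this.page, this.offset);
}

/// Finds all non-overlapping, case-insensitive occurrences of `keyword`.
/// `pageTexts` maps a page number to that page's extracted text.
List<KeywordMatch> findKeyword(Map<int, String> pageTexts, String keyword) {
  final needle = keyword.toLowerCase();
  final results = <KeywordMatch>[];
  if (needle.isEmpty) return results;
  pageTexts.forEach((page, text) {
    // Offsets refer to the lowercased text; for a few characters
    // lowercasing changes the string length, which this sketch ignores.
    final haystack = text.toLowerCase();
    var start = haystack.indexOf(needle);
    while (start != -1) {
      results.add(KeywordMatch(page, start));
      start = haystack.indexOf(needle, start + needle.length);
    }
  });
  return results;
}

void main() {
  final matches = findKeyword({1: 'VDD and vdd', 2: 'GND'}, 'vdd');
  for (final m in matches) {
    print('page ${m.page}, offset ${m.offset}');
  }
}
```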

espresso3389 commented 2 years ago

@huming2207 For text highlighting, the text extraction itself would have to be done using pdf.js, but the highlighting can be done using the PdfViewerParams.buildPageOverlay property:

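PdfViewer.openFile(
  "somewhere/file.pdf",
  params: PdfViewerParams(
    buildPageOverlay: overlayTextHighlights,
  ),
)

...

Widget overlayTextHighlights(
  BuildContext context,
  int pageNumber,
  Rect pageRect,
) {
  // TODO: implement your highlight logic here, e.g. build a Stack of
  // Positioned boxes drawn over the matched text on this page.
  return const SizedBox.shrink(); // placeholder so the function compiles
}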
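One piece of the highlight logic is the coordinate conversion. The sketch below assumes the match box is known in PDF points with the origin at the bottom-left of the page (as PDF.js and most PDF libraries report it) and converts it to a top-left-origin box scaled to the rendered page size. `OverlayBox` and `pdfBoxToOverlay` are hypothetical names. Whether `pageRect.left` and `pageRect.top` must also be added depends on how the overlay widget is positioned, which the thread does not say.

```dart
/// Plain value type so the sketch runs without Flutter; inside a widget
/// you would use dart:ui's Rect instead.
class OverlayBox {
  final double left, top, width, height;
  const OverlayBox(this.left, this.top, this.width, this.height);
}

/// Converts a box in PDF points (origin bottom-left, y pointing up) into a
/// box relative to the rendered page's top-left corner (y pointing down).
OverlayBox pdfBoxToOverlay({
  required double pdfLeft,
  required double pdfBottom,
  required double pdfWidth,
  required double pdfHeight,
  required double pageWidthPt,
  required double pageHeightPt,
  required double viewWidth, // e.g. pageRect.width
  required double viewHeight, // e.g. pageRect.height
}) {
  final scaleX = viewWidth / pageWidthPt;
  final scaleY = viewHeight / pageHeightPt;
  // Flip the y axis: distance from the page top to the top edge of the box.
  final top = (pageHeightPt - pdfBottom - pdfHeight) * scaleY;
  return OverlayBox(
      pdfLeft * scaleX, top, pdfWidth * scaleX, pdfHeight * scaleY);
}

void main() {
  final box = pdfBoxToOverlay(
    pdfLeft: 72, pdfBottom: 700, pdfWidth: 18, pdfHeight: 10,
    pageWidthPt: 600, pageHeightPt: 800,
    viewWidth: 300, viewHeight: 400,
  );
  print('${box.left}, ${box.top}, ${box.width}, ${box.height}');
}
```

Inside the overlay callback, each converted box could then become a Positioned, semi-transparent Container in a Stack.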

huming2207 commented 2 years ago

Hi @espresso3389,

Thanks for the info. I'm thinking of doing something more complex, probably putting a canvas on top of the page so that I can even display stylus strokes.

For text extraction I think I can use the TextContent object from PDF.js.

I can use it to work out the coordinates of a specific text box by reading the transform field, and get the text itself from the str field. Still, I think it's definitely not the optimal way to do it. I'm also looking at some native libraries to see if I have any luck.
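A rough sketch of reading a box out of one text item, written in Dart on the assumption that the item has been copied over from PDF.js. It relies on the `transform` field being a six-element matrix `[a, b, c, d, e, f]` whose last two entries give the start of the text baseline, and on `width` and `height` being in the same units. It only handles unrotated, horizontal text and ignores descenders below the baseline. `TextItem`, `PdfBox`, and `itemBox` are hypothetical names.

```dart
/// Hypothetical Dart-side copy of one PDF.js text item.
class TextItem {
  final String str;
  final List<double> transform; // [a, b, c, d, e, f]
  final double width;
  final double height;
  const TextItem(this.str, this.transform, this.width, this.height);
}

/// Box in page coordinates: origin bottom-left, y pointing up.
class PdfBox {
  final double left, bottom, width, height;
  const PdfBox(this.left, this.bottom, this.width, this.height);
}

/// Approximates the item's box, assuming unrotated horizontal text
/// (b == 0 and c == 0 in the transform).
PdfBox itemBox(TextItem item) {
  final left = item.transform[4]; // e: x of the baseline start
  final baseline = item.transform[5]; // f: y of the baseline
  return PdfBox(left, baseline, item.width, item.height);
}

void main() {
  const item = TextItem('VDD', [10, 0, 0, 10, 72, 700], 18.5, 10);
  final box = itemBox(item);
  print('${item.str}: ${box.left}, ${box.bottom}, ${box.width}, ${box.height}');
}
```

Rotated or skewed text would need the full matrix applied to the box corners instead of just the translation part.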

Regards, Jackson

huming2207 commented 2 years ago

It looks like with Apple's PDFKit I can use findString() to do this and get a PDFSelection back.

For Android, it may also be doable with the PDFBox port, but it's less trivial: https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/

I will give it a try later and either implement a separate Flutter binding library or contribute to your library if I have time.

Taron133 commented 2 years ago

PDFKit has no night mode! Use 2 libs...

huming2207 commented 2 years ago

> PDFKit has no night mode! Use 2 libs...

I think this is a bit off-topic...?

espresso3389 / flutter_pdf_render

Text extraction feature #65