coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Add API for search functionality #231

Open rvanlaak opened 11 years ago

rvanlaak commented 11 years ago

The output that pdf2htmlEX produces is great, and it is well suited to replacing Acrobat Reader. One feature that makes Acrobat preferable to the browser output is the ability to search in the document.

Feature request: add a search API to the library, so that it is possible to perform text searches in the document.

Features of the API could be:

Once this API works, a next step could be to implement a GUI that makes use of it. I will open another issue for that.

coolwanglu commented 11 years ago

Replace is not possible, at least for now. I don't think it's even supported by PDF readers. I also have doubts about adding bookmarks.

Searching in PDF bookmarks also sounds like a rare use case to me.

I'm not sure if innerText or :contains is enough for these features: see http://stackoverflow.com/questions/12445020/javascript-window-find-doesnt-work-absolutely

But indeed there is a problem when lazy loading is enabled: pages are not loaded until viewed, so we need to load them before searching for any text.

iapain commented 10 years ago

A possible solution would be either to search the text nodes in the DOM and highlight the matches, or to generate an inverted index to use for searching (using https://github.com/fagbokforlaget/pdfiijs, or pdftotext with its output fed into an indexing system).
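The inverted-index route could look roughly like the sketch below. It assumes the text comes from pdftotext, which ends each page with a form-feed character by default. The function names (`split_pages`, `build_index`, `search`) are made up for illustration and are not part of pdf2htmlEX or pdfiijs, and the tokenizer is a naive lowercase word split with no stemming.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def split_pages(text):
    """Split pdftotext output into pages (assumes form-feed page breaks)."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty last element; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def build_index(pages):
    """Map each lowercased word to the sorted 1-based page numbers containing it."""
    index = defaultdict(set)
    for number, page in enumerate(pages, start=1):
        for word in WORD.findall(page.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word of the query (AND semantics)."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical three-page document, as pdftotext might emit it.
sample = "Convert PDF to HTML\fSearch the HTML output\fNothing here\f"
index = build_index(split_pages(sample))
print(search(index, "html"))         # [1, 2]
print(search(index, "search html"))  # [2]
```

An index like this only says which pages match. Highlighting the hit inside the rendered page would still need a pass over that page's text nodes.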

rvanlaak commented 10 years ago

@iapain the library you're proposing sounds great, especially since I've got both a PDF file and the pdftotext output. Does snowball-js support the following use case?

My use case is that I have fragments from the pdftotext output that I would like to show/mark in the original PDF with its original markup. It would be awesome if I could use pdf2htmlEX to preserve the markup from the PDF.
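To make the use case concrete, here is a minimal sketch of locating a pdftotext fragment in per-page text. It assumes the page text has already been extracted (for example from the generated HTML) and that whitespace and letter case may differ between the two sources. `normalize` and `locate_fragment` are hypothetical helpers, and the returned offsets refer to the normalized text, not to DOM nodes.

```python
import re


def normalize(text):
    """Collapse whitespace runs to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def locate_fragment(pages, fragment):
    """Return (page_number, offset) pairs for each occurrence of the fragment.

    Page numbers are 1-based; offsets index into the normalized page text.
    """
    needle = normalize(fragment)
    hits = []
    if not needle:
        return hits
    for number, page in enumerate(pages, start=1):
        haystack = normalize(page)
        start = haystack.find(needle)
        while start != -1:
            hits.append((number, start))
            start = haystack.find(needle, start + 1)
    return hits


# Hypothetical page texts with line breaks and irregular spacing.
pages = [
    "Convert PDF to HTML\nwithout losing text.",
    "Search   the HTML\noutput.",
]
print(locate_fragment(pages, "html without"))  # [(1, 15)]
```

Mapping a hit back to the elements that should be marked is the harder part and is not covered here, since a fragment can span several positioned text spans in the output.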

rvanlaak commented 10 years ago

I've been digging through the changelog, release notes, and Blogspot posts, and found out that it is possible to search the output and to compare the HTML like diffs.

Can you elaborate a bit more on those features? I could not find any documentation about them.