WebMemex / webmemex-extension

📇 Your digital memory extension, as a browser extension
https://webmemex.org

PDF support added #79

Closed the-fallen closed 7 years ago

the-fallen commented 7 years ago

As mentioned in issue #27, this adds a module for PDF extraction and some minor changes to page analysis to support PDFs. To test it, run npm install to get the pdfjs-dist library (see the change in package.json), or run npm install pdfjs-dist --save. After running make, check the background.js output in the console to see the text and metadata objects.
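For illustration, here is a minimal sketch of how text and metadata can be pulled out of a PDF with pdf.js. It is not the code from this PR. It assumes `pdf` is a pdf.js PDFDocumentProxy (the object that the promise from pdf.js's `getDocument()` resolves to); the function name `extractTextAndMetadata` and the shape of the returned object are hypothetical.

```javascript
// Sketch only: `pdf` is assumed to be a pdf.js PDFDocumentProxy.
// The function name and the returned object shape are illustrative.
async function extractTextAndMetadata(pdf) {
  const pageTexts = [];
  // pdf.js numbers pages starting from 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const textContent = await page.getTextContent();
    // Each text item carries one run of text in its `str` field.
    pageTexts.push(textContent.items.map(item => item.str).join(' '));
  }
  // getMetadata() resolves to { info, metadata }; `info` holds the
  // document information dictionary (Title, Author, ...).
  const { info } = await pdf.getMetadata();
  return {
    text: pageTexts.join('\n\n'),
    metadata: { title: info.Title, author: info.Author },
  };
}
```

Because the function only relies on `numPages`, `getPage()` and `getMetadata()`, it can be tried out with a small stand-in object before wiring it to a real document.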

As for the two situations @Treora mentioned (the user visiting a document versus just having its URL): if the file is in the cache, XMLHttpRequest gets it from the cache without downloading it again. Also, the PDF worker added to the extension folder doesn't look clean, I know. But the PDFJS library uses document.currentScript to load the worker script, which is not allowed in content scripts. So this is a workaround I had to use to get it working; I'd be glad if someone could suggest a cleaner fix.
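A rough sketch of fetching a PDF's bytes with XMLHttpRequest follows, as one way the request described above might look. The function name `fetchPdfBytes` and the injectable `XhrClass` parameter are hypothetical and only there to make the sketch easy to try outside a browser. Whether the response actually comes from the cache is decided by the browser, not by this code.

```javascript
// Sketch only: resolves with the PDF's bytes as an ArrayBuffer.
// `XhrClass` defaults to the browser's XMLHttpRequest; it is a parameter
// purely so a stand-in can be supplied in tests.
function fetchPdfBytes(url, XhrClass = XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XhrClass();
    xhr.open('GET', url);
    // Ask for binary data rather than text.
    xhr.responseType = 'arraybuffer';
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(xhr.response);
      } else {
        reject(new Error(`Request for ${url} failed with status ${xhr.status}`));
      }
    };
    xhr.onerror = () => reject(new Error(`Network error while fetching ${url}`));
    xhr.send();
  });
}
```

The resulting ArrayBuffer is the kind of binary input that pdf.js can load a document from.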

blackforestboi commented 7 years ago

I tested it: the text and metadata extraction works! (It is not searchable yet, though, because of the current bug that breaks search.)

the-fallen commented 7 years ago

Thanks for the browser API correction (I've updated that) and the PDF worker suggestion!

Also, to solve both problems at once (the double PDF loads and the double PDF checks), maybe we can have a single extract-pdf-data.js inside page-analysis/content_script that returns a promise for an object containing both the metadata and the text. A single PDF check in page-analysis/background/index.js would then decide whether to use extract-pdf-data.js to set both text and metadata at once for a PDF, or to use extract-page-text.js and extract-page-metadata.js for a normal page as usual.
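The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not the implementation: the `isPdf` check (by content type or `.pdf` extension), the `analysePage` name, and the `extractors` object standing in for the three modules are all assumptions, since the thread does not show how PDFs are detected or how the modules are called.

```javascript
// Hypothetical PDF check: trust the content type if known, else the URL.
function isPdf({ url, contentType }) {
  if (contentType) {
    return contentType.split(';')[0].trim().toLowerCase() === 'application/pdf';
  }
  return /\.pdf$/i.test(new URL(url).pathname);
}

// Single decision point: one check, then one of two extraction paths.
// `extractors` stands in for extract-pdf-data.js, extract-page-text.js
// and extract-page-metadata.js; their signatures here are assumed.
async function analysePage(page, extractors) {
  if (isPdf(page)) {
    // One load of the PDF yields both text and metadata.
    const { text, metadata } = await extractors.extractPdfData(page);
    return { text, metadata };
  }
  // Normal pages keep using the two separate extractors.
  const [text, metadata] = await Promise.all([
    extractors.extractPageText(page),
    extractors.extractPageMetadata(page),
  ]);
  return { text, metadata };
}
```

With this shape the PDF is checked for once and loaded once, and the rest of the analysis code receives the same `{ text, metadata }` object either way.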

I'll be working on my project proposal for a few days, so I will implement this as soon as I'm done with that.