WebMemex / webmemex-extension

📇 Your digital memory extension, as a browser extension
https://webmemex.org
Other
208 stars 45 forks source link

PDF support added #88

Closed the-fallen closed 7 years ago

the-fallen commented 7 years ago

Sorry for the late commit, been really busy with college assignments lately. Closed the last pull request and made this new one due to some issues. As @Treora said, this commit takes cares of pdfjs-worker cleanup, doesn't load pdf twice and doesn't check if it's pdf twice. The reason i think this is a better solution is if we do it in background, in case of a different module, we have to wait on an async chrome.tabs.query call everytime to get url and check if it's a pdf. Also, this saves alot of code replication. Thanks alot for reviewing

Treora commented 7 years ago

Nice, thanks for continuing with this!

Just FYI, you know you can just reset a branch and git push -f to continue on the same PR with new code? This is okay too though.

Your choice of combining extractPageMetadata and extractPageText into one function makes sense, though I would leave the ugly details of calling Readability in their own file. Note that the current extraction stuff was all made rather quickly and tentatively. I would not mind at all to reorganise it more thoroughly, and think again which data we want to store and how (suggestions welcome).

As for code details, some style needs cleaning (try npm run lint-fix), try refactor and comment extract-pdf-data to make it pleasant to read (also feel free to use await instead of then), and use the new fetch API instead of the XMLHttpRequest.

the-fallen commented 7 years ago

@Treora As you mentioned, Readability mess is now in a seperate file, comments have been added in extract-pdf-data wherever needed, used await instead of then everywhere(thanks for the suggestion, it really does look a lot more understandable!), used fetch API to get the pdf, did lint-fix and removed most of the errors. Also, to improve search for pdfs, i added a few fields in searchableTextFields for metadata like author, subject, keywords. Do give your thoughts if you think any other search field should be added. Thanks alot for your inputs.

the-fallen commented 7 years ago

@Treora My bad, i didn't touch that part of oliver's code with callbacks, should've rewritten. Implemented now as you said with minor changes. Did all the other changes as you pointed out. Thanks for your time

Treora commented 7 years ago

Beautiful. I will test it out, if it works then it seems ready for merging (I will go through it once more and do some small comment rewording and style nitpicking).

Do please let know if you copy parts of other people's code; I am quite conscious about copyright, trying to keep everything this repo in the public domain (should still declare that somewhere); see e.g. Unlicense. On that note, would you be willing to waive all your copyrights on your contributions?

the-fallen commented 7 years ago

@Treora Yes, of course, i am willing. Also, no i haven't copied code from outside, nothing there to copy really, even the last traces of oliver's code are gone after rewriting pdf text extraction. On another note, before merging, after you're done testing, do remove the console log statement in page-analysis/background/index.js line 35

Treora commented 7 years ago

I just tested this with a few more pdf files, seems to work so far. As discussed, the data model should be revised, and also problems may still pop up; especially performance on large documents could be an issue. But merging now anyway: we will discover and solve those problems as we go.

Merged in d59ce46.

Treora commented 7 years ago

And, of course, thanks for the effort!

the-fallen commented 7 years ago

Happy to contribute!