balta2ar / brotab

Control your browser's tabs from the command line
MIT License
389 stars 27 forks

Download pdf from tabs #15

Open doronbehar opened 4 years ago

doronbehar commented 4 years ago

Hey and thanks for creating this very useful tool.

Say I have a PDF file or other webpage open in a certain tab and I'd like to download it but this page / file is protected by a login page. It would have been great to use brotab for this task.

I've found a less than ideal alternative, but its usage is clumsy in comparison to the rest of brotab's awesome UI / design: https://addons.mozilla.org/en-US/firefox/addon/cliget/ .

balta2ar commented 4 years ago

Judging from the amount of code in cliget, at first glance it seems like an elaborate task to replicate this functionality here in brotab, which I'd naturally prefer to avoid. The difficulty is, AFAIU, extracting cookies. From what I can see, cliget intercepts onRequestStarted requests and reads cookies from headers. I'm not sure how download requests are initiated, though. Ideally, of course, I'd prefer having an API on cliget's side so that I could simply reuse it if it's available. @zaidka what do you think?

@doronbehar feel free to suggest other ideas, I'm a noob when it comes to the frontend part, really. Also, looking at the number of issues, I assume it gets hairy sometimes and doesn't work 100% of the time. That's just my first impression after looking at the cliget code for 20 minutes.

doronbehar commented 4 years ago

I have an idea but I haven't researched it: Maybe the extension could directly use the API to pipe the whole content of the webpage to brotab the executable - via native messaging. In the case of a tab with a PDF file this may not be so trivial as with plain HTML pages. This way there are no cookies / other headers we need to handle...

balta2ar commented 4 years ago

That's the trivial part and you can do it now using "bt text | cut -f4 | grep -o url_matching_regexp". But you won't be able to download login-protected urls using that, you need cookies for that.

doronbehar commented 4 years ago

When I wrote:

pipe the whole content of the webpage to brotab the executable

I specifically meant: The HTML of the webpage, not the plain text version of it. Hence I'm not sure I figured out what would be the purpose of grep -o url_matching_regexp...

The same way the extension passes the text to the executable, can't it pass the HTML for the executable to print as well? Could you perhaps point me to the location in the code where the extension does this, so I can investigate similar Web Extension APIs and suggest a more concise solution?

balta2ar commented 4 years ago

I think it's just a misleading name "bt text". What you actually get is html. Try it.

balta2ar commented 4 years ago

Or maybe I'm wrong: https://github.com/balta2ar/brotab/blob/0485ea9bfbae328833761442062c61eb4a4c73b7/brotab/extension/background.js#L6

It should be trivial to add a method to retrieve html if it's not there though
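
As a rough illustration of what such a method might look like, here is a minimal sketch. It assumes the promise-based `browser.tabs.executeScript` API available to Firefox WebExtensions and sufficient host permissions for the tab's URL. The name `getTabHtml` and the injected `tabsApi` parameter are hypothetical and not taken from brotab's code; passing the API in as an argument only serves to make the function testable outside a browser.

```javascript
// Hypothetical helper (not part of brotab): return the serialized DOM of one tab.
// `tabsApi` is assumed to behave like `browser.tabs` in a Firefox extension.
async function getTabHtml(tabsApi, tabId) {
  // executeScript resolves to an array with one result per injected frame;
  // with default options only the top-level frame is injected.
  const results = await tabsApi.executeScript(tabId, {
    code: 'document.documentElement.outerHTML',
  });
  if (!Array.isArray(results) || typeof results[0] !== 'string') {
    throw new Error(`no HTML returned for tab ${tabId}`);
  }
  return results[0];
}
```

In a background script this would presumably be called as `getTabHtml(browser.tabs, tabId)`. Whether it returns anything useful for a tab that shows a PDF is untested here.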

zaidka commented 4 years ago

I'm not familiar with this extension, but in cliget it should be possible to get the download command from another extension. cliget listens for the message "generateCommand". I haven't tested that myself, but I'm happy to assist with that.

balta2ar commented 4 years ago

Say I have a PDF file or other webpage open in a certain tab and I'd like to download it

I misinterpreted or just missed that part on first reading. Now I see that you just want the raw content of a page (html).

@zaidka thanks for the prompt reply, appreciate it. It looks like @doronbehar had meant something simpler than I managed to read between the lines :)

The idea of integrating extensions together still sounds appealing to me, I could try it just as an experiment.

doronbehar commented 4 years ago

I'm not familiar with this extension, but in cliget it should be possible to get the download command from another extension. cliget listens for the message "generateCommand"

I tend to think we don't need cliget at all since, as @balta2ar said before, it may lead to users encountering download errors because of cookie issues. Our implementation should be more reliable if we use the API directly.

It should be trivial to add a method to retrieve html if it's not there though

Sounds good! Having access to the HTML would definitely be an improvement, but TBH I'm mostly interested in getting the raw content of PDFs. Once the HTML method is implemented, we'll need to test whether the 'HTML' content of a PDF tab is retrieved as the pdf.js viewer's HTML or as the real raw PDF.

PS: Another point I was thinking of: It would be nice to be able to select the text of a specific tab in both brotab text and the currently planned brotab html. @balta2ar Would you like me to open a separate issue for this smaller enhancement request?

balta2ar commented 4 years ago

@doronbehar what do you mean by "select"? bt text returns text along with tab id, you can filter by it.

doronbehar commented 4 years ago

Well I may be able to craft a sed or grep filter but that's not trivial at all IMO. Even now, I can't say I've figured out exactly how the pattern goes..

balta2ar commented 4 years ago

@doronbehar ok, let's create an issue and discuss there. Include as much detail as you can, examples, etc.

balta2ar commented 4 years ago

I did a quick search on how to extract PDF contents from a page using JS. Results are not satisfying, unfortunately.

Firefox

Some quick, intuitive hacking in the Firefox console revealed that PDF contents can be extracted as follows:

await window.PDFViewerApplication.pdfDocument.getData()
Uint8Array(100194) [ 37, 80, 68, 70, 45, 49, 46, 51, 10, 37, … ]
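
The leading bytes in that output (37, 80, 68, 70, 45, 49, 46, 51) are the ASCII codes for "%PDF-1.3", so the array does appear to hold the raw file. The sketch below shows one way such bytes could be sanity-checked and then base64-encoded so they can travel inside a JSON message. The function names `looksLikePdf` and `bytesToBase64` are made up for illustration, and the idea that a JSON-based channel would carry the data is an assumption, not something brotab is confirmed to do.

```javascript
// Hypothetical helpers for handling the Uint8Array returned by getData().

// Check for the "%PDF-" signature at the start of the byte array.
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return signature.every((b, i) => bytes[i] === b);
}

// Base64-encode the bytes so they survive a JSON round trip.
// The input is processed in chunks so that String.fromCharCode is never
// called with an excessively long argument list.
function bytesToBase64(bytes) {
  const chunkSize = 0x8000;
  let binary = '';
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}
```

Base64 inflates the payload by roughly a third, which may matter for large documents.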

Chrome/Chromium

I couldn't find any way to extract data similarly for Chrome/Chromium from the PDF viewer plugin. There are two related unanswered questions about that on StackOverflow:

https://stackoverflow.com/questions/45461665/read-file-contents-from-embed-tag-in-chrome
https://stackoverflow.com/questions/45806947/how-do-i-access-the-raw-pdf-file-via-javascript-embedded-in-an-object-tag

Kludge

A hacky way that seems to work in both browsers is to just download page URL again and return its contents, e.g.:

await fetch(window.location.toString(), {}).then(r => r.blob())
Blob {size: 100194, type: "application/pdf"}

I don't really like this method (at the very least it shouldn't be named "bt html"; maybe "bt redownload"). If it is executed blindly without arguments, and the argument semantics of the other commands apply here as well, it will put a lot of pressure on the browser, possibly rendering it unresponsive for an annoyingly long interval.
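
For illustration, the re-download approach could be wrapped roughly as follows. This is only a sketch under stated assumptions: the name `redownload` and the `fetchImpl` parameter are hypothetical, and the function always takes an explicit URL so that it cannot be run against every tab by accident.

```javascript
// Hypothetical helper: request a URL again and return its raw bytes.
// `fetchImpl` defaults to the global fetch; it is a parameter only so the
// function can be exercised with a stub outside the browser.
async function redownload(url, fetchImpl = fetch) {
  // Same-origin requests send cookies by default; 'include' makes the
  // intent explicit for login-protected pages.
  const response = await fetchImpl(url, { credentials: 'include' });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} while fetching ${url}`);
  }
  return {
    type: response.headers.get('content-type'),
    bytes: new Uint8Array(await response.arrayBuffer()),
  };
}
```

Run in the page's own context, this would be called as `redownload(window.location.toString())`. It still costs a second network request per page, so the concern about load remains.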

Summary

I see three ways we can proceed here:

  1. Someone more experienced suggests a way to get PDF plugin contents in Chrome, or maybe does research on that topic. Retrieving page contents (including PDF) without redownloading in both supported browsers would be the best solution.
  2. Implement a heavy "bt redownload" command.
  3. Work on integration with cliget.

doronbehar commented 4 years ago

Wow @balta2ar, I really appreciate your work!

A comment by the OP here suggests a certain route which hasn't been investigated:

Can I register to an event, so when the page is loaded and the file is fetched, my callback will be notified with the file contents? – DenisY

This certainly won't be easy... However, would you consider supporting this feature only for Firefox users in the meantime? After all, the reason it's harder on Chrome is that it doesn't use the free (as in speech) PDF.js engine but a different, proprietary engine...

As for implementing the redownload command / integrating with cliget, I'm less fond of these options because they will fail if: