ariya / phantomjs

Scriptable Headless Browser
http://phantomjs.org
BSD 3-Clause "New" or "Revised" License

File download #10052

Closed ariya closed 4 years ago

ariya commented 13 years ago

alexsa...@gmail.com commented:

It would be good to accept (and save) 'Content-Disposition: attachment; filename=' content.

Disclaimer: This issue was migrated on 2013-03-15 from the project's former issue tracker on Google Code, Issue #52. 40 people had starred this issue at the time of migration.

ariya commented 13 years ago

ariya.hi...@gmail.com commented:

This is again related to issue 41.

 

ariya commented 13 years ago

roejame...@gmail.com commented:

Issue 92 has been merged into this issue.

btheado commented 13 years ago

brian.th...@gmail.com commented:

I'm trying to implement this functionality and not making much progress. Using the attached patch, I run:

$ bin/phantomjs examples/download.js

and get this output:

WebPage instantiated
WebPage instantiated
Download complete - fail

I added cout of "WebPage instantiated" (to verify my debug messages work as expected). I also added a cout in my downloadRequested slot. That one did not get displayed. Can someone spot what I'm doing wrong or let me know if I'm on the completely wrong track?

Here is where I found out about the downloadRequested signal: http://doc.qt.nokia.com/latest/qwebpage.html#downloadRequested

btheado commented 13 years ago

brian.th...@gmail.com commented:

Whoops, here is the patch file attachment without the ANSI color codes

n1k0 commented 13 years ago

nperria...@gmail.com commented:

Any progress on this issue?

ariya commented 13 years ago

ariya.hi...@gmail.com commented:

No progress as of now.

n1k0 commented 13 years ago

nperria...@gmail.com commented:

A friend of mine (http://svay.com/) just told me a nice trick for working around this issue: use XHR within the page environment and base64 encoding to retrieve the file contents. It works rather well. For the record, you can find an example here: http://jsfiddle.net/3kUXy/
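A minimal sketch of the encoding half of that trick, not the code from the linked fiddle. It assumes the page-side XHR delivered the file as a binary string with one character per byte (for example by calling overrideMimeType with charset=x-user-defined); the name binaryStringToBase64 is made up for illustration.

```javascript
var BASE64_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode a binary string (one character per byte) as standard base64.
// Only the low 8 bits of each character are used, because the
// x-user-defined charset maps high bytes into the 0xF7xx range.
function binaryStringToBase64(binary) {
    var out = '';
    for (var i = 0; i < binary.length; i += 3) {
        var remaining = binary.length - i;
        var b0 = binary.charCodeAt(i) & 0xff;
        var b1 = remaining > 1 ? binary.charCodeAt(i + 1) & 0xff : 0;
        var b2 = remaining > 2 ? binary.charCodeAt(i + 2) & 0xff : 0;
        out += BASE64_CHARS.charAt(b0 >> 2);
        out += BASE64_CHARS.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
        // Pad with '=' when the final group has fewer than three bytes.
        out += remaining > 1 ? BASE64_CHARS.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : '=';
        out += remaining > 2 ? BASE64_CHARS.charAt(b2 & 0x3f) : '=';
    }
    return out;
}

console.log(binaryStringToBase64('Man')); // prints "TWFu"
```

Returning a base64 string from page.evaluate() is convenient because only plain serializable values cross from the page context to the PhantomJS script.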

ariya commented 12 years ago

gopiredd...@gmail.com commented:

The URL to the file is not always known so XHR is not a general solution. For instance, if you are downloading a utility/bank/cc statement, you may have to click a link which will possibly execute some JS code and trigger another page load with a frame embedding the PDF. Or the statement comes in as an attachment.

What will it take to support the file download feature?

Requirement: Download files that come in embedded in the page/frame or as attachments. The URLs may or may not be known. Allow saving the files to the file system or "upload" them to a web server (so the server can save the files in a DB for instance).

ariya commented 12 years ago

ja...@recovend.com commented:

I've got an early but functional version of this at

https://github.com/woodwardjd/phantomjs/tree/add_download_capabilities

Example:

var page = require('webpage').create();

page.onUnsupportedContentReceived = function(data) {
    console.log('Got a download at url: ' + data.url);
    page.saveUnsupportedContent('some.file.path', data.id);
    phantom.exit();
};

page.open('http://some.pdf.url.com/some.pdf');

I call this "early but functional" because it works where I've tested it (Linux, PDF downloads), but it likely has a small memory leak, and I'm not 100% convinced the callback mechanism I used is ideal.

Comments desired.

ariya commented 12 years ago

rotava...@gmail.com commented:

I've downloaded and built the branch above, but I can't seem to get the onUnsupportedContentReceived event to fire, and calling saveUnsupportedContent throws an undefined error. Are there special build steps required to enable it?

Thanks, Robert

ariya commented 12 years ago

ja...@recovend.com commented:

No special build steps required, as far as I know. If saveUnsupportedContent is undefined, maybe you haven't built the version in the add_download_capabilities branch (git checkout add_download_capabilities after the git clone)? Just speculating.

chbrown commented 12 years ago

audi...@gmail.com commented:

I second the XHR+base64 method. It takes another 50+ lines of code to send to page.evaluate(), and I have to de-base64 the content afterward, and that's basically how CasperJS does it (as far as I can tell from their code—they do a lot of weird (unnecessary, in my book) binding with window.utils in the page context).

I used this one (first answer): http://stackoverflow.com/questions/7370943/retrieving-binary-file-content-using-javascript-base64-encode-it-and-reverse-de

It works great. Just be sure to try-catch the call to base64ArrayBuffer(), because new Uint8Array(arrayBuffer) may throw an error, and check that xhr.getResponseHeader('content-type') == 'application/pdf' if you're doing PDF downloads like I was.
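For the "de-base64 afterward" step, a decoder along these lines could run on the PhantomJS side. This is a sketch under assumptions, not CasperJS's implementation: base64ToBinaryString is a hypothetical name, and the input is assumed to be well-formed standard base64.

```javascript
var BASE64_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Decode standard base64 into a binary string (one character per byte).
function base64ToBinaryString(base64) {
    // Drop padding, whitespace and anything else outside the alphabet.
    var clean = base64.replace(/[^A-Za-z0-9+\/]/g, '');
    var out = '';
    var buffer = 0; // holds bits that have not been emitted yet
    var bits = 0;   // number of valid bits currently in buffer
    for (var i = 0; i < clean.length; i++) {
        buffer = (buffer << 6) | BASE64_CHARS.indexOf(clean.charAt(i));
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += String.fromCharCode((buffer >> bits) & 0xff);
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
    }
    return out;
}

console.log(base64ToBinaryString('TWFu')); // prints "Man"
```

The decoded string would then need to be written in binary mode (the 'wb' mode that appears elsewhere in this thread) so that the bytes are not re-encoded on the way to disk.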

subelsky commented 12 years ago

subel...@gmail.com commented:

I need this as well. Can't use the XHR method because the inline attachments I need to scrape don't come with a URL I can hit.

chbrown commented 12 years ago

audi...@gmail.com commented:

Wouldn't inline attachments be even more easily downloaded? For an image:

var content = page.evaluate(function() { return $('img#whatever').attr('src'); });
fs.write(yer_path, content, 'w');


Ariya, can you give some estimate of how long this feature (downloading a url) would take to implement? I'd love to get involved in PhantomJS development, but maybe this issue is a lot trickier than it sounds?

subelsky commented 12 years ago

subel...@gmail.com commented:

Sorry, I didn't mean to write "inline". The file I need is not an image and is not part of the DOM. It gets sent as a result of a POST with the Content-Disposition header 'attachment;filename="report.csv"'
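To make the header in question concrete, here is a rough sketch of how a script could pull the filename out of a Content-Disposition value once it has access to the response headers. The function name parseContentDisposition is hypothetical, and the parsing is deliberately simplified: it does not handle the extended filename* form or semicolons inside quoted filenames.

```javascript
// Parse a Content-Disposition header value into a small result object.
// Simplified on purpose: no filename* support, no escaped quotes.
function parseContentDisposition(header) {
    var result = { isAttachment: false, filename: null };
    if (!header) {
        return result;
    }
    var parts = header.split(';');
    result.isAttachment = parts[0].trim().toLowerCase() === 'attachment';
    for (var i = 1; i < parts.length; i++) {
        var match = /^\s*filename\s*=\s*(.*?)\s*$/i.exec(parts[i]);
        if (match) {
            // Strip one pair of surrounding double quotes, if present.
            result.filename = match[1].replace(/^"(.*)"$/, '$1');
        }
    }
    return result;
}

var parsed = parseContentDisposition('attachment;filename="report.csv"');
console.log(parsed.isAttachment, parsed.filename); // prints "true report.csv"
```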

ariya commented 12 years ago

bogusan...@gmail.com commented:

Hi there. I think the base64-encoding solution can only be a stop-gap solution.

  • Downloading big files will probably exhaust memory and base64 encoding and -decoding it will use up resources that would have better been spent elsewhere - therefore we want to have the option to redirect a downloaded stream to file
  • We may have pages where we cannot control the loading of a file that is not supported (e.g. PDF)
  • We may want to save resources that have already been loaded as part of the page (e.g. images)

I think the optimal solution would be to add functionality to the onResourceReceived hook to allow setting up a "redirection" handler, and if such a handler is set, unsupported file formats should silently be downloaded. This handler could then have another onDownloadFinished hook to resume operation once the download is done.

JamesMGreene commented 11 years ago

james.m....@gmail.com commented:

 

 

subelsky commented 11 years ago

I'm interested in committing some of my company's resources to adding this feature. Is anyone already working on it? If so, could my company sponsor your work? If not, we can assign it to one of our own people. I just want to avoid duplicating anyone else's work.

MichaelCation commented 11 years ago

I'm also interested in helping with this feature. We're trying to capture an Acrobat file that is sent as a result of a POST with the Content-Disposition header 'attachment;filename="file.pdf"'. Is anyone working on this? I don't want to duplicate effort. Ideally we want to access the functionality from CasperJS as well.

maxcan commented 11 years ago

any progress on this?

extempore commented 11 years ago

I'd love to see this fixed too. I saw @Vitallium has a fork with download support, as well as a few other fixes.

https://github.com/Vitallium/phantomjs/tree/download-support

Is anyone else able and available to help merge the new code? I wouldn't be doing anyone a favor if I messed with the C++ codebase. I wouldn't mind donating towards a bounty for this.

vitallium commented 11 years ago

This feature is under development. When it's ready, it'll be merged into the master tree. I can't say when this feature will be ready.

FergusNelson commented 11 years ago

I'm also interested in this issue. Will we be able to render the pdf content as png / jpeg? Or is that altogether a different problem?

chbrown commented 11 years ago

@FergusNelson that's a different problem, but much more easily solved using ghostscript, X11, ImageMagick, etc.

subelsky commented 11 years ago

looks like @Vitallium is pretty far along with an awesome solution in his download-support branch, described here: https://groups.google.com/forum/#!msg/phantomjs/JChUakj--24/epby47h3ZGAJ

matthewlmcclure commented 11 years ago

I see that there are at least two attempts to address this issue on GitHub. @woodwardjd's add_download_capabilities branch, and @Vitallium's download-support branch. Is one of those a more promising path forward than the other? What work is outstanding before it would be ready to merge upstream?

0o-de-lally commented 11 years ago

@Vitallium how close is this to being merged with the master?

matthewlmcclure commented 11 years ago

I rebased @Vitallium's download-support branch on a recent master HEAD.

I've been exercising it with a happy path test case, and it seems to be working fine.

@ariya and @Vitallium,

I'd like to continue the work that @Vitallium started if there's more to do.

What do you think blocks merging this upstream?

vitallium commented 11 years ago

I actually want to rework the 'download-support' branch so that it behaves more like real browsers. But I haven't posted my ideas to the mailing list yet (https://groups.google.com/forum/#!topic/phantomjs/JChUakj--24). So, I want to:

masahirominami commented 11 years ago

Hi, we are having trouble downloading files too. We tried the download-support branch code, but the onFileDownload() callback does not seem to be called. We assume this is because the web page does not return a "Content-Disposition" header, only an "application/octet-stream" content type. (The target page is not our code, so we can't change anything on the server side.)

PhantomJS seems to stop executing when the "download" button is clicked, so we are not sure whether onFileDownload is simply not called or whether the whole process gets lost and suspended somewhere. However, we still think the cause is the "application/octet-stream" Content-Type header.

I'm not sure if I'm making myself clear, but we want to know 1) whether our understanding about the missing Content-Disposition header is correct, 2) whether Vitallium's DownloadManager will solve this problem, and finally, 3) if so, whether it will be available soon (say, within a month).

Thank you, minami

UPDATE: it seems this one works in our case: https://github.com/ariya/phantomjs/pull/11484

thank you

leomao10 commented 11 years ago

May I ask what the progress on this feature is?

simonweil commented 10 years ago

:+1:

mentero commented 10 years ago

:+1:

momogentoo commented 10 years ago

For some cases, one workaround is enabling phantomjs cache and scanning cache directory to retrieve that downloaded attachment.

vitallium commented 10 years ago

This feature will be in the next version. So, stay tuned!

thomasmodeneis commented 10 years ago

up!

barek2k2 commented 10 years ago

Need this ASAP :-)

ehartford commented 10 years ago

+1

ersatzryan commented 10 years ago

@Vitallium do you have any details about when that will be?

pilavdzic commented 10 years ago

For those who need file download ability now, from what I understand casperjs solves this.

pilavdzic commented 10 years ago

Correction: I tried out CasperJS, and downloading large files does not work; the files come out as 0 bytes. The CasperJS folks say this relates to another bug in PhantomJS, the inability to set a larger timeout value. Please fix these bugs; downloading large files is very important for automation and testing!

Schweinepriester commented 10 years ago

push!

xaviershay commented 10 years ago

Happy to beta test anything here.

I'm trying to download an xlsx file and get access to the content.

realtebo commented 10 years ago

+1 for fixing the large timeout bug

I need to download a 25 MB Excel file every day at the same time, after logging in, searching, and so on.

So CasperJS was my friend ... or could be my friend, because with this bug I cannot download the file ... sgrunt!

simonweil commented 10 years ago

@realtebo, did you try using CasperJS with SlimerJS? Because of PhantomJS bugs I use SlimerJS and it works very well.

dimzon commented 10 years ago

I need this too ASAP

Lee-Nover commented 10 years ago

my current workaround is to use an XMLHttpRequest to GET the file as 'arraybuffer' inside page.evaluate() so we keep the page context with cookies and all, then use the 'fs' module to write the binary data.

    var fs = require('fs');

    var results = page.evaluate(function () {
        // Downloads have to run in the context of the web page
        // so that the page's cookies are sent with the request.
        function downloadReport(id, name) {
            console.log('downloading: ' + name);
            var result = {};
            try {
                // Synchronous XHR, so the data can be returned directly.
                var xhr = new XMLHttpRequest();
                xhr.open("GET", "http://host/api/v1/reports/" + id, false);
                xhr.responseType = 'arraybuffer';
                xhr.send(null);
                var bin = xhr.response;
                // Turn the bytes into a binary string, one character per byte.
                var u8 = new Uint8Array(bin), ic = u8.length, bs = [];
                while (ic--) { bs[ic] = String.fromCharCode(u8[ic]); }
                result.data = bs.join('');
                result.name = name;
            } catch (e) {
                result.error = JSON.stringify(e);
            }
            return result;
        }

        var result = [];
        result.push(downloadReport(123, 'report.pdf'));
        return result;
    }, token); // token comes from the surrounding script (not shown); the function above does not use it

    results.forEach(function (item) {
        if (item.data != null)
            fs.write(item.name, item.data, { mode: 'wb' });
        else
            console.log(item.error);
    });
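The byte-to-string step in this workaround can also be done in chunks instead of one character per loop iteration. The sketch below is only an illustration of that idea; bytesToBinaryString and the chunk size of 8192 are assumptions, not part of the original workaround.

```javascript
// Convert a Uint8Array (or any array-like of byte values) into a
// binary string, processing a fixed-size chunk at a time so that
// String.fromCharCode.apply never receives a huge argument list.
function bytesToBinaryString(bytes) {
    var CHUNK = 8192; // arbitrary chunk size chosen for this sketch
    var pieces = [];
    for (var i = 0; i < bytes.length; i += CHUNK) {
        var chunk = Array.prototype.slice.call(bytes, i, i + CHUNK);
        pieces.push(String.fromCharCode.apply(null, chunk));
    }
    return pieces.join('');
}

console.log(bytesToBinaryString(new Uint8Array([72, 105]))); // prints "Hi"
```

Either way, the whole file still passes through memory as a string, so this does not address the concern about very large downloads raised earlier in the thread.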

Loknar commented 10 years ago

+1

ecdeveloper commented 10 years ago

+1

ecdeveloper commented 10 years ago

I came up with another workaround. From within page.evaluate I click on the link I need to download, then listen for onResourceReceived.

page.set('onResourceReceived', function (resource) {
    if (resource.contentType && resource.stage === 'end' && resource.contentType.indexOf('application/pdf') > -1) {
        console.log(resource);
        // Here you can download the file from resource.url by using http(s) request (e.g. https://gist.github.com/ialpert/3136595)
    }
});
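The condition in that handler can be pulled out into a small predicate, which makes it easier to watch for more than one content type. This is a sketch only: isFinishedDownload and the wantedTypes parameter are hypothetical names, and the resource object is assumed to carry the stage and contentType fields used in the snippet above.

```javascript
// Return true when a received resource has finished loading and its
// content type (ignoring parameters such as charset) is in wantedTypes.
function isFinishedDownload(resource, wantedTypes) {
    if (!resource || resource.stage !== 'end' || !resource.contentType) {
        return false;
    }
    var type = resource.contentType.split(';')[0].trim().toLowerCase();
    return wantedTypes.indexOf(type) !== -1;
}

var sample = { stage: 'end', contentType: 'application/pdf; charset=binary', url: 'http://host/file.pdf' };
console.log(isFinishedDownload(sample, ['application/pdf'])); // prints "true"
```

Note that fetching resource.url again from outside the page is a separate request, so it will probably not carry the page's session unless the cookies are forwarded, and it cannot replay a download that was triggered by a POST.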