ariya / phantomjs

Scriptable Headless Browser
http://phantomjs.org
BSD 3-Clause "New" or "Revised" License

PhantomJS is not capturing images from div having style as overflow hidden. #15001

Closed rkumar-c closed 7 years ago

rkumar-c commented 7 years ago

Hi, I am using PhantomJS 2.1.1 to capture a screenshot of a webpage. The page has three divs styled with overflow hidden and overflow auto. Unfortunately, PhantomJS captures only the images in the top div and ignores the images in the other divs. This is the URL where I see the issue: https://item.mercari.com/gl/m56999829099/

The captured image is below:

screenshot1

Please help me fix this as soon as possible. Rakesh.

JustinYi922 commented 7 years ago

Please paste your code here. Perhaps somebody can help you.

bologer commented 7 years ago

@rkumar-c, what do you mean by overflow elements? For example, a menu that slides horizontally and shows only a few elements at first? In that case, the screenshot will not show all elements of the menu, because the rest are hidden (you need to slide to see them), and you would get the same result if you opened this webpage on a normal mobile device.

rkumar-c commented 7 years ago

I need to capture a screenshot of the desktop webpage, not the mobile one. I understand that there is a scroll, but at least the images that are visible should be captured. If you look at the image, no images are captured in the second and third frames. I worked around this by writing custom JS inside page.evaluate; here is the code:

page.evaluate(function() {
    var box = document.getElementsByClassName("items-box-photo");
    for (var i = 0; i < box.length; i++) {
        // var myimg = box[i].getElementsByTagName('img')[0];
        var someimage = box[i].children[0].getAttribute('data-src');
        console.log("image src " + someimage);
        box[i].innerHTML = " ";
    }
});

But this is page-specific, and we cannot use it to capture arbitrary webpages. Now I am having trouble capturing screenshots of webpages that use AJAX and lazy loading, such as amazon.in or http://www.shopclues.com/fashion.html

So is there a way to write common code that captures a screenshot of Amazon or any other lazy-loading webpage?
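One way the page-specific workaround above might be generalised is to copy each image's lazy-load attribute into `src` before rendering. The following is only a minimal sketch: the helper name `promoteLazyImages`, the attribute name `data-src`, the output file `lazy.png`, and the 5-second delay are all assumptions for illustration, and sites that load images through other mechanisms will not be covered.

```javascript
// Copy a lazy-load attribute into src so images load without scrolling.
// ASSUMPTION: the site keeps the real URL in `attr` (e.g. 'data-src').
// `doc` exists only for testing; inside page.evaluate it defaults to document.
function promoteLazyImages(attr, doc) {
    doc = doc || document;
    var imgs = doc.querySelectorAll('img[' + attr + ']');
    var changed = 0;
    for (var i = 0; i < imgs.length; i++) {
        var lazy = imgs[i].getAttribute(attr);
        if (lazy && imgs[i].getAttribute('src') !== lazy) {
            imgs[i].setAttribute('src', lazy);
            changed += 1;
        }
    }
    return changed;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.shopclues.com/fashion.html', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        // The function is serialised and executed inside the page.
        var changed = page.evaluate(promoteLazyImages, 'data-src');
        console.log('Images rewritten: ' + changed);
        // Hypothetical delay: leave time for the new URLs to download.
        setTimeout(function () {
            page.render('lazy.png');
            phantom.exit();
        }, 5000);
    });
}
```

The helper takes the attribute name as a parameter because different sites use different names for it, so it may need adjusting per site.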

kensoh commented 7 years ago

I've been a CasperJS/PhantomJS user for a couple of years. Normally, images don't show up because they haven't finished loading. I'm not sure whether explicitly waiting before capturing the images will help.
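As a rough illustration of waiting explicitly, a script could poll until every image on the page reports that it has finished loading and only then render. This is a sketch under assumptions: the helper names `allImagesComplete` and `waitFor`, the 20-second limit, the 250 ms interval, and the output file name are made up for the example. Images that a page has not yet requested (lazy loading) also count as complete, so this alone may not be enough.

```javascript
// In-page check: true when every <img> reports that loading has finished.
// `doc` exists only for testing; inside page.evaluate it defaults to document.
function allImagesComplete(doc) {
    doc = doc || document;
    var imgs = doc.images;
    for (var i = 0; i < imgs.length; i++) {
        if (!imgs[i].complete) {
            return false;
        }
    }
    return true;
}

// Poll `check` every intervalMs; call onDone(true) as soon as it passes,
// or onDone(false) once timeoutMs has elapsed.
function waitFor(check, onDone, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var ok = check();
        if (ok || Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onDone(ok);
        }
    }, intervalMs);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('https://item.mercari.com/gl/m56999829099/', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            return page.evaluate(allImagesComplete);
        }, function (ok) {
            console.log(ok ? 'All images loaded' : 'Timed out, rendering anyway');
            page.render('waited.png');
            phantom.exit();
        }, 20000, 250);
    });
}
```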

bologer commented 7 years ago

@rkumar-c, make sure that you don't have

page.settings.loadImages = false

I once enabled this setting in a script and could not understand why images were not loading for me.

Ref: http://phantomjs.org/api/webpage/property/settings.html

JustinYi922 commented 7 years ago

It just occurred to me that I have run into a similar problem. I put only a few words in an HTML file and did not set any width or height (https://item.mercari.com/gl/m56999829099/ does not set them either; it only has 'min-height'), and the rendered picture came out at 400 * 300. That seems to be the minimum width and height in PhantomJS. @bologer, is that really the PhantomJS default? If so, @rkumar-c, you can first set the page width you need and then open the website.
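Setting the width before opening the site could look roughly like the sketch below, which assigns `page.viewportSize` ahead of `page.open`. The helper name `applyViewport`, the 1280 x 800 size, and the output file name are assumptions chosen for illustration.

```javascript
// Set the viewport before page.open so layout uses the desired width.
function applyViewport(page, width, height) {
    page.viewportSize = { width: width, height: height };
    return page;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    // ASSUMPTION: 1280x800 is an arbitrary desktop-like size.
    var page = applyViewport(require('webpage').create(), 1280, 800);
    page.open('https://item.mercari.com/gl/m56999829099/', function (status) {
        if (status === 'success') {
            page.render('mercari.png');
        }
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```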

rkumar-c commented 7 years ago

I am not getting a full-page screenshot for the following two URLs: https://shop.adidas.co.in/#c/men-basketball-shoes/Pag-60/No-0/0 http://www.lifestylestores.com/c/men-tops-tshirts

Can anybody provide sample code to capture screenshots of the above URLs?

kensoh commented 7 years ago

I'm not an expert at PhantomJS, but for a general-purpose web automation tool that I make (based on CasperJS/PhantomJS), the code is below. From the second row of images onwards, I get only the Adidas logo. For the second website, the images further down are blank, and the same happens when I use the Safari browser. I'm not sure what is wrong with the website, since I cannot see the images even in a real browser.

https://shop.adidas.co.in/#c/men-basketball-shoes/Pag-60/No-0/0
wait 10 seconds
snap page to adidas.png
http://www.lifestylestores.com/c/men-tops-tshirts
wait 10 seconds
snap page to lifestore.png

bologer commented 7 years ago

@rkumar-c, your problem with the first website is the loading time. For some reason it takes around 20 seconds to load all the images and everything else on the website.

This is what I got after 15 seconds, so I assume all images will have loaded after about 20 seconds.

Screenshot

1

Source code

var p = p || {};

p.adidas = {

    webpage:    false,
    system:     false,
    page:       false,
    url:        false,
    userAgent:  false,
    newsJSON:   false,
    newsString: false,

    init: function() {
        this.webpage    = require('webpage');
        this.fs         = require('fs');
        this.page       = this.webpage.create();
        this.url        = 'https://shop.adidas.co.in/#c/men-basketball-shoes/Pag-60/No-0/0';
        this.userAgent  = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0';
        this.timeout    = 20000;
        this.log        = '[ =========> ] ';
    },

    solve: function() {
        var self = this;
        this.page.settings.userAgent = this.userAgent;
        this.page.viewportSize = { width: 1024, height: 900 };

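        this.page.open(this.url, function(status) {

            // Stop early if the page itself failed to load.
            if (status !== 'success') {
                console.log(self.log + 'Failed to load page');
                phantom.exit(1);
                return;
            }

            console.log(self.log + 'Page loaded');

            // Give the page time to fetch its images before rendering.
            setTimeout(function() {
                // The try/catch sits inside the callback; wrapped around
                // setTimeout it would not catch errors thrown here.
                try {
                    console.log(self.log + 'Timeout finished.');

                    self.page.render('1.png');

                    console.log(self.log + 'Picture taken');

                    console.log(self.log + 'Job should be finished now');
                    phantom.exit();
                } catch(e) {
                    console.log('PhantomJS has unexpectedly stopped working. Date: ' + new Date().toUTCString());
                    phantom.exit(1);
                }
            }, self.timeout);
        });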
    }
}

p.adidas.init();
p.adidas.solve();

rkumar-c commented 7 years ago

@bologer I tried the above code, but it fails to capture all the images. The same happens with the eBay and Alibaba websites. What could be the reason for not capturing all the images? I increased the timeout to 90000, but that didn't work either. URLs to capture: https://www.ebay.com/b/Toy-Kites/2569/bn_1924212 https://www.alibaba.com/Doors-Windows_pid100006533?spm=a2700.8293689.0.0.QF5otp

bologer commented 7 years ago

@rkumar-c, alright, I will test my code a bit more and try to come up with a working solution :+1:

kensoh commented 7 years ago

Adding on, invisible browsers such as PhantomJS or Electron (through NightmareJS) behave differently from real browsers such as Chrome or Safari. And website owners often add logic to prevent invisible / automated browsers from working correctly.

This is a nice post by a friend on this topic, although I normally will not automate websites that don't want to serve automated browsers. https://franciskim.co/dont-need-no-stinking-api-web-scraping-2016-beyond

Other possible setups are Selenium + Chrome + Xvfb, or maybe SikuliX + Xvfb, to replicate exact browser behavior. Also, headless Chrome is here (headless Firefox soon too). Tools such as CasperJS, which intends to support headless Chrome, or Chromy can also be considered.

bologer commented 7 years ago

@rkumar-c, if the purpose of this thread is to get the product images from, let's say, Adidas, then the following code can be used. You don't need to wait for the images to load; once the whole DOM has loaded, you can scrape the image URLs and then save the images one by one.

By the way, I figured out why you are not seeing all of the images: they are only shown on the scroll event. Try emulating a scroll event once the page has loaded and you will see all of the images :+1:
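A minimal sketch of emulating the scroll might step `page.scrollPosition` down the page one viewport at a time, pause at each stop so lazy loaders can react, and then return to the top before rendering. The helper names `scrollOffsets` and `scrollThrough`, the 1-second pause, and the output file name are assumptions for illustration; whether a given site reacts to this kind of scrolling has to be checked case by case.

```javascript
// Compute the vertical offsets to visit, one viewport height at a time.
function scrollOffsets(totalHeight, viewportHeight) {
    if (viewportHeight <= 0) {
        return [0];
    }
    var offsets = [];
    for (var y = 0; y < totalHeight; y += viewportHeight) {
        offsets.push(y);
    }
    return offsets;
}

// Visit each offset in turn, waiting delayMs between steps, then
// scroll back to the top and call done().
function scrollThrough(page, offsets, delayMs, done) {
    var i = 0;
    (function next() {
        if (i >= offsets.length) {
            page.scrollPosition = { top: 0, left: 0 };
            done();
            return;
        }
        page.scrollPosition = { top: offsets[i], left: 0 };
        i += 1;
        setTimeout(next, delayMs);
    })();
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.viewportSize = { width: 1024, height: 900 };
    page.open('https://shop.adidas.co.in/#c/men-basketball-shoes/Pag-60/No-0/0', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        var height = page.evaluate(function () {
            return document.body.scrollHeight;
        });
        // Hypothetical 1-second pause per step for images to start loading.
        scrollThrough(page, scrollOffsets(height, 900), 1000, function () {
            page.render('scrolled.png');
            phantom.exit();
        });
    });
}
```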

If you would like to get just the product images, then the following script, which collects their URLs, should help:

var p = p || {};

p.adidas = {

    webpage:    false,
    system:     false,
    page:       false,
    url:        false,
    userAgent:  false,
    newsJSON:   false,
    newsString: false,

    init: function() {
        this.webpage    = require('webpage');
        this.fs         = require('fs');
        this.page       = this.webpage.create();
        this.url        = 'https://shop.adidas.co.in/#c/men-basketball-shoes/Pag-60/No-0/0';
        this.userAgent  = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0';
        this.timeout    = 6000;
        this.log        = '[ =========> ] ';
    },

    solve: function() {
        var self = this;
        this.page.settings.userAgent = this.userAgent;
        this.page.viewportSize = { width: 1024, height: 900 };

        this.page.open(this.url, function(status) {

            // Stop early if the page itself failed to load.
            if (status !== 'success') {
                console.log(self.log + 'Failed to load page');
                phantom.exit(1);
                return;
            }

            console.log(self.log + 'Page loaded');

            setTimeout(function() {
                // The try/catch sits inside the callback; wrapped around
                // setTimeout it would not catch errors thrown here.
                try {
                    console.log(self.log + 'Timeout finished.');

                    var imageUrls = self.page.evaluate(function() {
                        var q = document.querySelectorAll('.productListing li.card');
                        var obj = {};

                        for(var i = 0; i < q.length; i++) {
                            var nameEl = q[i].querySelector('.adidasOriginals.productIdentifier');
                            var imgEl = q[i].querySelector('.productImageWrap > img');
                            var src = imgEl ? imgEl.getAttribute('data-src') : null;

                            // Skip cards that lack a name or a lazy-load URL.
                            if (!nameEl || !src) {
                                continue;
                            }

                            obj[i] = {
                                name: nameEl.innerText.trim(),
                                src: src.trim().replace(/\.plp$/gi, '')
                            };
                        }

                        return obj;
                    });

                    self.page.render('1.png');

                    console.log(self.log + 'Picture taken');

                    console.log(self.log + 'Job should be finished now');

                    console.log(self.log + 'Images:');

                    console.log(JSON.stringify(imageUrls));

                    self.fs.write('urls.txt', JSON.stringify(imageUrls));

                    phantom.exit();
                } catch(e) {
                    console.log('PhantomJS has unexpectedly stopped working. Date: ' + new Date().toUTCString());
                    phantom.exit(1);
                }
            }, self.timeout);
        });
    }
}

p.adidas.init();
p.adidas.solve();

Though I am not sure if this is what you are looking for.

kensoh commented 7 years ago

Cool stuff! Thanks @bologer for sharing! :smile:

rkumar-c commented 7 years ago

Thanks, guys, for promptly replying to the issue with solutions. I really appreciate the solution given by @bologer; it was very helpful to me. I am closing this thread, and once again, thanks to @bologer.