danielnieto / scrapman

Retrieve real (with JavaScript executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel
MIT License
21 stars 3 forks

Javascript not rendered? #2

Closed kirchnch closed 7 years ago

kirchnch commented 7 years ago

Does scrapman fully render webpages with JavaScript? It looks like there are about 300 more links when scraping www.nytimes.com with Nightmare, even though both packages use Electron:

Scrapman.js.txt Nightmare.js.txt

Nice idea on parallel requests btw.

danielnieto commented 7 years ago

Yes it does, it fetches a webpage and, after all of the DOM has been loaded, it retrieves the HTML code... I can see that nytimes.com loads a bunch of JavaScript files that take several seconds, so I guess the difference is "when" that code is retrieved (I manually ran document.querySelectorAll("a").length while loading that page, and every time it returned more and more links). Scrapman uses the document's DOMContentLoaded event to know when the document's DOM has finished loading, not all of its resources (CSS styles, images, etc). I took this approach in favor of speed; however, in this case I think the best approach is to wait until all resources have been loaded, using window's load event.
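
For anyone reading along, the difference between the two events in an injected script looks roughly like this (a minimal sketch, not scrapman's actual code):

// Sketch only: compares when each event fires and how many links exist at that moment.
// DOMContentLoaded fires once the initial HTML has been parsed; images, styles
// and late AJAX content may still be loading at that point.
document.addEventListener("DOMContentLoaded", function () {
    console.log("DOMContentLoaded:", document.querySelectorAll("a").length, "links");
});

// window's load event fires only after all resources have finished loading.
window.addEventListener("load", function () {
    console.log("load:", document.querySelectorAll("a").length, "links");
});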

I'll test that approach and will get back to this issue later today once I have results...

kirchnch commented 7 years ago

Thanks for looking into this. It would be nice to have the option to load additional resources and to be able to run requests from multiple Electron instances at the same time. I believe that would provide scalability for projects where "all" resources from a lot of links are required, and speed compared to running scrapes in series. Looking through their code, Nightmare appears to force additional rendering by highlighting a portion of the page (see their Frame-Manager):

Nightmare's Frame-Manager:

  // Force the browser to render new content by using the debugger to
  // highlight a portion of the page. This way, we can guarantee a change
  // that both requires rendering a frame and does not actually affect
  // the content of the page.

  parent.emit('log', 'Highlighting page to trigger rendering.');
  window.webContents.debugger.sendCommand('DOM.enable')
  window.webContents.debugger.sendCommand(
    'DOM.highlightRect', HIGHLIGHT_STYLE, function(error) {
      window.webContents.debugger.sendCommand('DOM.hideHighlight');
      window.webContents.debugger.detach();
    });
danielnieto commented 7 years ago

I've been looking into the original issue, and I believe that nytimes.com fetches more links via AJAX after the page has finished loading. It is true that both scrapman and Nightmare use Electron, but we use it in slightly different ways: they load a new instance of Electron for each request, while scrapman uses a single Electron instance and a single "window" but dynamically adds webview tags (http://electron.atom.io/docs/api/web-view-tag/) to that unique window to load several pages at the same time. This is why scrapman is faster: it only boots Electron once, and adds and removes webview tags (which work very similarly to iframes) for each request inside that loaded Electron instance.
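
To give a rough idea, the webview-per-request approach looks something like this in the renderer process of that single window (a sketch of the idea, not scrapman's actual source):

// Sketch: load a URL in a dynamically created <webview> tag inside the one
// Electron window that stays open, then grab the rendered HTML and remove the tag.
function loadInWebview(url, callback) {
    const webview = document.createElement("webview");
    webview.src = url;

    webview.addEventListener("did-finish-load", function () {
        // old callback-style webview API: (code, userGesture, callback)
        webview.executeJavaScript("document.documentElement.outerHTML", false, function (html) {
            callback(html);
            webview.remove(); // free the tag so many loads can run in parallel
        });
    });

    document.body.appendChild(webview);
}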

Having said that, I do believe that you get more links from scraping with Nightmare just because it is slower than scrapman. I've tested using a timeout to retrieve the HTML of the webpage and it returns the same number of links, but if you cut that timeout in half, the number of links is also reduced. So, after the page has finished loading, it takes about 4 more seconds to load the rest of the links. Scrapman retrieves the HTML once the page has finished loading, and apparently Nightmare fetches it a little later.

I've tried to solve this by using window.onload (to wait for all resources to load as well), and by taking a different approach to retrieving the HTML, using the webview's did-finish-load event instead of my custom injected script that listens for the DOMContentLoaded event, but I got exactly the same result.

The only way I can think of to resolve this is to implement a wait configuration parameter, where you can explicitly tell scrapman to wait x milliseconds instead of giving you the HTML as soon as it's loaded.

So, that's what I'm going to implement: a wait option. I'll do it as soon as I can.
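
Conceptually the wait option just delays the HTML grab after the page signals it has loaded; a sketch of the idea (assuming a webview and a configured wait value, not the actual implementation):

// Sketch: once the page reports it has finished loading, wait `wait` ms
// before asking the webview for its rendered HTML.
function retrieveHtmlAfterWait(webview, wait, callback) {
    setTimeout(function () {
        webview.executeJavaScript("document.documentElement.outerHTML", false, callback);
    }, wait || 0);
}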

Regarding your other question, I frankly did not understand what you meant by it.

kirchnch commented 7 years ago

A timeout would be great. I recall using the timeout in scrapy-splash, but without any correlation to the number of links for this same site. Regarding the "other question", are you referring to the statement about Nightmare's Frame-Manager and highlighting the page to render? Frankly, I don't understand what they are doing, but based on their code comments in the snippet from yesterday, it seems like they're loading additional content that way. As an aside, I'm curious how Chromium, used by Electron of course, determines when a page has been completely loaded, since the spinning status icon on the page tab stops when all page content seems to have been retrieved (see green box in picture).

[screenshot: selection_004]

danielnieto commented 7 years ago

I've implemented a wait parameter in the configuration object; you can use it as follows:


const scrapman = require("scrapman");

// wait 5500 ms after the page has finished loading before retrieving the HTML
scrapman.configure({
    wait: 5500
});

scrapman.load("https://danielnieto.github.io/scrapman-test-page", function(html){
    console.log(html);
});

As you can see in this example, that page is rendered by JavaScript (Knockout.js). After it has been completely loaded and the JavaScript executed, the did-finish-load event that scrapman listens to is triggered; normally this is the point where scrapman would retrieve the rendered HTML, but with the wait option you can set a timeout. I've set up that page to have a timer which increments a value inside div#time-elapsed every second, so in this example scrapman will wait 5.5 seconds after the page has been fully loaded and the JavaScript rendered, and then it will retrieve the HTML code; if you look into div#time-elapsed you will see it has a 5 in it.
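
If you want to verify the effect yourself, you can inspect the counter in the returned HTML; something like this works (the regex is only an illustration and assumes how that test page renders the counter):

scrapman.load("https://danielnieto.github.io/scrapman-test-page", function(html){
    // assumes the counter is rendered as <div id="time-elapsed">N</div>
    var match = html.match(/id="time-elapsed"[^>]*>(\d+)/);
    console.log("seconds elapsed when the HTML was captured:", match && match[1]); // ~5 with wait: 5500
});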

You can use this feature by updating the npm package in your installation; I've released a new version, 2.3.0.


Now, to answer your other question: I've looked into the code snippet from Nightmare's Frame-Manager, and it is there to force on-screen rendering (drawing) of the page so it can be captured. As you know, Nightmare is not a scraper but more of an automated browser for testing purposes, in which case you most likely need to take screenshots of pages to visually check if everything is OK. Scrapman was born when I tried to use Nightmare for scraping hundreds of URLs and was overwhelmed by the fact that Nightmare could not handle parallel requests on the same Electron instance; the only way to kind of achieve this is to start several Nightmare instances. Imagine having 50 Chrome browsers open at the same time, including the time they need to boot, which is super laggy. I asked this question on their repo, and that is the reason why I created scrapman.

Also, the did-finish-load event gets triggered when the "spinner in the tab stops spinning", so it determines that the page is fully loaded when all images, styles, and scripts on the page are loaded and executed. But scripts can still be running in the "background", as you can see in the example page I mentioned earlier.

I see that you are using Linux? Have you come across any problems using scrapman on Linux? I never tested it on that OS, just Mac and Windows.

danielnieto commented 7 years ago

Wait.... I just tweaked my code and was able to get all of the 600+ links without the wait option. It has to do with the webview not getting "visually rendered", so either the webpage loads less stuff or Chromium thinks it has loaded before it actually has? I'm not sure, but I know how to fix it. I will come up with some configuration to disable that scrapman feature, and it will work like a charm for your use case; give me a couple of days. I'll get back to this thread once it's implemented.

kirchnch commented 7 years ago

The non-correlation with the timeout makes sense based on the results from the scrapy-splash code, which gave 300 fewer links than Nightmare regardless of setting the timeout to 10 seconds. I definitely appreciate your insight into how the browser loading works. Scrapman seems to work fine on Linux so far. It hasn't had too much exposure yet, since we have been looking at different scrapers trying to get an idea of which one is suitable for the intended use. I'll send a 5-minute snippet of what we're actually working on to your email if you're curious.

danielnieto commented 7 years ago

I researched a little bit more and it turns out that I was hiding webviews by setting visibility: hidden in their CSS. According to this paragraph in Electron's documentation:

webview has issues being hidden using the hidden attribute or using display: none;. It can cause unusual rendering behaviour within its child browserplugin object and the web page is reloaded, when the webview is un-hidden, as opposed to just becoming visible again.

That's why it had an odd behavior that I couldn't figure out; apparently this is a known issue but it hasn't been addressed yet. Either way, I just took their recommended approach to hiding the webview. Hiding it is something that has no effect whatsoever on scrapman's functionality, I did it only for debugging purposes, but it turned out to be a major issue that ultimately caused this bug. I think we can call this problem resolved, and if you're okay with it we can close this issue.
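
For reference, the approach Electron's docs describe is to shrink the webview to zero size instead of using display: none or the hidden attribute; roughly like this (a sketch, not necessarily the exact styling scrapman now uses):

// Sketch: keep the webview attached and "rendering" but invisible by collapsing
// it to 0x0, instead of display: none (which can trigger the reload behaviour quoted above).
webview.style.flex = "0 1";
webview.style.width = "0px";
webview.style.height = "0px";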

You can use the updated npm package I've just rolled out, version 2.3.1, which includes this fix. Feel free to open a new issue if you run into any other problem.

kirchnch commented 7 years ago

Thanks for fixing the issue! Unfortunately, it may not be completely without issues. Using parallel requests seems to generate link counts that are sometimes off, depending on how many requests are made. I'll post the code in a new issue if you're interested.