geoblink / web-scraper-chrome-extension

Web data extraction tool implemented as chrome extension
GNU Lesser General Public License v3.0

No scraped data returned #4

Closed · grinono closed this issue 6 years ago

grinono commented 6 years ago

I have a sitemap that works from the extension but not in headless mode, where I get an empty array back. Any idea what the reason for this could be?

// Assuming the headless build is installed as `web-scraper-headless`
// (the package name is inferred from the stack trace further down):
import webscraper from 'web-scraper-headless'

const options = {} // optional delay and pageLoadDelay
const sitemap = {
  _id: 'test',
  channel: 'test',
  sitemap: {
    _id: 'phone',
    startUrl: ['https://tweakers.net/categorie/215/smartphones/producten/'],
    selectors: [{
      id: 'title',
      type: 'SelectorText',
      selector: 'a.editionName',
      parentSelectors: ['_root'],
      multiple: true,
      regex: '',
      delay: 0
    }]
  }
}

export function startScraping (sitemap) {
  return webscraper(sitemap.sitemap, options)
    .then((scraped) => {
      console.log('data below')
      console.log(scraped)
      return 'scraping done'
    })
    .catch((reason) => {
      console.log(reason)
      return reason
    })
}

startScraping(sitemap)
grinono commented 6 years ago

The following sitemap:

{"_id":"icodrops","startUrl":["https://icodrops.com/ico-stats/"],"selectors":[{"id":"projectpage","type":"SelectorLink","selector":"div.statas a#n_color","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"website","type":"SelectorLink","selector":"div.ico-right-col > a:nth-of-type(1)","parentSelectors":["projectpage"],"multiple":false,"delay":0},{"id":"projectName","type":"SelectorText","selector":"article.post-20539 h3","parentSelectors":["projectpage"],"multiple":false,"regex":"","delay":0}]}

returns:

(node:87827) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: Cannot match against 'undefined' or 'null'.
(node:87827) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Any idea where this unhandled promise rejection comes from?

The code block being called:

webscraper(map.sitemap, options)
  .then(function (scraped) {
    console.log('data below')
    console.log(scraped)
    return 'scraping done'
  })
  .catch((reason) => {
    console.log('error found')
    console.log(reason)
    return reason
  })
grinono commented 6 years ago

Debugging points to JSDOMBrowser.js:

(node:88958) UnhandledPromiseRejectionWarning: TypeError: Cannot destructure property `$` of 'undefined' or 'null'.
    at /web-scraper-headless/extension/scripts/JSDOMBrowser.js:47:35
    at JSDOM.fromURL.then.catch.e (/web-scraper-headless/extension/scripts/JSDOMBrowser.js:24:21)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:118:7)
(node:88958) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:88958) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
furstenheim commented 6 years ago

Hi @grinono, there are two issues here: the unhandled rejection and the missing data. The first one should get fixed at the line you mention, JSDOMBrowser.js:47:35. The callback there is function (err, {$, document, window}), and that destructuring raises an error when the second parameter is null. So it should be (err, options), with the destructuring done inside the function.
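
For illustration, a minimal sketch of that fix (the function names and error handling here are made up, not the actual source):

// Before: destructuring in the parameter list throws a TypeError
// as soon as the callback is invoked with null, before the body even runs.
function loadedPage (err, { $, document, window }) {
  // never reached when the second argument is null
}

// After: take the whole object and destructure once it is known to exist.
function loadedPageFixed (err, options) {
  if (err || !options) {
    console.log('page failed to load', err)
    return
  }
  const { $, document, window } = options
  // ... continue extracting data with $, document and window
}

loadedPageFixed(new Error('navigation failed'), null) // logs instead of throwing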

The second one is harder. I'm pretty sure it is because jsdom does not execute JS. I have a task in my backlog to include Chrome Headless, but I doubt it will happen any time soon. If you want to give it a try, I have a general idea of what should be done.

grinono commented 6 years ago

Hi @furstenheim, yes, Chrome headless would be great; then we could support web apps with client-side routing. The upstream extension is no longer supported as open source, though, so it will be just the 2.X versions, unless a community contribution keeps the extension up to date. What is your opinion on this? I think the combination of client-side sitemap creation and server-side scraping is a really strong plus of this project. Could you describe the general idea of what should be done?

furstenheim commented 6 years ago

I'm not sure I understand the part about the extension. Right now the headless library is mostly the same as the non-headless one. Nevertheless, when I want to use it, I normally load it in Chrome in developer mode (you need to run the build first to generate the bundle).

As for the headless part: right now the library has two browser files, Chrome and JSDOM. The first one is used in extension mode; the second one was added to support server mode. Their task is to open a website and extract the data from it. Chrome works via a message; jsdom works directly on the fake DOM. We should add a third browser, ChromeHeadless, that would start a Chrome browser when created and navigate to each page, extracting the data. The library would decide, depending on some options in the input, which one to use (if there is no need for JS, then jsdom is the better option, since it is lighter and more portable).
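
As a rough sketch of that split (class and option names here are illustrative, not the library's actual API):

// Each "browser" exposes the same two-step contract:
// open a website, then extract data from it.
class JSDOMBrowser {
  async loadUrl (url) { /* build a fake DOM with jsdom */ }
  async extractData (selector) { /* run the selector directly on the fake DOM */ }
}

class ChromeHeadlessBrowser {
  async loadUrl (url) { /* navigate a puppeteer page */ }
  async extractData (selector) { /* evaluate the selector inside the real page */ }
}

// The library would pick one based on the input options: jsdom when the site
// needs no JS (lighter, more portable), headless Chrome otherwise.
function pickBrowser (options) {
  return options.executeJs ? new ChromeHeadlessBrowser() : new JSDOMBrowser()
}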

Opening Chrome Headless would definitely be done with puppeteer, as it is the most convenient option. There is a hard part, though, which is loading the code into Chrome Headless. When one is using normal Chrome, the code is executed in a different context, so we don't have to worry about polluting variables; for example, we can freely use an old version of jQuery while the web page uses a different version. For that it is important to create an isolated world. There is no API exposed in puppeteer to do it, but it can be done: the client in puppeteer allows using the raw debugging protocol, which with that method will return a context id. With that you can execute code in the page (first load the script, then scrape).
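
For the curious, a minimal sketch of that approach, assuming puppeteer and the raw DevTools protocol (the world name and the evaluated expressions are placeholders):

const puppeteer = require('puppeteer')

async function scrapeInIsolatedWorld (url, scraperScript) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)

  // Raw debugging protocol session for this page.
  const client = await page.target().createCDPSession()
  const { frameTree } = await client.send('Page.getFrameTree')

  // Create an isolated world: it sees the page's DOM but has its own JS
  // context, so our jQuery cannot clash with the page's variables.
  const { executionContextId } = await client.send('Page.createIsolatedWorld', {
    frameId: frameTree.frame.id,
    worldName: 'web-scraper'
  })

  // First load the script, then scrape.
  await client.send('Runtime.evaluate', {
    expression: scraperScript,
    contextId: executionContextId
  })
  const { result } = await client.send('Runtime.evaluate', {
    expression: 'document.title', // placeholder for the actual extraction call
    contextId: executionContextId,
    returnByValue: true
  })

  await browser.close()
  return result.value
}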

Hope it helps

grinono commented 6 years ago

Yes, the Chrome extension API is not available in the current Chrome headless, aka puppeteer, so all calls like chrome.tabs.query and chrome.tabs.sendMessage, and also window.* calls, cannot be reused. I have not dived too deep into the webscraper.io extension logic, but would it not be workable to extract the sitemap logic and plug it into a puppeteer script, to have a cleaner solution?

Some solution like https://github.com/segmentio/daydream, where we add one or more hooks into the puppeteer script, for example turning a window.open request into page.goto(), and then run the jQuery selectors for that page.
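
A rough sketch of such a hook with puppeteer (the reportOpen callback is made up for illustration):

const puppeteer = require('puppeteer')

async function run () {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  const openedUrls = []

  // Expose a callback so the page can report window.open calls back to Node.
  await page.exposeFunction('reportOpen', url => { openedUrls.push(url) })
  // Override window.open before any page script runs.
  await page.evaluateOnNewDocument(() => {
    window.open = url => { window.reportOpen(url); return null }
  })

  await page.goto('https://example.com')
  // Replay each captured window.open as a plain navigation.
  for (const url of openedUrls) {
    await page.goto(url)
    // ... run the jQuery selectors for this page here
  }
  await browser.close()
}

run()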

furstenheim commented 6 years ago

There is no need to use the extension API. The logic is already decoupled (it was decoupled when the jsdom interface was added).

One needs only two things: navigate to a given URL and execute some JS in the website. The first can be done with page.goto, the second with page.evaluate (but using the isolated context I mentioned, to avoid clashing variables).
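
In code, those two primitives look roughly like this (the selector is just an example):

const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com') // 1. navigate to a given URL
  // 2. execute some JS in the website and bring the result back
  const titles = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('h1')).map(el => el.textContent)
  })
  console.log(titles)
  await browser.close()
}

main()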

If you want to check the source code, the key parts for JSDOM are navigating and executing. The function that is executed is this one. It is already working in JSDOM, so, as I said, there is no need for the extension API.

grinono commented 6 years ago

I have been looking into the logic of the extension and doing some tests to figure out how to include puppeteer as a browser. I think it would be a great add-on, as I'm having issues with JSDOM; it would be nice to have a solution that works 100% of the time. Anyhow, after a dive, I keep struggling with the underlying inner workings, so writing a scraper in puppeteer was my shortcut for now. I think it would be more effective if you implemented it, with your JSDOM implementation experience. I would love to support this in any other way, as I'm convinced this could be the best server-side open-source scraping solution available.

furstenheim commented 6 years ago

@grinono Fix for the easy issue (catching errors): https://github.com/geoblink/web-scraper-chrome-extension/pull/5

Btw, I've tried your sitemap locally and it works perfectly fine: [ { title: 'Motorola Moto G6 Plus Blauw' }, { title: 'Samsung Galaxy S8 Zwart' }, ...]. Can it be that you are blocked?

furstenheim commented 6 years ago

@grinono I've added support for headless Chrome in https://github.com/geoblink/web-scraper-chrome-extension/pull/6

In case you are curious, what I referred to in the previous comments about creating a new context is the following: https://github.com/geoblink/web-scraper-chrome-extension/pull/6/files#diff-601d2aa2158866827d3b806038d571e3R55

It avoids possible issues like the one in this test: https://github.com/geoblink/web-scraper-chrome-extension/pull/6/files#diff-d036ece49e0607564327a659183a7b47R78

I'll merge the PRs in a couple of days and bump the version.

furstenheim commented 6 years ago

Fixed both issues in 1.0.6.