NikolaiT / se-scraper

Javascript scraping module based on puppeteer for many different search engines...
https://scrapeulous.com/
Apache License 2.0

Results from google maps = undefined #35

Krajstofer closed 5 years ago

Krajstofer commented 5 years ago

I started with a basic configuration and I always get the same result for my keyword. My code looks like this:

const se_scraper = require('se-scraper');

(async () => {
  let browser_config = {
    debug_level: 1,
    output_file: './maps.json',
    test_evasion: false,
    sleep_range: '[1,1]',
    block_assets: false,
    headless: false,

    google_maps_settings: {
      scrape_in_detail: false,
    }
  };

  let scrape_job = {
    search_engine: 'google_maps',
    keywords: ['fryzjer'],
    num_pages: 1,
  };

  var scraper = new se_scraper.ScrapeManager(browser_config);

  await scraper.start();

  var results = await scraper.scrape(scrape_job);

  console.dir(results, {
    depth: null,
    colors: true
  });

  await scraper.quit();
})();

And this is the output from my terminal:

[i] [se-scraper] started at [Thu, 11 Jul 2019 10:41:43 GMT] and scrapes google with 1 keywords on 1 pages each.
[i] Using startUrl: https://www.google.com/maps
[i] google scrapes keyword "fryzjer" on page 1
[i] Sleeping for 1s
[i] Scraper took 7457ms to perform 1 requests.
[i] On average ms/request: 7457ms/request
[i] Writing results to ./maps.json
{ results:
   { fryzjer:
      { '1':
         { time: 'Thu, 11 Jul 2019 10:41:51 GMT', results: undefined } } },
  html_output: undefined,
  metadata:
   { elapsed_time: '7457', ms_per_keyword: '7457', num_requests: 1 } }

Do you have any idea what I can do to get the results data?

NikolaiT commented 5 years ago

Yes, google_maps is kinda hard to scrape. I am working on the scraper, but I am overloaded with work right now.

Scraping google maps is also very different from scraping other search engines, because the process looks like this:

  1. Enter a keyword in google maps.
  2. Iterate through all results over N pages.
  3. Visit each maps profile separately and grab all data such as phone numbers, website, opening hours and especially review data.
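
The three steps above can be sketched as two nested loops. This is only an illustration and not se-scraper's actual implementation: `scrapeMaps`, `makeFakeDriver` and the `driver` methods (`search`, `countResults`, `visitProfile`, `nextPage`) are hypothetical names, and the browser interaction is hidden behind the `driver` object so the control flow can be read and run without puppeteer.

```javascript
// Hypothetical sketch of the three-step flow; none of these names exist in se-scraper.
// `driver` bundles the browser interaction so the loop structure stays visible.
async function scrapeMaps(driver, keyword, numPages) {
  const profiles = [];

  // Step 1: enter the keyword.
  await driver.search(keyword);

  // Step 2: iterate over the result pages.
  for (let page = 1; page <= numPages; page++) {
    const count = await driver.countResults();

    // Step 3: visit every profile on the current page, one at a time.
    for (let i = 0; i < count; i++) {
      profiles.push(await driver.visitProfile(i));
    }

    // Stop early when more pages were requested than exist.
    if (page < numPages && !(await driver.nextPage())) break;
  }

  return profiles;
}

// In-memory stand-in for a browser, used only to exercise the loops.
// `pages` is an array of result pages, each an array of profile names.
function makeFakeDriver(pages) {
  let current = 0;
  return {
    search: async () => { current = 0; },
    countResults: async () => pages[current].length,
    visitProfile: async (i) => ({ name: pages[current][i] }),
    nextPage: async () => {
      if (current + 1 >= pages.length) return false;
      current += 1;
      return true;
    },
  };
}
```

Calling `scrapeMaps(makeFakeDriver([['a', 'b'], ['c']]), 'fryzjer', 2)` resolves to three profile objects, one per fake result. The inner loop is what separates this flow from a plain search engine, where parsing the result page is the last step.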

Krajstofer commented 5 years ago

I checked the parse_async function in the GoogleMapsScraper class. I received the HTML data, and that's OK. But I think that the parse_async and evaluate functions don't have access to the document element.

EDIT: OK, I see that it's important to have scrape_in_detail: true in the config, but then I only get the first result out of 20.

[i] [se-scraper] started at [Thu, 11 Jul 2019 11:40:11 GMT] and scrapes google with 1 keywords on 1 pages each.
[i] Using startUrl: https://www.google.com/maps
[i] google scrapes keyword "fryzjer" on page 1
[i] Sleeping for 1s
Profiles to visit: 20
[ 'Reymonta 5, 60-791 Poznań',
  'CV2Q+RQ Poznań',
  '723 915 777',
  'Dodaj witrynę' ]
Error: Node is detached from document
    at ElementHandle._scrollIntoViewIfNeeded (/path/scraper/node_modules/puppeteer/lib/JSHandle.js:185:13)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at ElementHandle.<anonymous> (/path/scraper/node_modules/puppeteer/lib/helper.js:111:15)
    at GoogleMapsScraper.visit_profile (/path/scraper/node_modules/se-scraper/src/modules/google.js:542:23)
    at GoogleMapsScraper.parse_async (/path/scraper/node_modules/se-scraper/src/modules/google.js:523:54)
    at process._tickCallback (internal/process/next_tick.js:68:7)
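
Puppeteer raises `Node is detached from document` when an `ElementHandle` points at an element that is no longer part of the page's DOM. A likely cause here (an assumption, not confirmed in this thread) is that the 20 result handles are collected once, and opening the first profile re-renders the result list, which leaves the remaining handles stale. A common workaround is to re-query the list on every iteration and pick the element by index. The sketch below assumes a puppeteer-like `page` exposing `$$(selector)`. `visitAllByIndex` and the `visit` callback are hypothetical names, and `visit` is expected to open the profile, read it, and return to the result list.

```javascript
// Hypothetical helper, not part of se-scraper or puppeteer.
// Re-queries the result list before every visit so no stale handle is reused.
async function visitAllByIndex(page, selector, visit) {
  const collected = [];

  // Count once; the handles themselves are fetched fresh on each pass.
  const total = (await page.$$(selector)).length;

  for (let i = 0; i < total; i++) {
    const handles = await page.$$(selector); // fresh handles after any re-render
    if (i >= handles.length) break;          // list shrank; nothing left at this index
    collected.push(await visit(handles[i], i));
  }

  return collected;
}
```

The trade-off is one extra query per result, which is small next to the cost of opening each profile.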

NikolaiT commented 5 years ago

There are many issues with google maps right now. When I find time, I will implement it properly.

Right now I am wondering whether it would be better to start a new project for "location search / small business search", because the logic is different.

NikolaiT commented 5 years ago

The main problem is that scraping google maps takes two loops instead of one.

With normal search engines: loop over all keywords and all pages and parse the results. With google maps: loop over all search results, click on each result, parse the profile page, go to the next result.

We cannot parallelize this logic in se-scraper, which is why I am hesitant.