NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.6k stars · 734 forks

Why not skip storing failed pages? Has this bug been fixed before? #73

Open leadscloud opened 9 years ago

leadscloud commented 9 years ago
def store(self):
    """Store the parsed data in the sqlalchemy scoped session."""
    assert self.session, 'No database session.'

    with self.db_lock:
        serp = parse_serp(parser=self.parser, scraper=self)

        self.scraper_search.serps.append(serp)

        self.session.add(serp)
        self.session.commit()

        store_serp_result(serp)

        if serp.num_results:
            return True
        else:
            return False

If the page is access denied or broken, the result still gets stored and cached. I suggest changing it to:

def store(self):
    """Store the parsed data in the sqlalchemy scoped session."""
    assert self.session, 'No database session.'

    with self.db_lock:
        serp = parse_serp(parser=self.parser, scraper=self)

        self.scraper_search.serps.append(serp)

        if serp.num_results:
            self.session.add(serp)
            self.session.commit()
            store_serp_result(serp)
            return True
        else:
            return False
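The behavioral difference between the two versions can be modeled with a minimal, self-contained sketch. The SQLAlchemy session and the SERP object are stubbed out as plain Python here purely for illustration; `FakeSession`, `store_original`, and `store_proposed` are my names, not GoogleScraper's:

```python
class FakeSession:
    """Minimal stand-in for a SQLAlchemy scoped session (illustration only)."""
    def __init__(self):
        self.stored = []
    def add(self, obj):
        self.stored.append(obj)
    def commit(self):
        pass  # a real session would flush to the database here

def store_original(session, num_results):
    """Original behavior: every SERP is persisted, even empty ones."""
    serp = {"num_results": num_results}
    session.add(serp)
    session.commit()
    return bool(num_results)

def store_proposed(session, num_results):
    """Proposed behavior: persist only SERPs that actually contain results."""
    serp = {"num_results": num_results}
    if num_results:
        session.add(serp)
        session.commit()
        return True
    return False

# A blocked or broken page parses to zero results:
s1, s2 = FakeSession(), FakeSession()
store_original(s1, 0)   # the empty SERP is still written
store_proposed(s2, 0)   # nothing is written
print(len(s1.stored), len(s2.stored))  # 1 0
```

With the proposed version, an access-denied page leaves no row behind, which is exactly what the suggestion above is after.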
leadscloud commented 9 years ago
if not self.store():
    logger.error(
        'No results to store for keyword: "{}" in search engine: {}'.format(
            self.query, self.search_engine_name))

to:

if not self.store():
    logger.error(
        'No results to store for keyword: "{}" in search engine: {}'.format(
            self.query, self.search_engine_name))
    return
NikolaiT commented 9 years ago

Because even when we couldn't store any SERP results, we still want to know that the page was not successfully requested. In this case:

if serp.num_results:
    self.session.add(serp)
    self.session.commit()
    store_serp_result(serp)
    return True
else:
    return False

we dismiss all negative results.

But you are somewhat correct. This could still be improved...
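One possible middle ground, sketched below as a suggestion rather than the project's actual code, is to always persist the SERP so failed requests stay visible, but tag each record with a status so empty results can be filtered out later. The `Serp` class and the `status` attribute are assumptions made for this sketch:

```python
class Serp:
    """Minimal stand-in for the project's SERP model (illustration only)."""
    def __init__(self, num_results):
        self.num_results = num_results
        self.status = None

def store(session, serp):
    """Always persist the SERP so failures remain visible, but tag them.

    `session` is modeled as a plain list; in GoogleScraper it would be
    the SQLAlchemy scoped session (add + commit).
    """
    serp.status = 'successful' if serp.num_results else 'no results'
    session.append(serp)
    return bool(serp.num_results)

db = []
store(db, Serp(10))   # successful page
store(db, Serp(0))    # blocked/broken page is recorded, but tagged
print([(s.num_results, s.status) for s in db])
# [(10, 'successful'), (0, 'no results')]
```

This keeps the negative results NikolaiT wants to track while still letting the caching/export step skip pages with no results.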