NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.6k stars 734 forks

Continue last scrape error #56

Open leadscloud opened 9 years ago

leadscloud commented 9 years ago
last_modified = datetime.datetime.fromtimestamp(os.path.getmtime(last_search.keyword_file))

to

last_modified = datetime.datetime.utcfromtimestamp(os.path.getmtime(last_search.keyword_file))
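For context: os.path.getmtime() returns a POSIX timestamp, fromtimestamp() interprets it in the machine's local timezone, and utcfromtimestamp() yields a naive UTC datetime. A minimal sketch of the difference (the file path here is just a placeholder):

import datetime
import os

keyword_file = 'keywords.txt'  # placeholder for last_search.keyword_file
mtime = os.path.getmtime(keyword_file)  # POSIX timestamp (seconds since the epoch)

local_dt = datetime.datetime.fromtimestamp(mtime)   # converted to local time
utc_dt = datetime.datetime.utcfromtimestamp(mtime)  # naive datetime in UTC

# If other timestamps are stored in UTC, comparing against local_dt is off
# by the local UTC offset; utc_dt avoids that skew.
print(local_dt, utc_dt)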

When storing a SERP, I get this error:

sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back by a nested rollback() call.  To begin a new transaction, issue Session.rollback() first.

My modification:

def store(self):
        """Store the parsed data in the sqlalchemy scoped session."""
        assert self.session, 'No database session. Turning down.'

        with self.db_lock:
            serp = SearchEngineResultsPage(
                search_engine_name=self.search_engine,
                scrapemethod=self.scrapemethod,
                page_number=self.current_page,
                requested_at=self.current_request_time,
                requested_by=self.ip,
                query=self.current_keyword,
                num_results_for_keyword=self.parser.search_results['num_results'],
            )
            self.scraper_search.serps.append(serp)

            serp, parser = parse_serp(serp=serp, parser=self.parser)
            # if there are no results, skip storing
            if serp.num_results == 0:
                return False
            try:
                self.session.add(serp)
                self.session.commit()
            except:
                return False

            store_serp_result(dict_from_scraping_object(self), self.parser)
            return True
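A middle ground between a bare except and letting the commit error propagate is to roll the session back explicitly before returning, which also clears the "transaction has been rolled back" state quoted above. A sketch of that try/except, assuming the same self.session and logger as in the surrounding code:

from sqlalchemy.exc import SQLAlchemyError

try:
    self.session.add(serp)
    self.session.commit()
except SQLAlchemyError as err:
    # Put the session back into a usable state before giving up,
    # otherwise every later commit raises InvalidRequestError.
    self.session.rollback()
    logger.error('Could not store SERP: {}'.format(err))
    return False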

scraping.py: if the result page has no SERP, do not store it.

def after_search(self):
        """Store the results and parse em.

        Notify the progress queue if necessary.

        Args:
            html: The scraped html.
        """
        self.parser.parse(self.html)
        if not self.store():
            logger.error("No results for store, skip current keyword:{0}".format(self.current_keyword))
            self.search_number += 1
            return
        if self.progress_queue:
            self.progress_queue.put(1)
        self.cache_results()
        self.search_number += 1

caching.py

serp = None #get_serp_from_database(session, query, search_engine, scrapemethod)

to

serp = get_serp_from_database(session, query, search_engine, scrapemethod)

            if not serp:
                serp, parser = parse_again(fname, search_engine, scrapemethod, query)

            serp.scraper_searches.append(scraper_search)
            session.add(serp)
            # my addition: commit in batches of 200
            if num_cached % 200 == 0:
                session.commit()
NikolaiT commented 9 years ago

Very good. Implemented. :)

I especially like the % 200; it makes things faster!
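The speed-up comes from amortizing transaction overhead: with SQLite every commit is a separate disk sync, so committing once per 200 cached SERPs instead of once per row removes most of that cost. A rough sketch of the whole batching loop (the cached_files iterable is hypothetical), including the final commit for the last, partial batch:

num_cached = 0
for fname, query in cached_files:  # hypothetical: (cache file, keyword) pairs
    serp = get_serp_from_database(session, query, search_engine, scrapemethod)
    if not serp:
        serp, parser = parse_again(fname, search_engine, scrapemethod, query)
    serp.scraper_searches.append(scraper_search)
    session.add(serp)
    num_cached += 1
    if num_cached % 200 == 0:
        session.commit()  # flush a full batch of 200 SERPs
session.commit()          # commit whatever is left from the last batch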

NikolaiT commented 9 years ago

I fixed the threading issue

sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back by a nested rollback() call.  To begin a new transaction, issue Session.rollback() first.

with

engine = create_engine('sqlite:///' + db_path, echo=echo, connect_args={'check_same_thread': False})

in database.py
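check_same_thread=False lifts SQLite's default rule that a connection may only be used by the thread that created it, which is the threading issue behind the rolled-back-transaction error above when several scraper threads share one session. A minimal sketch of such a setup with a scoped session (not the exact database.py code; db_path and echo are placeholders):

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

db_path = 'google_scraper.db'  # placeholder path
engine = create_engine(
    'sqlite:///' + db_path,
    echo=False,
    connect_args={'check_same_thread': False},  # allow use from multiple threads
)
# scoped_session hands each thread its own Session bound to the shared engine.
Session = scoped_session(sessionmaker(bind=engine))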

Your idea is bad:

try:
     self.session.add(serp)
     self.session.commit()
except:
    return False

because we would NOT save any results (the bare except always catches the error and returns False).

It works now (at least for sqlite3)!

leadscloud commented 9 years ago

scraping.py

except self.requests.ConnectionError as ce:
            logger.error('Network problem occurred {}'.format(ce))
            raise ce
        except self.requests.Timeout as te:
            logger.error('Connection timeout {}'.format(te))
            raise te

The raise causes the program to stop.

except self.requests.ConnectionError as ce:
            logger.error('Network problem occurred {}'.format(ce))
            return False
        except self.requests.Timeout as te:
            logger.error('Connection timeout {}'.format(te))
            return False
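If one wants to be a bit more forgiving before giving a keyword up, a transient network error can also be retried a few times before returning False. A sketch under that assumption (request_with_retries is not a GoogleScraper function):

import logging
import time

import requests

logger = logging.getLogger(__name__)

def request_with_retries(url, params=None, timeout=5, retries=2):
    """Retry transient network errors; return the response or None.

    A None return has the same effect as returning False above: the caller
    records the keyword as missed and moves on.
    """
    for attempt in range(retries + 1):
        try:
            return requests.get(url, params=params, timeout=timeout)
        except (requests.ConnectionError, requests.Timeout) as err:
            logger.error('Network problem on attempt {}: {}'.format(attempt + 1, err))
            time.sleep(2 ** attempt)  # simple backoff: 1s, 2s, 4s
    return None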
leadscloud commented 9 years ago

If we always raise the exception, it often causes the program to stop.

NikolaiT commented 9 years ago

Fixed.

I thought about a scraping policy. What do you think?

"""
GoogleScraper should be as robust as possible.

There are several conditions that may stop the scraping process.

- All proxies are detected and we cannot request further keywords => Stop.
- No internet connection => Stop.

- If the proxy is detected by the search engine we try to get another proxy from the pool and we call switch_proxy() => continue.

- If the proxy is detected by the search engine and there is no other proxy in the pool, we wait {search_engine}_proxy_detected_timeout seconds => continue.
    + If the proxy is detected again after the waiting time, we discard the proxy for the whole scrape.
"""
leadscloud commented 9 years ago

That's good

leadscloud commented 9 years ago

I ran into a new problem. StopScrapingException is not a good solution.

For a thread scraping 5000 keywords with an unstable proxy, the proxy may be unusable only for a period in the middle of the run. Under the current rules, the thread will stop anyway.

We can change it to continue instead of stopping:

def blocking_search(self, callback, *args, **kwargs):
        """Similar transports have the same search loop layout.

        The SelScrape and HttpScrape classes have the same search loops. Just
        the transport mechanism is quite different (In HttpScrape class we replace
        the browsers functionality with our own for example).

        Args:
            callback: A callable with the search functionality.
            args: Arguments for the callback
            kwargs: Keyword arguments for the callback.
        """
        for i, self.current_keyword in enumerate(self.keywords):

            self.current_page = self.start_page_pos

            for self.current_page in range(1, self.num_pages_per_keyword + 1):

                # set the actual search code in the derived class
                try:
                    if not callback(*args, **kwargs):
                        self.missed_keywords.add(self.current_keyword)
                except StopScrapingException as e:
                    # do not leave the search when the engine detects us;
                    # just record the current keyword as missed and continue
                    logger.critical(e)
                    self.missed_keywords.add(self.keywords[i])
                    continue
NikolaiT commented 9 years ago

It's another case which I haven't programmed yet. We are right now talking about the stability of proxies. You say: if one proxy already processed 5000 requests and it suddenly stops, it's very likely that it's a temporary issue and it will continue to work. So there is no need to stop scraping.

This is correct. But the more common case is that the proxy works in the beginning (thus passing proxy_check()) and then stops working completely. So we need to keep track of the proxy behaviour in attributes of the Proxy class in database.py and react accordingly.

For example:

It's very complex to program a good strategy. It needs time.

I will need a good base strategy that the user can edit in the configuration.
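As an illustration only (these column names are invented, not the project's actual schema), the bookkeeping in the Proxy class might look roughly like this:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Proxy(Base):
    """Hypothetical reliability bookkeeping for a proxy."""
    __tablename__ = 'proxy'

    id = Column(Integer, primary_key=True)
    host = Column(String)
    port = Column(Integer)
    requests_ok = Column(Integer, default=0)       # successful requests through this proxy
    requests_failed = Column(Integer, default=0)   # timeouts / connection errors
    times_detected = Column(Integer, default=0)    # how often a search engine blocked it
    last_failure = Column(DateTime)                # when it last stopped working

    def looks_dead(self, max_failure_ratio=0.5):
        """Heuristic: give up on the proxy when most recent requests fail."""
        total = self.requests_ok + self.requests_failed
        return total >= 20 and self.requests_failed / total > max_failure_ratio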