entrepreneur-interet-general / OpenScraper

An open source webapp for scraping: towards a public service for webscraping
http://www.cis-openscraper.com/
MIT License

Scraper halts upon meeting a link with unicode characters #43

Open thibault opened 5 years ago

thibault commented 5 years ago

Hi,

I've successfully set up an OpenScraper instance. Unfortunately, the spider always stops scraping after 15 results.

After a bit of investigation, here is the problem that seems to bring the spider to a halt:

::: ERROR scrapy.core.scraper 181122 13:46:58 ::: scraper:158 -in- handle_spider_error() :::        Spider error processing <GET https://www.ademe.fr/actualites/appels-a-projets> (referer: None)
    Traceback (most recent call last):
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
        yield next(it)
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
        for x in result:
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/openscraper/OpenScraper/openscraper/scraper/masterspider.py", line 609, in parse
        log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link),follow_link) )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 45: ordinal not in range(128)

The problem is generated by lines 600 and 609.

After analysing the trace, the problem arises when the spider tries to follow this link:

https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats

So it seems OpenScraper has a problem handling links that are not pure ASCII.

JulienParis commented 5 years ago

I don't think it's the scraper per se; I would say it's the logging that causes the spider to stop with this error: the .format function can go berserk with accented characters when it's used in a log call (here in log_scrap)...

I remember I had the same issue before... so for a start I would replace:

    log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

by the lazy %-style formatting that the logging module expects:

    log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)
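
For reference, here is a self-contained sketch of that deferred %-style formatting (assuming log_scrap is a standard logging.Logger, which I'm inferring from the code):

    # -*- coding: utf-8 -*-
    import logging

    logging.basicConfig(level=logging.INFO)
    log_scrap = logging.getLogger("openscraper")  # assumption: log_scrap is a stdlib logger

    follow_link = u"https://appelsaprojets.ademe.fr/aap/H2mobilit\xe92018-82#resultats"

    # %-style arguments are interpolated only when a handler emits the record,
    # and the message is promoted to unicode when any argument is unicode.
    log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)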

Let us know if it works at those lines, so you could fix it up with a PR.

thibault commented 5 years ago

Well, the problem is that you are passing unicode variables into a binary string without an explicit encoding. Since Python 2 tries to silently convert between the two types on the fly, it works most of the time, but as soon as the string is not pure ASCII, an error will be raised.
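
A minimal reproduction of the implicit conversion (a sketch, assuming CPython 2.7; this is not OpenScraper code):

    # -*- coding: utf-8 -*-
    follow_link = u"https://appelsaprojets.ademe.fr/aap/H2mobilit\xe92018-82#resultats"

    # Implicit unicode-to-bytes conversion always uses the 'ascii' codec:
    str(follow_link)
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
    # in position 45: ordinal not in range(128)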

There are several ways to fix this.

  1. You could encode data every time you want to log it:

     log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link.encode('utf-8')))

  2. You could import unicode_literals in every file to make sure all strings are unicode and not binary:

     from __future__ import unicode_literals

     log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

  3. You could prefix every string literal with u to make sure it is unicode and not binary:

     log_scrap.info(u" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

I will publish a PR with the third solution, which allowed me to scrape the entire ademe site without errors, but you might want to check the codebase for other places where unicode and binary strings are mixed. Porting the project to Python 3 could also help, since Python 3 does not silently convert between unicode and binary strings.
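
For what it's worth, under Python 3 the same call is safe, since every str is unicode (a sketch, not OpenScraper code):

    # Python 3: str is unicode, so no implicit ascii encoding happens here
    follow_link = "https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats"
    print(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))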