entrepreneur-interet-general / OpenScraper

An open source webapp for scraping: towards a public service for webscraping
http://www.cis-openscraper.com/
MIT License

How to debug spider? #40

Open thibault opened 5 years ago

thibault commented 5 years ago

Hi @JulienParis,

I'm testing my own instance of OpenScraper.

So far, despite reading the documentation, I've been unable to get any real data out of OpenScraper.

I've defined a simple data model (one field), added a simple contributor, but when I "Crawl" the spider, the dataset stays empty.

Now, I'm not too sure where to go from here. I've tested and re-tested my XPath expressions, and although I might be wrong, everything seems fine on my end. How do I get feedback about the scraping results? How do I know what happened during the crawl and what went wrong exactly?
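One way to sanity-check an XPath-style query outside OpenScraper is to run it against a snippet of markup in a plain Python shell. The markup and field below are made up for illustration; note that the stdlib's ElementTree only supports a limited XPath subset, so for real pages Scrapy's own selectors are the better tool:

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a page you want to scrape.
html = (
    "<html><body>"
    "<div class='item'><h2>Appel a projets</h2></div>"
    "<div class='other'><h2>Ignored</h2></div>"
    "</body></html>"
)

root = ET.fromstring(html)
# Limited-XPath query: all <h2> inside <div class='item'>.
titles = [h.text for h in root.findall(".//div[@class='item']/h2")]
print(titles)  # ['Appel a projets']
```

If a query like this returns an empty list, the expression itself is the problem; if it matches here but the dataset stays empty, the issue is more likely in the crawl itself.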

JulienParis commented 5 years ago

For now, the only way to get feedback while scraping is to run it with the terminal open (for instance, running your local instance from the terminal and watching the output, or checking the log files)...

Could you share your scraper config (a screenshot) so I can get an idea of how you set up your first try?

DavidBruant commented 5 years ago

Hi @thibault, good to see you here :-) (i don't have answers to your questions, just saying hi :-) )

JulienParis commented 5 years ago

@thibault I'm also trying with my own instance but get no results from "http://www.ademe.fr/actualites/appels-a-projets"... same as you :( ... Trying to figure out what the bug is...

I tried with that :

::: INFO log_pipeline 181121 18:58:15 ::: pipelines:80 -in- __init__() :::      >>> MongodbPipeline / __init__ ...
::: INFO log_pipeline 181121 18:58:15 ::: pipelines:87 -in- __init__() :::      --- MongodbPipeline / os.getcwd() : /Users/jpy/Dropbox/_FLASK/_CIS/_POC_EIG/CIS_scrapnado/openscraper

::: INFO scrapy.middleware 181121 18:58:15 ::: middleware:53 -in- from_settings() :::       Enabled item pipelines:
    ['scraper.pipelines.MongodbPipeline']
::: INFO scrapy.core.engine 181121 18:58:15 ::: engine:256 -in- open_spider() :::       Spider opened
::: DEBUG log_pipeline 181121 18:58:15 ::: pipelines:116 -in- open_spider() :::         >>> MongodbPipeline / open_spider ...

::: INFO scrapy.extensions.logstats 181121 18:58:15 ::: logstats:48 -in- log() :::      Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
::: INFO log_scraper 181121 18:58:15 ::: masterspider:354 -in- start_requests() :::         --- GenericSpider.start_requests ...
::: INFO log_scraper 181121 18:58:15 ::: masterspider:358 -in- start_requests() :::         --- GenericSpider.start_requests / url : http://www.ademe.fr/actualites/appels-a-projets
::: INFO log_scraper 181121 18:58:15 ::: masterspider:363 -in- start_requests() :::         --- GenericSpider.start_requests / starting first Scrapy request...
::: INFO scrapy.core.engine 181121 18:58:16 ::: engine:295 -in- close_spider() :::      Closing spider (finished)
::: DEBUG log_pipeline 181121 18:58:16 ::: pipelines:137 -in- close_spider() :::        >>> MongodbPipeline / close_spider ...

Very weird indeed

Meanwhile, you can start by trying with this website to check whether it's the code or the website causing trouble:

(three screenshots: capture d'écran 2018-11-21 à 18 56 20 / 18 56 32 / 18 56 40)

JulienParis commented 5 years ago

... I added the quotestoscrap scraper and it's working fine... It must be something related to the ademe website (or the Scrapy settings, because plain requests work fine)... I tried a pure request from a Python shell:

>>> import requests
>>> r = requests.get('http://www.ademe.fr/actualites/appels-a-projets')
>>> print(r.content)

and no problem... So it's Scrapy or the website

JulienParis commented 5 years ago

@thibault I think I got it!! Something is going wrong with the Scrapy settings... I commented out line 139 in the masterspider.py file, this one --> settings.set( "RANDOMIZE_DOWNLOAD_DELAY" , RANDOMIZE_DOWNLOAD_DELAY ) And then I could scrape the ademe website.

So you could either comment out this same line on your instance, or set the RANDOMIZE_DOWNLOAD_DELAY var to False (RANDOMIZE_DOWNLOAD_DELAY = False in your settings_scrapy.py file)... Or, even better, I could add this option to the "advanced settings" as a new feature...
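For context: when RANDOMIZE_DOWNLOAD_DELAY is on, Scrapy waits DOWNLOAD_DELAY multiplied by a random factor between 0.5 and 1.5 between requests to the same site; turning it off makes the delay fixed. A toy illustration of that behaviour (not Scrapy's actual code):

```python
import random

# Toy illustration of what RANDOMIZE_DOWNLOAD_DELAY changes (this is
# not Scrapy's actual implementation): with randomization on, the wait
# between requests is DOWNLOAD_DELAY scaled by a factor in [0.5, 1.5].
def next_delay(download_delay, randomize):
    if randomize:
        return download_delay * random.uniform(0.5, 1.5)
    return download_delay

print(next_delay(2.0, False))  # 2.0
assert 1.0 <= next_delay(2.0, True) <= 3.0
```

So with DOWNLOAD_DELAY = 2 and randomization on, the actual wait swings between 1 and 3 seconds.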

JulienParis commented 5 years ago

@thibault so I added some new features in "advanced settings" with this commit : https://github.com/entrepreneur-interet-general/OpenScraper/commit/92d99089b7c01b903b3a5e005447ad6bfbc7d47f

This allows you to override the default Scrapy settings with your own advanced settings. For instance, in your case with Ademe, these settings seem to work:
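The override logic can be pictured roughly like this (a simplified sketch, not the actual code from the commit; the function name and dicts are hypothetical):

```python
# Hypothetical sketch of merging a spider's "advanced settings" over the
# instance-wide Scrapy defaults (names and structure are made up).
DEFAULT_SETTINGS = {
    "DOWNLOAD_DELAY": 2,
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # the setting at issue in this thread
}

def effective_settings(defaults, advanced):
    """Return the defaults with per-spider advanced settings merged on top."""
    merged = dict(defaults)
    merged.update(advanced)
    return merged

settings = effective_settings(DEFAULT_SETTINGS, {"RANDOMIZE_DOWNLOAD_DELAY": False})
print(settings["RANDOMIZE_DOWNLOAD_DELAY"])  # False
```

Untouched defaults pass through unchanged, so only the settings a user explicitly overrides differ per spider.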

(screenshot: capture d'écran 2018-11-21 à 20 41 55)

thibault commented 5 years ago

@JulienParis Wow, it seems I gave you work for the entire afternoon :)

Thank you for taking the time to help. I will try your solution, and will get back to you with the results.

@DavidBruant Hi ! :)