asmaier / ImmoSpider

Immospider is a crawler for the Immoscout24 website.

scrapy crawl immoscout -o apartments.csv makes an empty file #13

Closed: sajjadakram2018 closed this issue 3 years ago

sajjadakram2018 commented 3 years ago

I followed the instructions in the README. According to the "Simple scraping" step, the following command should write the list of apartments in Berlin to apartments.csv. However, the output file is empty. Did I miss something? I have copied the log of the command below for reference.

Thanks, Sajjad

$ scrapy crawl immoscout -o apartments.csv -a url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 -L INFO
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: immospider)
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (default, Sep 25 2020, 09:36:53) - [GCC 10.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 3.0, Platform Linux-5.8.0-36-generic-x86_64-with-glibc2.32
2021-01-19 13:41:20 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'immospider', 'LOG_LEVEL': 'INFO', 'LOG_STDOUT': True, 'NEWSPIDER_MODULE': 'immospider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['immospider.spiders']}
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet Password: 4a1f6f3d22013ab8
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats', 'immospider.extensions.SendMail']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled item pipelines: ['immospider.pipelines.GooglemapsPipeline', 'immospider.pipelines.DuplicatesPipeline']
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider opened
2021-01-19 13:41:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-19 13:41:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00>: HTTP status code is not handled or not allowed
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-19 13:41:20 [immospider.extensions] INFO: No new items found. No email sent.
2021-01-19 13:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 910, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 18012, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/405': 1, 'elapsed_time_seconds': 0.256602, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 624024), 'httperror/response_ignored_count': 1, 'httperror/response_ignored_status_count/405': 1, 'log_count/INFO': 12, 'memusage/max': 57483264, 'memusage/startup': 57483264, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 367422)}
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider closed (finished)

sajjadakram2018 commented 3 years ago

I should mention that I also tried it with the URL in double quotes, but the result is the same:

scrapy crawl immoscout -o apartments.csv -a url="https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00" -L INFO

BTW your tool is super cool

asmaier commented 3 years ago

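You are getting an HTTP 405 response. The reason is that Immoscout now uses captchas to protect its website against scraping. Scrapy ignores the 405 response, so the spider never receives a results page to parse and apartments.csv stays empty. At the moment there is no workaround for this issue; see also https://github.com/asmaier/ImmoSpider/issues/9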
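
For anyone hitting the same symptom, the stats that Scrapy dumps at the end of a run are usually enough to see why the output file is empty. The following is only a minimal sketch in plain Python: the helper name explain_empty_crawl is hypothetical (it is not part of Scrapy or ImmoSpider), and the dictionary is a subset of the stats from the log above. It does not work around the captcha; it only makes the cause visible.

```python
# Sketch: derive hints from a Scrapy stats dictionary about why no items were exported.
# `explain_empty_crawl` is a hypothetical helper, not part of Scrapy or ImmoSpider.

IGNORED_PREFIX = "httperror/response_ignored_status_count/"


def explain_empty_crawl(stats):
    """Return a list of human-readable hints for a run that produced no items."""
    hints = []
    # Scrapy only records "item_scraped_count" once at least one item was scraped.
    if stats.get("item_scraped_count", 0) == 0:
        hints.append("no items scraped, so the feed export stays empty")
    # HttpErrorMiddleware counts each ignored non-2xx response per status code.
    for key, count in sorted(stats.items()):
        if key.startswith(IGNORED_PREFIX):
            status = key[len(IGNORED_PREFIX):]
            hints.append(f"{count} response(s) ignored with HTTP status {status}")
    return hints


# Subset of the stats from the log in this issue.
stats = {
    "downloader/response_status_count/200": 1,
    "downloader/response_status_count/405": 1,
    "httperror/response_ignored_count": 1,
    "httperror/response_ignored_status_count/405": 1,
    "robotstxt/response_status_count/200": 1,
}

for hint in explain_empty_crawl(stats):
    print(hint)
```

Here the only 200 response is the robots.txt request, and the single search request was answered with 405, which matches the "Ignoring response <405 ...>" line in the log.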