Open maltokyo opened 4 years ago
Hello @maltokyo, I believe your issue should be fixed by the most recent pull request. The FormRequest is no longer yielded as a list instance, so you should be able to log in as long as you provide the proper credentials and login URL.
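For anyone following along, here is a rough sketch of the difference between yielding a request wrapped in a list and yielding the request itself. It uses only the standard library: `FakeFormRequest`, `parse_login_buggy`, `parse_login_fixed`, `schedulable`, and the example.com URL are hypothetical stand-ins for illustration, not Scrapy's real classes and not the code from the pull request.

```python
from dataclasses import dataclass, field


@dataclass
class FakeFormRequest:
    """Hypothetical stand-in for a form request; NOT scrapy.FormRequest."""
    url: str
    formdata: dict = field(default_factory=dict)


def parse_login_buggy(login_url, username, password):
    # Yields a list that wraps the request, so the consumer receives a list.
    yield [FakeFormRequest(login_url, {"username": username, "password": password})]


def parse_login_fixed(login_url, username, password):
    # Yields the request object itself.
    yield FakeFormRequest(login_url, {"username": username, "password": password})


def schedulable(outputs):
    """Keep only outputs that are request objects (a simplified type check)."""
    return [o for o in outputs if isinstance(o, FakeFormRequest)]


# Placeholder URL and credentials, assumed for the example only.
url = "https://example.com/ucp.php?mode=login"
buggy = schedulable(parse_login_buggy(url, "user", "pass"))
fixed = schedulable(parse_login_fixed(url, "user", "pass"))
print(len(buggy), len(fixed))
```

In this simplified model the list-wrapped request is never recognised as a request, while the directly yielded one is.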
Thank you so much. I'll give it a try!
Hello @Dascienz, apologies, it took me some time to get back to this.
I still get an empty CSV file when running the latest version. Can you see what I am doing wrong from this log?
$ scrapy crawl phpBB -o posts.csv
2020-09-27 11:24:28 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: phpBB_scraper)
2020-09-27 11:24:28 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Jul 21 2020, 10:48:26) - [Clang 11.0.3 (clang-1103.0.32.62)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.6-x86_64-i386-64bit
2020-09-27 11:24:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-09-27 11:24:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'phpBB_scraper',
'DOWNLOAD_DELAY': 3.0,
'EDITOR': 'vim',
'NEWSPIDER_MODULE': 'phpBB_scraper.spiders',
'SPIDER_MODULES': ['phpBB_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/58.0.3029.110 Safari/537.36 '
'OPR/45.0.2552.888'}
2020-09-27 11:24:28 [scrapy.extensions.telnet] INFO: Telnet Password: REMOVED
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-27 11:24:28 [scrapy.core.engine] INFO: Spider opened
2020-09-27 11:24:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-27 11:24:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-27 11:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/> (referer: None)
2020-09-27 11:24:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/viewforum.php?f=1> (referer: None)
2020-09-27 11:24:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://subdomain.mydomain.com/index.php?sid=2dae8c8049be51a302118dda894ebffb> from <POST https://subdomain.mydomain.com/ucp.php?mode=login&sid=99654ccddd31f466435b89497e24c349>
2020-09-27 11:24:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://subdomain.mydomain.com/viewforum.php?f=1&sid=4ea697bbb6ad2c6c1ef24604c611c0ed> from <POST https://subdomain.mydomain.com/ucp.php?mode=login&sid=dbb50977787865c6b4a60ce75d34235e>
2020-09-27 11:24:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/index.php?sid=2dae8c8049be51a302118dda894ebffb> (referer: https://subdomain.mydomain.com/)
2020-09-27 11:24:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/viewforum.php?f=1&sid=4ea697bbb6ad2c6c1ef24604c611c0ed> (referer: https://subdomain.mydomain.com/viewforum.php?f=1)
2020-09-27 11:24:49 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-27 11:24:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3456,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 15247,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 2,
'elapsed_time_seconds': 20.191649,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 27, 9, 24, 49, 147844),
'log_count/DEBUG': 6,
'log_count/INFO': 10,
'memusage/max': 50814976,
'memusage/startup': 50814976,
'request_depth_max': 1,
'response_received_count': 4,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2020, 9, 27, 9, 24, 28, 956195)}
2020-09-27 11:24:49 [scrapy.core.engine] INFO: Spider closed (finished)
@maltokyo Is there any chance you can link to an example forum with similar results? It's hard to tell what the issue is from the logs alone, but it looks like the crawler isn't parsing any pages.
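One thing the stats dump does show: Scrapy's core stats only record `item_scraped_count` once at least one item has been scraped, and that key is missing from the log. Below is a minimal sketch of that check, using a few values copied from the log; the `diagnose` helper and its wording are my own illustration, not part of the scraper.

```python
def diagnose(stats):
    """Return a list of plain-text observations about a Scrapy stats dict."""
    notes = []
    # The key is absent until at least one item has been scraped.
    if stats.get("item_scraped_count", 0) == 0:
        notes.append("no items were scraped")
    posts = stats.get("downloader/request_method_count/POST", 0)
    redirects = stats.get("downloader/response_status_count/302", 0)
    if posts and redirects >= posts:
        notes.append("every POST was answered with a 302 redirect")
    if stats.get("request_depth_max", 0) <= 1:
        notes.append("the crawl never went deeper than depth 1")
    return notes


# Values copied from the stats dump in the log.
stats = {
    "downloader/request_count": 6,
    "downloader/request_method_count/GET": 4,
    "downloader/request_method_count/POST": 2,
    "downloader/response_status_count/200": 4,
    "downloader/response_status_count/302": 2,
    "request_depth_max": 1,
    "response_received_count": 4,
}

notes = diagnose(stats)
for note in notes:
    print(note)
```

This only narrows things down: the pages were fetched, but nothing was extracted from them and no further links were followed. It does not say why, which is why an example forum would help.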
Sure, I don't mind sending one. Could I share it with you by email?
@Dascienz please let me know how I can privately share with you, thank you!
Thanks for sharing this cool script! I am trying to log in to my own phpBB forum and scrape everything as text so I can migrate it to BookStack; well, that's the plan. But I got this error (in the title). Please advise.
phpBB is the latest version, running over HTTPS.
phpBB.py (all my details removed, but I confirm I could log in properly with them; is the format wrong?):
Full log: