Dascienz / phpBB-forum-scraper

Python-based web crawlers for scraping phpBB forum posts.

ERROR: Spider must return Request, BaseItem, dict or None, got 'list' #5

Open maltokyo opened 4 years ago

maltokyo commented 4 years ago

Thanks for sharing this cool script! I am trying to log in to my own phpBB forum and scrape everything as text to migrate to BookStack; well, that's the plan. But I get the error in the title. Please advise.

phpBB is the latest version, running over HTTPS.

phpBB.py (all my details removed, but I confirm I could log in properly with them; wrong format?):

# -*- coding: utf-8 -*-
import re
import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request

class PhpbbSpider(scrapy.Spider):

    name = 'phpBB'
    # Domain only, no urls
    allowed_domains = ['kagi.mydomain.com']
    start_urls = ['https://kagi.mydomain.com']
    username = 'MYUSERHERE'
    password = 'REMOVED'
    # False if you don't need to login, True if you do.
    form_login = True

    def parse(self, response):
        # LOGIN TO PHPBB BOARD AND CALL AFTER_LOGIN
        if self.form_login:
            formdata = {'username': self.username, 'password': self.password}
            form_request = [
                scrapy.FormRequest.from_response(
                    response,
                    formdata=formdata,
                    callback=self.after_login,
                    dont_click=True
                )
            ]
            yield form_request
            return
        else:
            # REQUEST SUB-FORUM TITLE LINKS
            links = response.xpath('//a[@class="forumtitle"]/@href').extract()
            for link in links:
                yield scrapy.Request(response.urljoin(link), callback=self.parse_topics)

    def after_login(self, response):
        # CHECK LOGIN SUCCESS BEFORE MAKING REQUESTS
        if b'authentication failed' in response.body:
            self.logger.error('Login failed.')
            return
        else:
            # REQUEST SUB-FORUM TITLE LINKS
            links = response.xpath('//a[@class="forumtitle"]/@href').extract()
            for link in links:
                yield scrapy.Request(response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        # REQUEST TOPIC TITLE LINKS
        links = response.xpath('//a[@class="topictitle"]/@href').extract()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_posts)

        # IF NEXT PAGE EXISTS, FOLLOW
        next_link = response.xpath('//li[@class="next"]//a[@rel="next"]/@href').extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse_topics)   

    def clean_quote(self, string):
        # CLEAN HTML TAGS FROM POST TEXT, MARK QUOTES
        soup = BeautifulSoup(string, 'lxml')
        block_quotes = soup.find_all('blockquote')
        for i, quote in enumerate(block_quotes):
            block_quotes[i] = '<quote-%s>=%s' % (str(i + 1), quote.get_text())
        return ''.join(block_quotes)

    def clean_text(self, string):
        # CLEAN HTML TAGS FROM POST TEXT, MARK REPLIES TO QUOTES
        tags = ['blockquote']
        soup = BeautifulSoup(string, 'lxml')
        for tag in tags:
            for i, item in enumerate(soup.find_all(tag)):
                item.replaceWith('<reply-%s>=' % str(i + 1))
        return re.sub(r' +', r' ', soup.get_text())

    def parse_posts(self, response):
        # COLLECT FORUM POST DATA
        usernames = response.xpath('//p[@class="author"]//a[@class="username"]//text()').extract()
        post_counts = response.xpath('//dd[@class="profile-posts"]//a/text()').extract()
        post_times = response.xpath('//p[@class="author"]/text()').extract()
        post_texts = response.xpath('//div[@class="postbody"]//div[@class="content"]').extract()
        post_quotes = [self.clean_quote(s) for s in post_texts]
        post_texts = [self.clean_text(s) for s in post_texts]

        # YIELD POST DATA
        for i in range(len(usernames)):
            yield {
                'Username': usernames[i],
                'PostCount': post_counts[i],
                'PostTime': post_times[i],
                'PostText': post_texts[i],
                'QuoteText': post_quotes[i]
            }

        # CLICK THROUGH NEXT PAGE
        next_link = response.xpath('//li[@class="next"]//a[@rel="next"]/@href').extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse_posts)

Full log:

scrapy crawl phpBB -o posts.csv
2020-04-06 21:53:49 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: phpBB_scraper)
2020-04-06 21:53:49 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Dec 30 2019, 19:38:26) - [Clang 11.0.0 (clang-1100.0.33.16)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.9, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-04-06 21:53:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-06 21:53:49 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'phpBB_scraper',
 'DOWNLOAD_DELAY': 1.0,
 'FEED_FORMAT': 'csv',
 'FEED_URI': 'posts.csv',
 'NEWSPIDER_MODULE': 'phpBB_scraper.spiders',
 'SPIDER_MODULES': ['phpBB_scraper.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/58.0.3029.110 Safari/537.36 '
               'OPR/45.0.2552.888'}
2020-04-06 21:53:49 [scrapy.extensions.telnet] INFO: Telnet Password: 9c2a836e855b2b8b
2020-04-06 21:53:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-04-06 21:53:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-06 21:53:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-06 21:53:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-06 21:53:49 [scrapy.core.engine] INFO: Spider opened
2020-04-06 21:53:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-06 21:53:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-06 21:53:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kagi.mydomain.com> (referer: None)
2020-04-06 21:53:49 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET https://kagi.mydomain.com>
2020-04-06 21:53:49 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-06 21:53:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 305,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3620,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.482378,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 6, 19, 53, 49, 766382),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'memusage/max': 55996416,
 'memusage/startup': 55992320,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 4, 6, 19, 53, 49, 284004)}
2020-04-06 21:53:49 [scrapy.core.engine] INFO: Spider closed (finished)
Dascienz commented 4 years ago

Hello @maltokyo, I believe your issue should be fixed with the most recent pull request. The FormRequest will no longer be yielded as a list instance, and you should be able to log in as long as you provide the proper credentials and login URL.
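
For reference, a minimal sketch of that change (assuming the fix is simply to yield the FormRequest directly rather than wrapping it in a list) would look like:

    def parse(self, response):
        # LOGIN TO PHPBB BOARD AND CALL AFTER_LOGIN
        if self.form_login:
            formdata = {'username': self.username, 'password': self.password}
            # Yield the FormRequest itself, not a list containing it.
            yield scrapy.FormRequest.from_response(
                response,
                formdata=formdata,
                callback=self.after_login,
                dont_click=True
            )
        else:
            # REQUEST SUB-FORUM TITLE LINKS
            links = response.xpath('//a[@class="forumtitle"]/@href').extract()
            for link in links:
                yield scrapy.Request(response.urljoin(link), callback=self.parse_topics)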

maltokyo commented 4 years ago

Thank you so much. I'll give it a try!

maltokyo commented 3 years ago

Hello @Dascienz - it took me some time to get back to this, apologies.

I still get an empty CSV file when running the latest version; not sure if you can see what I am doing wrong from this log?

$ scrapy crawl phpBB -o posts.csv

2020-09-27 11:24:28 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: phpBB_scraper)
2020-09-27 11:24:28 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Jul 21 2020, 10:48:26) - [Clang 11.0.3 (clang-1103.0.32.62)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.6-x86_64-i386-64bit
2020-09-27 11:24:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-09-27 11:24:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'phpBB_scraper',
 'DOWNLOAD_DELAY': 3.0,
 'EDITOR': 'vim',
 'NEWSPIDER_MODULE': 'phpBB_scraper.spiders',
 'SPIDER_MODULES': ['phpBB_scraper.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/58.0.3029.110 Safari/537.36 '
               'OPR/45.0.2552.888'}
2020-09-27 11:24:28 [scrapy.extensions.telnet] INFO: Telnet Password: REMOVED
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-27 11:24:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-27 11:24:28 [scrapy.core.engine] INFO: Spider opened
2020-09-27 11:24:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-27 11:24:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-27 11:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/> (referer: None)
2020-09-27 11:24:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/viewforum.php?f=1> (referer: None)
2020-09-27 11:24:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://subdomain.mydomain.com/index.php?sid=2dae8c8049be51a302118dda894ebffb> from <POST https://subdomain.mydomain.com/ucp.php?mode=login&sid=99654ccddd31f466435b89497e24c349>
2020-09-27 11:24:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://subdomain.mydomain.com/viewforum.php?f=1&sid=4ea697bbb6ad2c6c1ef24604c611c0ed> from <POST https://subdomain.mydomain.com/ucp.php?mode=login&sid=dbb50977787865c6b4a60ce75d34235e>
2020-09-27 11:24:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/index.php?sid=2dae8c8049be51a302118dda894ebffb> (referer: https://subdomain.mydomain.com/)
2020-09-27 11:24:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://subdomain.mydomain.com/viewforum.php?f=1&sid=4ea697bbb6ad2c6c1ef24604c611c0ed> (referer: https://subdomain.mydomain.com/viewforum.php?f=1)
2020-09-27 11:24:49 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-27 11:24:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3456,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 15247,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 4,
 'downloader/response_status_count/302': 2,
 'elapsed_time_seconds': 20.191649,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 27, 9, 24, 49, 147844),
 'log_count/DEBUG': 6,
 'log_count/INFO': 10,
 'memusage/max': 50814976,
 'memusage/startup': 50814976,
 'request_depth_max': 1,
 'response_received_count': 4,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2020, 9, 27, 9, 24, 28, 956195)}
2020-09-27 11:24:49 [scrapy.core.engine] INFO: Spider closed (finished)
Dascienz commented 3 years ago

@maltokyo Is there any chance you can link to an example forum with similar results? It's hard to tell what the issue is from the logs alone, but it looks like the crawler isn't parsing any pages.
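
A quick way to narrow that down (a suggestion only, assuming the board uses the default prosilver theme whose markup the spider's selectors target, and reusing the viewforum URL from the log above) is to run the selectors in scrapy shell and see whether they match anything:

scrapy crawl is not needed for this; just start an interactive shell against one page:

scrapy shell 'https://subdomain.mydomain.com/viewforum.php?f=1'
>>> # Should list sub-forum and topic links if the markup matches the spider's XPath.
>>> response.xpath('//a[@class="forumtitle"]/@href').extract()
>>> response.xpath('//a[@class="topictitle"]/@href').extract()

If both come back empty, the theme's HTML differs from what the spider expects and the XPath expressions would need adjusting.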

maltokyo commented 3 years ago

Sure! I don't mind sending it. Could I please share it with you by email?

maltokyo commented 3 years ago

@Dascienz please let me know how I can share it with you privately, thank you!