Finally, a crawler I can actually understand... even though it has bugs

anmingyu11 commented 4 years ago

        posts = selector.xpath('//div[@class="articleh normal_post"]')  # + selector.xpath('//div[@class="articleh odd"]')

        for index, post in enumerate(posts):
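            link = post.xpath('span[@class="l3 a3"]/a/@href').extract()
            if link:
                # turn the relative href into an absolute URL
                if link[0].startswith('/'):
                    link = "http://guba.eastmoney.com/" + link[0][1:]
                else:
                    link = "http://guba.eastmoney.com/" + link[0]

                # skip posts that were already crawled
                if link in self._existed_urls:
                    continue
            else:
                # no href found: keep a plain None instead of an empty list
                link = None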

            # drop set-top or ad post
            # (named post_type so the builtin `type` is not shadowed)
            post_type = post.xpath('span[@class="l3 a3"]/em/@class').extract()
            if post_type:
                post_type = post_type[0]
                if post_type in ('ad', 'settop', 'hinfo'):
                    continue
            else:
                post_type = 'normal'

            read_count = post.xpath('span[@class="l1 a1"]/text()').extract()
            comment_count = post.xpath('span[@class="l2 a2"]/text()').extract()
            username = post.xpath('span[@class="l4 a4"]/a/font/text()').extract()
            updated_time = post.xpath('span[@class="l5 a5"]/text()').extract()
            print('read_count:', read_count)
            print('comment_count:', comment_count)
            print('username:', username)
            print('updated_time:', updated_time)
            # skip rows where any of the fields failed to extract
            if not read_count or not comment_count or not username or not updated_time:
                print('skip: missing field')
                continue

            item = PostItem()
            item['stock_id'] = stock_id
            item['read_count'] = int(read_count[0])
            item['comment_count'] = int(comment_count[0])
            item['username'] = username[0].strip('\r\n').strip()
            item['updated_time'] = updated_time[0]
            item['url'] = link

            if link:
                yield Request(url=link, meta={'item': item, 'PhantomJS': True}, callback=self.parse_post)

        if page < self.total_pages:
            stock_id = self.stock_id
            request = Request(LIST_URL.format(stock_id=self.stock_id, page=page + 1))
            request.meta['stock_id'] = stock_id
            request.meta['page'] = page + 1
            yield request
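
A side note on the link handling in the snippet above: the two branches that prepend `http://guba.eastmoney.com/` could probably be collapsed with the standard library's `urljoin`. A minimal sketch, where `absolute_post_url` and the sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "http://guba.eastmoney.com/"  # same prefix as in the snippet above


def absolute_post_url(href):
    """Resolve a post href (with or without a leading slash) against BASE_URL."""
    return urljoin(BASE_URL, href)


# Hypothetical hrefs, only to show that both forms resolve to the same URL.
print(absolute_post_url("/news,000001,1.html"))
print(absolute_post_url("news,000001,1.html"))
```

Unlike the manual concatenation, `urljoin` also returns an href that is already absolute unchanged instead of gluing it onto the base.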

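The tags on the Eastmoney Guba pages have changed, and the LIST_URL you use also has a problem: as far as I can tell, only the SSE Composite Index (上证指数) uses the LIST_URL format written here. I tried the CSI 300 (沪深300), and its list URL is different, so it needs special handling.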
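
One way to do that special handling might be a per-board lookup with a default. This is only a sketch: both templates and the `EXAMPLE_BOARD_ID` key below are placeholders, not the real Guba URL formats, and `build_list_url` is a made-up helper name.

```python
# Placeholder templates: these are NOT the real Guba URL formats.
DEFAULT_LIST_URL = "http://guba.eastmoney.com/DEFAULT_PATH/{stock_id}/{page}"

# Boards whose list URL differs from the default, keyed by stock_id (hypothetical key).
LIST_URL_OVERRIDES = {
    "EXAMPLE_BOARD_ID": "http://guba.eastmoney.com/OTHER_PATH/{stock_id}/{page}",
}


def build_list_url(stock_id, page):
    """Pick the board-specific template if one is registered, else the default."""
    template = LIST_URL_OVERRIDES.get(stock_id, DEFAULT_LIST_URL)
    return template.format(stock_id=stock_id, page=page)
```

The spider would then call `build_list_url(self.stock_id, page + 1)` where it currently calls `LIST_URL.format(...)`.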

ZHANGM41 commented 4 years ago

Hi, does this have a bug when crawling an individual stock's Guba board? I tried it, and nothing gets saved to the database ...

anmingyu11 commented 4 years ago

@ZHANGM41 It can't crawl right now. You have to modify it, because the Guba page structure has changed.
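
When the structure changes, the row XPath matches nothing and the spider yields no items without raising any error, which would look like "nothing gets saved". A small sketch of making that visible; `select_post_rows` is a made-up helper, `selector` is assumed to be anything with an `.xpath()` method that returns a list (as Scrapy selectors do), and the two XPaths are the ones from the snippet earlier in this thread, which may already be out of date:

```python
import logging

logger = logging.getLogger(__name__)

# Candidate row XPaths, copied from the snippet earlier in this thread.
ROW_XPATHS = [
    '//div[@class="articleh normal_post"]',
    '//div[@class="articleh odd"]',
]


def select_post_rows(selector, xpaths=ROW_XPATHS):
    """Collect rows from every candidate XPath and warn when none of them match."""
    rows = []
    for xpath in xpaths:
        rows.extend(selector.xpath(xpath))
    if not rows:
        # Zero rows on a list page most likely means the markup changed.
        logger.warning("no post rows matched %s; page structure may have changed", xpaths)
    return rows
```

Adding a new candidate XPath to `ROW_XPATHS` is then the only change needed when the row class is renamed.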

ZHANGM41 commented 4 years ago

Thanks! After modifying it, I found there seems to be anti-crawling protection: after crawling for a while, the requests get redirected to an unrelated page... Can this be solved by switching IPs?
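
A redirect like that can be detected by comparing the URL that was requested with the URL that actually came back. A minimal sketch using only the standard library; `was_redirected_away` is a made-up name, and treating any change of host or path as a redirect is an assumption that may be too strict for some pages:

```python
from urllib.parse import urlsplit


def was_redirected_away(requested_url, final_url):
    """Return True when the final URL has a different host or path than the requested one."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    # Query strings and fragments are ignored on purpose.
    return (requested.netloc, requested.path) != (final.netloc, final.path)
```

In a Scrapy callback the final URL is `response.url`, and when the redirect middleware is enabled the originally requested URL should be the first entry of `response.meta.get('redirect_urls', [response.url])`. Responses flagged this way could be dropped or retried instead of parsed.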

anmingyu11 commented 4 years ago

@ZHANGM41 You're right, there is anti-crawling protection. I crawled through a paid IP proxy service; crawling from a single machine does not work.
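
For reference, a rough sketch of how a proxy pool can be plugged into Scrapy: a downloader middleware that sets `request.meta['proxy']`, which is the key Scrapy's built-in `HttpProxyMiddleware` reads. The class name and the `PROXY_POOL` setting are made up for this example, and this is not necessarily how the paid service mentioned above was wired in.

```python
import itertools


class RotatingProxyMiddleware:
    """Assign the next proxy from a fixed pool to every outgoing request."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool is empty")
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is a hypothetical setting holding a list of proxy URLs.
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware picks the proxy up from request.meta['proxy'].
        request.meta["proxy"] = next(self._pool)
        return None  # continue normal processing of the request
```

It would be enabled through `DOWNLOADER_MIDDLEWARES` with a priority that runs before `HttpProxyMiddleware`, and `PROXY_POOL` would hold URLs such as `http://user:pass@proxy1.example.com:8000` (a placeholder address).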

ZHANGM41 commented 4 years ago

OK, thank you very much!

shizhu13 commented 3 years ago

What is the environment setup? Could you share it? I hope everyone can add each other on WeChat to discuss. I'll create a group dedicated to discussing crawlers.

shizhu13 commented 3 years ago

I hope everyone can add each other on WeChat to discuss. I'll create a group dedicated to discussing crawlers. My WeChat: 876983033

c976237222 commented 1 year ago

Hi, did you get this problem solved? Is it still OK to add you on WeChat?

c976237222 commented 1 year ago

> OK, thank you very much!

Hi, do you still have working code?

algosenses / EastMoneySpider

Finally, a crawler I can actually understand... even though it has bugs #2