Open issues - Githubissues

huzhx commented 10 years ago

To clarify at Tuesday's OH:
1) The amount of data required: tens of thousands
2) The Lucene user interface
3) The structure of the output

To finish before Tuesday:
1) Have Scrapy import robots.txt (http://doc.scrapy.org/en/latest/topics/spiders.html)
2) Filter out images. In the parse function, add:

    sel.xpath(
        '//a[substring(@href, string-length(@href)-3)!=".png"]'
        '[substring(@href, string-length(@href)-3)!=".jpg"]'
        # ".jpeg" has five characters, so its offset is 4 rather than 3
        '[substring(@href, string-length(@href)-4)!=".jpeg"]'
        '[substring(@href, string-length(@href)-3)!=".gif"]'
        '[substring(@href, string-length(@href)-3)!=".bmp"]'
        '/@href'
    ).extract()
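The XPath predicates in this comment compare the tail of each href with a file extension. In XPath 1.0, substring(s, string-length(s) - n) returns the last n + 1 characters of s, so a four-character suffix such as ".png" needs n = 3 and a five-character suffix such as ".jpeg" needs n = 4. The sketch below emulates that behaviour in plain Python to make the offsets easy to check. The helper names are hypothetical and are not part of Scrapy or any XPath library.

```python
def xpath_substring_from(s, start):
    """Emulate XPath 1.0 substring(s, start) for an integer start.

    XPath positions are 1-based, and a start below 1 behaves like 1.
    """
    return s[max(start, 1) - 1:]


def href_tail(href, offset):
    """Return what substring(@href, string-length(@href) - offset) would yield."""
    return xpath_substring_from(href, len(href) - offset)


# Offset 3 yields the last four characters, offset 4 the last five.
png_tail = href_tail("images/Ahri.png", 3)
jpeg_tail_short = href_tail("images/Ahri.jpeg", 3)
jpeg_tail_full = href_tail("images/Ahri.jpeg", 4)
```

Here png_tail is ".png", jpeg_tail_short is "jpeg" (no leading dot, so it can never equal ".jpeg"), and jpeg_tail_full is ".jpeg".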

huzhx commented 10 years ago

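The way I import the sitemap is to add SitemapSpider to the class's base classes and put the specific sitemap in sitemap_follow, for example:

    # Import paths assume the Scrapy 0.x module layout in use at the time
    from scrapy.contrib.spiders import CrawlSpider, Rule, SitemapSpider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class Wikiachamps4(SitemapSpider, CrawlSpider):
        name = 'links4'
        # Start from the sitemap index, but only follow the listed sub-sitemap
        sitemap_urls = ['http://leagueoflegends.wikia.com/sitemap-index.xml']
        sitemap_follow = ['/sitemap-leagueoflegends-NS_0-0']
        # Follow wiki pages and hand each response to parse_torrent
        rules = [Rule(SgmlLinkExtractor(allow=(r"http://leagueoflegends.wikia.com/wiki/*")), 'parse_torrent', follow=True)]

I gave it a quick try and it seems to work.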

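Next, I plan to filter out images with xpath. My parse function is:

    # Import paths assume the Scrapy 0.x module layout in use at the time;
    # TorrentItem comes from the project's own item definitions (not shown here)
    from scrapy.selector import Selector
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc

    def parse_torrent(self, response):
        sel = Selector(response)
        # Keep only hrefs that do not end in a known image extension
        all_links = sel.xpath(
            '//a[substring(@href, string-length(@href)-3)!=".png"]'
            '[substring(@href, string-length(@href)-3)!=".jpg"]'
            # ".jpeg" has five characters, so its offset is 4 rather than 3
            '[substring(@href, string-length(@href)-4)!=".jpeg"]'
            '[substring(@href, string-length(@href)-3)!=".gif"]'
            '[substring(@href, string-length(@href)-3)!=".bmp"]'
            '/@href'
        ).extract()
        # Resolve relative hrefs against the page's base URL
        abs_links = [urljoin_rfc(get_base_url(response), x) for x in all_links]
        for link in abs_links:
            # Create a fresh item per link instead of mutating one shared item
            torrent = TorrentItem()
            torrent['url'] = link
            yield torrent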

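But the results still contained images, so I removed the sitemap and used only CrawlSpider, and the results still contained images. The xpath approach does not seem to take effect, even though it worked well when I tested it in scrapy shell.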
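One possible gap, which is an assumption and has not been checked against the actual crawl output, is that a plain suffix comparison misses upper-case extensions such as ".PNG" and URLs that carry a query string after the file name. As a different technique from the xpath predicates, the links could also be filtered in Python after extraction. The sketch below uses only the Python 3 standard library; is_image_url and drop_image_links are hypothetical helper names, and the example URLs are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Extensions treated as images; this mirrors the list used in the xpath filter.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def is_image_url(url):
    """Return True if the URL's path ends in a known image extension.

    Only the path component is inspected, so query strings and fragments
    are ignored, and the comparison is case-insensitive.
    """
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


def drop_image_links(links):
    """Return the links that do not point at image files, in the original order."""
    return [link for link in links if not is_image_url(link)]


# Hypothetical sample links, not taken from the real crawl.
sample_links = [
    "http://leagueoflegends.wikia.com/wiki/Ahri",
    "http://example.com/images/Ahri.PNG?cb=20140101",
    "http://example.com/images/Ahri.jpeg",
]
kept_links = drop_image_links(sample_links)
```

With these sample links, kept_links contains only the first URL. In the parse function, the same call could be applied to abs_links before the items are yielded.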

huzhx commented 10 years ago

Problems:

The sitemap results are mediocre
Images are not filtered out

UmichEECS499ProjectTeam / LOL_Crawler

Open issues #3