MMihut / webcrawler

Apache License 2.0

CODE - command-line application in Python #14

Open MMihut opened 9 years ago

MMihut commented 9 years ago

Write or reuse code for a command-line application that implements a web crawler.
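
A minimal command-line skeleton for this could look like the sketch below. It is only a sketch: the `crawl` stub, the `--max-pages` flag and all the names are placeholders, nothing here has been decided yet.

# Hypothetical CLI entry point for the crawler (argparse-based sketch only)
import argparse

def crawl(start_url, max_pages):
    # placeholder: plug in the crawling loop from the snippets further down
    print("would crawl %s (limit: %d pages)" % (start_url, max_pages))

def main():
    parser = argparse.ArgumentParser(description="Simple command-line web crawler")
    parser.add_argument("start_url", help="URL where the crawl starts")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="stop after this many pages")
    args = parser.parse_args()
    crawl(args.start_url, args.max_pages)

if __name__ == "__main__":
    main()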

MMihut commented 9 years ago

http://williamjturkel.net/2013/09/29/writing-a-simple-web-spider-using-command-line-tools-in-linux/

# Python 2 crawler snippet (urllib2/urlparse); a Python 3 sketch follows below
import re
import urllib2
import urlparse

tocrawl = set(["http://www.facebook.com/"])   # frontier: URLs still to visit
crawled = set([])                             # URLs already fetched
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\shref=[\'|"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break                                 # frontier exhausted
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()

    # page title
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print title

    # meta keywords, if present
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print keywordlist

    # collect links, make them absolute, queue the unseen ones
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
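
For reference, a rough Python 3 port of the same loop (my sketch, not part of the snippet above): urllib2/urlparse become urllib.request/urllib.parse, urljoin replaces the manual link rewriting, and the meta-keywords extraction is dropped for brevity.

# Python 3 sketch of the same crawler loop (not from the original snippet)
import re
from urllib.request import urlopen
from urllib.parse import urljoin

tocrawl = {"http://www.example.com/"}   # placeholder seed URL
crawled = set()
titleregex = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)
linkregex = re.compile(r'<a\s+[^>]*href=[\'"](.*?)[\'"]', re.IGNORECASE)

while tocrawl:
    crawling = tocrawl.pop()
    print(crawling)
    try:
        msg = urlopen(crawling).read().decode('utf-8', errors='replace')
    except Exception:
        continue
    crawled.add(crawling)
    match = titleregex.search(msg)
    if match:
        print(match.group(1).strip())
    for link in linkregex.findall(msg):
        absolute = urljoin(crawling, link)   # resolves /, # and relative links
        if absolute not in crawled:
            tocrawl.add(absolute)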

MMihut commented 9 years ago

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):

    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the start URL is passed in from the command line (see the note below)
        self.start_urls = [kwargs.get('start_url')]
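
With Scrapy, the `-a` flag is how `start_url` reaches `__init__`, e.g. `scrapy crawl my_spider -a start_url=http://www.example.com/`. The spider can also be driven from a plain Python script; the runner below is only a sketch and assumes `scrapy.crawler.CrawlerProcess` is available in the installed Scrapy version.

# hypothetical runner script (assumes scrapy.crawler.CrawlerProcess is available)
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider, start_url='http://www.example.com/')  # kwargs are forwarded to __init__
process.start()  # blocks until the crawl finishes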