canhduong28 / linkedpy

Simple LinkedIn jobs crawler using Redis-based Scrapy

Initial URLs #1

Closed: kaisarea closed this issue 8 years ago

kaisarea commented 11 years ago

First, I would like to say thank you; this is an amazing program!

I have a question.

Do I need to provide some initial URLs of LinkedIn profiles to seed this application? I have it running, MySQL is set up, and the login seems to have succeeded. I run:

$ scrapy crawl linkedin -a login=True

and get the following output, but no profiles are saved in the MySQL linked_profiles table.

2013-11-05 14:48:03-0800 [scrapy] INFO: Scrapy 0.18.4 started (bot: linkedpy)
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'linkedpy.spiders', 'REDIRECT_MAX_TIMES': 10000, 'CONCURRENT_REQUESTS_PER_DOMAIN': 100, 'SPIDER_MODULES': ['linkedpy.spiders'], 'BOT_NAME': 'linkedpy', 'DOWNLOAD_DELAY': 2}
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled item pipelines:
2013-11-05 14:48:03-0800 [linkedin] INFO: Spider opened
2013-11-05 14:48:03-0800 [linkedin] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-05 14:48:03-0800 [linkedin] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
{'password': 'password', 'key': 'my_email@domain.com'}
2013-11-05 14:48:06-0800 [linkedin] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/?trk=> from <POST https://www.linkedin.com/uas/login-submit>
2013-11-05 14:48:09-0800 [linkedin] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/?trk=> (referer: https://www.linkedin.com/uas/login)
2013-11-05 14:48:09-0800 [scrapy] INFO: Login successful!!!
2013-11-05 14:48:09-0800 [scrapy] INFO: No work available yet, Mission completed...
2013-11-05 14:48:11-0800 [linkedin] DEBUG: Redirecting (301) to <GET https://my.linkedin.com> from <GET http://my.linkedin.com>
2013-11-05 14:48:11-0800 [linkedin] DEBUG: Crawled (200) <GET https://my.linkedin.com> (referer: None)
2013-11-05 14:48:11-0800 [scrapy] INFO: Parsing urls from https://my.linkedin.com
2013-11-05 14:48:11-0800 [linkedin] INFO: Closing spider (finished)
2013-11-05 14:48:11-0800 [linkedin] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2694,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 104035,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2013, 11, 5, 22, 48, 11, 554699),
 'log_count/DEBUG': 11,
 'log_count/INFO': 6,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2013, 11, 5, 22, 48, 3, 618784)}
2013-11-05 14:48:11-0800 [linkedin] INFO: Spider closed (finished)
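The telling line in this log is "No work available yet, Mission completed...": since the crawler is Redis-based, the spider apparently pulls its work queue from Redis and exits when that queue is empty, so the login succeeds but nothing is crawled. Below is a minimal sketch of seeding such a queue with the redis-py client; the key name "linkedin:start_urls" and the seed URL are assumptions for illustration, not something the repo documents, so check the spider source for the key it actually reads:

    # Sketch: push seed URLs into the Redis queue before running the crawl.
    # ASSUMPTION: the key name "linkedin:start_urls" and the seed URL below
    # are illustrative; inspect the spider/scheduler code for the real key.
    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    seed_urls = [
        "http://www.linkedin.com/directory/people-a",  # hypothetical seed page
    ]

    for url in seed_urls:
        r.lpush("linkedin:start_urls", url)

    print(r.llen("linkedin:start_urls"), "URL(s) queued")

If the spider drains that key on its next run instead of logging "No work available yet", the queue name was right and the crawl has something to start from.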

canhduong28 commented 11 years ago

Hi @kaisarea

I'm glad to hear from you about the linkedpy program. Unfortunately, I wrote it about a year ago, and LinkedIn's pages have changed significantly since then. That is why the Scrapy spider currently fails to extract both the directory pages and the profiles, and why nothing ends up in your MySQL table.
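For anyone trying to revive the spider against the current markup, a reasonable first step is to test which extraction rules still match. Here is a minimal sketch using a recent Scrapy's Selector on a saved page (the 0.18-era API differs); the XPath is purely illustrative and should be compared against the expressions actually used in the project's spider code:

    # Sketch: check whether an extraction rule still matches the live markup.
    # ASSUMPTION: the XPath below is illustrative, not the one linkedpy uses;
    # compare it against the expressions in the project's spiders and update
    # any that return empty lists.
    from scrapy.selector import Selector

    with open("saved_profile_page.html") as f:  # a page saved while logged in
        sel = Selector(text=f.read())

    names = sel.xpath('//span[@class="full-name"]/text()').extract()
    print("matched:", names or "nothing - selector needs updating")

Running this against a freshly saved profile or directory page quickly shows which selectors come back empty and need rewriting for the new page structure.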