eracle / linkedin

Linkedin Scraper using Selenium Web Driver, Chromium headless, Docker and Scrapy

Page not found on LinkedIn #59

Closed: Glassback closed this issue 1 year ago

Glassback commented 4 years ago

Thanks for your amazing work! I'm trying to use your scraper, but it doesn't work: it redirects to a 404 page. Can you help me?

```
scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:18 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'COOKIES_ENABLED': False, 'DEPTH_PRIORITY': -1, 'DOWNLOAD_DELAY': 0.25, 'FEED_FORMAT': 'csv', 'FEED_URI': 'output.csv', 'NEWSPIDER_MODULE': 'linkedin.spiders', 'SPIDER_MODULES': ['linkedin.spiders'], 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:18 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:18 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
^C2020-04-19 12:10:18 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
^C2020-04-19 12:10:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
^CSearching for the Login btn
Searching for the password btn
2020-04-19 12:10:22 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 177, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 181, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 49, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in __init__
    self.cookies = login(driver)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
    get_by_xpath(driver, '//*[@id="password"]').send_keys(PASSWORD)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
    (By.XPATH, xpath)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
    value = method(self._driver)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in __call__
    return _find_element(driver, self.locator)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
    raise e
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
    return driver.find_element(*by)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
    'value': value})['value']
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=81.0.4044.92)

(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# sudo scrapy crawl companies -a selenium_hostname=localhost -o output.csv
sudo: scrapy: command not found
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:43 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'COOKIES_ENABLED': False, 'DEPTH_PRIORITY': -1, 'DOWNLOAD_DELAY': 0.25, 'FEED_FORMAT': 'csv', 'FEED_URI': 'output.csv', 'NEWSPIDER_MODULE': 'linkedin.spiders', 'SPIDER_MODULES': ['linkedin.spiders'], 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:43 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:43 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
Searching for the Login btn
Searching for the password btn
Searching for the submit
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.stats.DownloaderStats', 'linkedin.middlewares.SeleniumDownloaderMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-04-19 12:11:05 [scrapy.core.engine] INFO: Spider opened
2020-04-19 12:11:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Initializing chromium, remote url: http://localhost:4444/wd/hub
2020-04-19 12:11:08 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.linkedin.com/company/twitter>
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 36, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/home/glassback/linkedin/linkedin/middlewares.py", line 12, in process_request
    driver = init_chromium(spider.selenium_hostname, cookies)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 109, in init_chromium
    driver.add_cookie(cookie)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 894, in add_cookie
    self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid 'expiry'
  (Session info: chrome=81.0.4044.92)
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-19 12:11:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/selenium.common.exceptions.InvalidArgumentException': 1,
 'downloader/request_bytes': 57,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 2.54757,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 19, 19, 11, 8, 390260),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'memusage/max': 59977728,
 'memusage/startup': 59977728,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 4, 19, 19, 11, 5, 842690)}
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Spider closed (finished)
```
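Two distinct failures appear in the log. The first traceback comes right after the forced ^C shutdown, so the Selenium session was already gone when `login()` went looking for the password field. The second run dies in `init_chromium` because ChromeDriver rejects a non-integer `'expiry'` value when `driver.add_cookie()` replays the cached cookies. A minimal sketch of the usual workaround for the `'expiry'` error; the helper name is illustrative, not the repo's actual fix:

```python
# Hypothetical cookie-sanitizing helper: ChromeDriver only accepts an
# integer 'expiry', but cookies restored from a cached login session
# often carry a float timestamp, which triggers
# "invalid argument: invalid 'expiry'".
def sanitize_cookie(cookie: dict) -> dict:
    cookie = dict(cookie)  # avoid mutating the caller's dict
    if 'expiry' in cookie:
        cookie['expiry'] = int(cookie['expiry'])  # e.g. 1618859468.25 -> 1618859468
    return cookie

# The failing loop in init_chromium() would then read roughly:
#     for cookie in cookies:
#         driver.add_cookie(sanitize_cookie(cookie))
```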

eracle commented 4 years ago

Hey @Glassback, are you able to screenshot the page shown by the browser? You can access it using the shortcut in the Makefile: `make view` (assuming you have vinagre installed).

eracle commented 4 years ago

(screenshot attached: linkein404) Here we are, I'll let you know.

pwalimbe commented 3 years ago

So I debugged, and the step below in `extracts_see_all_url` is returning None, which then produces the "not found" error:

```python
see_all_elem = get_by_xpath(driver, see_all_xpath)
```

The accompanying data is:

```python
see_all_xpath = f'//*[starts-with(text(),"{SEE_ALL_PLACEHOLDER}")]'
SEE_ALL_PLACEHOLDER = 'See all'
```
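For context, the traceback earlier in the thread (`wait.py` -> `until` -> `expected_conditions`) suggests `get_by_xpath` is essentially a `WebDriverWait` around an expected condition. A rough reconstruction; the specific condition and the timeout below are assumptions, not the repo's exact code:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

def get_by_xpath(driver, xpath, timeout=10):
    # Blocks until the XPath matches; raises TimeoutException otherwise,
    # which a caller may swallow and turn into the None seen above.
    return WebDriverWait(driver, timeout).until(
        ec.presence_of_element_located((By.XPATH, xpath))
    )
```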

eracle commented 3 years ago

Hey @pwalimbe, thanks for commenting. Yes, the `extracts_see_all_url` method needs to be fixed. I suppose it is a matter of updating the XPath query that searches for the "See all" button. Maybe in your language there is a different placeholder, or LinkedIn has simply updated the page (see the sketch below). Let me know any additional thoughts. Regards
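One way the placeholder idea could be made locale-tolerant: try a list of translated "See all" labels until one matches, reusing the `get_by_xpath` sketch above. The placeholder list, helper name, and fallback-to-None behaviour are illustrative, not the repo's code:

```python
from selenium.common.exceptions import TimeoutException

# Candidate translations of the "See all" button label (assumed examples).
SEE_ALL_PLACEHOLDERS = ['See all', 'Vedi tutti', 'Voir tout']

def find_see_all_elem(driver):
    """Try each translated label in turn; return None if none match."""
    for placeholder in SEE_ALL_PLACEHOLDERS:
        xpath = f'//*[starts-with(text(),"{placeholder}")]'
        try:
            return get_by_xpath(driver, xpath, timeout=3)
        except TimeoutException:
            continue  # label not in this language; try the next one
    return None  # button missing, or LinkedIn changed the markup
```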

eracle commented 1 year ago

it should be fixed now