Closed · Glassback closed this issue 1 year ago
Hey @Glassback, are you able to screenshot the page shown by the browser? You can access it using the shortcut in the Makefile: `make view` (assuming you have vinagre installed).
Here it is, I'll let you know.
So I debugged, and the step below in `extracts_see_all_url` is returning `None`, which then produces the not-found error:

```python
see_all_elem = get_by_xpath(driver, see_all_xpath)
```

The accompanying data is:

```python
see_all_xpath = f'//*[starts-with(text(),"{SEE_ALL_PLACEHOLDER}")]'
SEE_ALL_PLACEHOLDER = 'See all'
```
Hey @pwalimbe, thanks for commenting. Yes, the `extracts_see_all_url` method needs to be fixed. I suppose it's a matter of updating the XPath query that searches for the "See all" button. Maybe the placeholder is different in your language, or LinkedIn has simply updated the page. Let me know if you have any additional thoughts. Regards
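For anyone debugging this locally, one way to check whether the placeholder matches your locale is to probe a few candidate button texts by hand against the live driver. This is only a sketch: the XPath template is the one from the snippet above, but `find_see_all_xpath` and every placeholder other than `'See all'` are illustrative guesses, not part of the repo.

```python
# Hypothetical helper: try locale-specific variants of the "See all"
# button text until one matches. Only 'See all' comes from the project;
# the other placeholders are guesses for non-English LinkedIn locales.
CANDIDATE_PLACEHOLDERS = ['See all', 'Voir les', 'Alle anzeigen']

def find_see_all_xpath(driver):
    for placeholder in CANDIDATE_PLACEHOLDERS:
        xpath = f'//*[starts-with(text(),"{placeholder}")]'
        # find_elements_* returns [] instead of raising, so probing is safe
        if driver.find_elements_by_xpath(xpath):
            return xpath
    return None  # button not found: page layout or language changed
```

If this returns `None` for every candidate, the page layout itself has probably changed and the XPath strategy (not just the placeholder) needs updating.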
It should be fixed now.
Thanks for your amazing work! I'm trying to use your scraper, but it doesn't work: it redirects to a 404 page. Can you help me?
```
$ scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:18 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'COOKIES_ENABLED': False, 'DEPTH_PRIORITY': -1, 'DOWNLOAD_DELAY': 0.25, 'FEED_FORMAT': 'csv', 'FEED_URI': 'output.csv', 'NEWSPIDER_MODULE': 'linkedin.spiders', 'SPIDER_MODULES': ['linkedin.spiders'], 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:18 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:18 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
^C2020-04-19 12:10:18 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
^C2020-04-19 12:10:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
^CSearching for the Login btn
Searching for the password btn
Unhandled error in Deferred:
2020-04-19 12:10:22 [twisted] CRITICAL: Unhandled error in Deferred:
```
```
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 177, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 181, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 49, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in __init__
    self.cookies = login(driver)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
    get_by_xpath(driver, '//*[@id="password"]').send_keys(PASSWORD)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
    (By.XPATH, xpath)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
    value = method(self._driver)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in __call__
    return _find_element(driver, self.locator)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
    raise e
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
    return driver.find_element(*by)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
    'value': value})['value']
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=81.0.4044.92)
```
```
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# sudo scrapy crawl companies -a selenium_hostname=localhost -o output.csv
sudo: scrapy: command not found
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:43 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'COOKIES_ENABLED': False, 'DEPTH_PRIORITY': -1, 'DOWNLOAD_DELAY': 0.25, 'FEED_FORMAT': 'csv', 'FEED_URI': 'output.csv', 'NEWSPIDER_MODULE': 'linkedin.spiders', 'SPIDER_MODULES': ['linkedin.spiders'], 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:43 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:43 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
Searching for the Login btn
Searching for the password btn
Searching for the submit
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.stats.DownloaderStats', 'linkedin.middlewares.SeleniumDownloaderMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-04-19 12:11:05 [scrapy.core.engine] INFO: Spider opened
2020-04-19 12:11:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Initializing chromium, remote url: http://localhost:4444/wd/hub
2020-04-19 12:11:08 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.linkedin.com/company/twitter>
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 36, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/home/glassback/linkedin/linkedin/middlewares.py", line 12, in process_request
    driver = init_chromium(spider.selenium_hostname, cookies)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 109, in init_chromium
    driver.add_cookie(cookie)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 894, in add_cookie
    self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid 'expiry'
  (Session info: chrome=81.0.4044.92)
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-19 12:11:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/selenium.common.exceptions.InvalidArgumentException': 1,
 'downloader/request_bytes': 57,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 2.54757,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 19, 19, 11, 8, 390260),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'memusage/max': 59977728,
 'memusage/startup': 59977728,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 4, 19, 19, 11, 5, 842690)}
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Spider closed (finished)
```
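The `InvalidArgumentException: invalid argument: invalid 'expiry'` comes from the `driver.add_cookie(cookie)` call in `init_chromium`: ChromeDriver expects the cookie's `expiry` field to be an integer timestamp and rejects anything else. A minimal sketch of one possible workaround, sanitizing each cookie before replaying it (the helper name is hypothetical; only `driver.add_cookie` and the cookie dict shape come from the traceback):

```python
def sanitize_cookie(cookie):
    """Make a cookie dict acceptable to ChromeDriver's add_cookie.

    ChromeDriver rejects a non-integer 'expiry' with the
    "invalid argument: invalid 'expiry'" error seen above; casting it
    to int (or dropping it when that fails) works around that.
    """
    cookie = dict(cookie)  # don't mutate the caller's copy
    if 'expiry' in cookie:
        try:
            cookie['expiry'] = int(cookie['expiry'])
        except (TypeError, ValueError):
            # a session cookie is better than a crashed driver
            del cookie['expiry']
    return cookie

# Inside init_chromium, the add_cookie loop would then become:
#     driver.add_cookie(sanitize_cookie(cookie))
```

This is a sketch under the assumption that the cached cookies carry a float or otherwise malformed `expiry`; if the cookies are corrupted in some other way, deleting the cookie cache and logging in fresh is the simpler fix.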