nicolabertoldi closed this issue 2 years ago.
Could you provide a minimal working example so that I can reproduce the issue? E.g., a config file (if you were using CLI mode) or a few lines of code (for library mode).
I am using the CLI. I just run
news-please
with the following standard config file (I just modified the URL):
# This is an HJSON file, so comments and so on can be used! See https://hjson.org/
# Furthermore, this is first of all the actual config file, but by default it is just filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # Start crawling from ladigetto.it
      "url": "https://ladigetto.it",
      # Overwrite the default crawler and use the RecursiveCrawler instead
      "crawler": "RecursiveCrawler",
      # Because this site is weird, use the meta_contains_article_keyword
      # heuristic and disable all others, because overwrite will merge the
      # defaults from "newscrawler.cfg" with this
      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      },
      # Also state the pass condition; all heuristics used in the condition
      # have to be activated in "overwrite_heuristics" (or by default) as well.
      "pass_heuristics_condition": "meta_contains_article_keyword"
    }
  ]
}
Thanks. The stack trace you posted is pretty short; is it complete or just an excerpt? I.e., it's missing the news-please modules. Could you please post the full stack trace?
The full log:
$> news-please
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[scrapy.utils.log:146|INFO] Scrapy 2.1.0 started (bot: news-please)
[scrapy.utils.log:149|INFO] Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.8 (default, Dec 24 2018, 19:24:27) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-47-generic-x86_64-with-Ubuntu-16.04-xenial
[scrapy.crawler:60|INFO] Overridden settings:
{'BOT_NAME': 'news-please',
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'LOG_FORMAT': '[%(name)s:%(lineno)d|%(levelname)s] %(message)s',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'newsplease.crawler.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newsplease.crawler.spiders'],
'USER_AGENT': 'news-please (+http://www.example.com/)'}
[scrapy.extensions.telnet:55|INFO] Telnet Password: 961215d93e9de0ec
[scrapy.middleware:48|INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.spiderstate.SpiderState']
[scrapy.middleware:48|INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[scrapy.middleware:48|INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: newspaper_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: readability_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: date_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: lang_detect_extractor
[scrapy.middleware:48|INFO] Enabled item pipelines:
['newsplease.pipeline.pipelines.ArticleMasterExtractor',
'newsplease.pipeline.pipelines.HtmlFileStorage',
'newsplease.pipeline.pipelines.JsonFileStorage']
[scrapy.core.engine:268|INFO] Spider opened
[scrapy.extensions.logstats:48|INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet:69|INFO] Telnet console listening on 127.0.0.1:6023
[scrapy.core.downloader.tls:84|WARNING] Remote certificate is not valid for hostname "www.ladigetto.it"; '*.ladigetto.it'!='www.ladigetto.it'
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://www.ladigetto.it/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 117, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/newsplease/crawler/spiders/recursive_crawler.py", line 49, in parse
self.ignore_file_extensions):
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 116, in recursive_requests
for href in response.css("a::attr('href')").extract() if re.match(
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 120, in <listcomp>
and len(re.match(ignore_regex, response.urljoin(href)).group(0)) == 0
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 69, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: mailto:redazione@ladigetto.it
[scrapy.core.engine:306|INFO] Closing spider (finished)
[scrapy.statscollectors:47|INFO] Dumping Scrapy stats:
{'downloader/request_bytes': 1802,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 18169,
'downloader/response_count': 8,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 5,
'elapsed_time_seconds': 0.725113,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 31, 15, 40, 57, 72473),
'log_count/ERROR': 1,
'log_count/INFO': 14,
'log_count/WARNING': 1,
'memusage/max': 103735296,
'memusage/startup': 103735296,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/disk': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/disk': 3,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 5, 31, 15, 40, 56, 347360)}
[scrapy.core.engine:337|INFO] Spider closed (finished)
[newsplease.__main__:276|INFO] Graceful stop called manually. Shutting down.
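(Side note: the ValueError at the bottom of the trace comes directly from Scrapy's own URL validation, not from news-please itself. A minimal sketch with Scrapy 2.1, the version shown in the log, assuming only that scrapy is installed:)

from scrapy.http import Request

try:
    # Scrapy 2.x refuses to build a Request for a URL without an http(s)-style
    # scheme, which is exactly what happens to the mailto: link found on the page.
    Request("mailto:redazione@ladigetto.it")
except ValueError as e:
    print(e)  # -> Missing scheme in request url: mailto:redazione@ladigetto.it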
Okay, I will need to look into this. Not sure if it's related to this issue, but there are quite a few warnings and errors regarding both your env and the site, see below. If you fix your env, particularly the certificate issue, does the issue persist?
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
...
[scrapy.core.downloader.tls:84|WARNING] Remote certificate is not valid for hostname "www.ladigetto.it"; '*.ladigetto.it'!='www.ladigetto.it'
I will try soon, but it seems that the error is not due to certificates or the "service_identity" module. As you said, those are just warnings.
I solved the warnings, but the error is still there.
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[__main__:260|INFO] Removed /home/nicola/news-please-repo/.resume_jobdir/04451a9020ce2dd6898045f90a11cb3d since '--resume' was not passed to initial.py or this crawler was daemonized.
[scrapy.utils.log:146|INFO] Scrapy 2.1.0 started (bot: news-please)
[scrapy.utils.log:149|INFO] Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.8 (default, Dec 24 2018, 19:24:27) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-47-generic-x86_64-with-Ubuntu-16.04-xenial
[scrapy.crawler:60|INFO] Overridden settings:
{'BOT_NAME': 'news-please',
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'LOG_FORMAT': '[%(name)s:%(lineno)d|%(levelname)s] %(message)s',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'newsplease.crawler.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newsplease.crawler.spiders'],
'USER_AGENT': 'news-please (+http://www.example.com/)'}
[scrapy.extensions.telnet:55|INFO] Telnet Password: 5b97100858c721da
[scrapy.middleware:48|INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.spiderstate.SpiderState']
[scrapy.middleware:48|INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[scrapy.middleware:48|INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: newspaper_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: readability_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: date_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: lang_detect_extractor
[scrapy.middleware:48|INFO] Enabled item pipelines:
['newsplease.pipeline.pipelines.ArticleMasterExtractor',
'newsplease.pipeline.pipelines.HtmlFileStorage',
'newsplease.pipeline.pipelines.JsonFileStorage']
[scrapy.core.engine:268|INFO] Spider opened
[scrapy.extensions.logstats:48|INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet:69|INFO] Telnet console listening on 127.0.0.1:6023
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://www.ladigetto.it/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 117, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/newsplease/crawler/spiders/recursive_crawler.py", line 49, in parse
self.ignore_file_extensions):
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 116, in recursive_requests
for href in response.css("a::attr('href')").extract() if re.match(
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 120, in <listcomp>
and len(re.match(ignore_regex, response.urljoin(href)).group(0)) == 0
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 69, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: mailto:redazione@ladigetto.it
[scrapy.core.engine:306|INFO] Closing spider (finished)
[scrapy.statscollectors:47|INFO] Dumping Scrapy stats:
{'downloader/request_bytes': 1802,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 18408,
'downloader/response_count': 8,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 5,
'elapsed_time_seconds': 0.731866,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 31, 16, 26, 50, 257276),
'log_count/ERROR': 1,
'log_count/INFO': 14,
'memusage/max': 106803200,
'memusage/startup': 106803200,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/disk': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/disk': 3,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 5, 31, 16, 26, 49, 525410)}
[scrapy.core.engine:337|INFO] Spider closed (finished)
[newsplease.__main__:276|INFO] Graceful stop called manually. Shutting down.
For me, the problem was solved by defining an ignore_regex in config.cfg:
ignore_regex = "(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"
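To see what the pattern does, here is a simplified standalone sketch (not news-please's actual filter code): any href whose URL starts with one of the alternatives is dropped before a Scrapy Request is ever built for it.

import re

# The ignore_regex suggested above.
IGNORE_REGEX = r"(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"

hrefs = [
    "mailto:redazione@ladigetto.it",   # the link that triggered the ValueError
    "javascript:void(0)",
    "tel:+390461000000",               # hypothetical example number
    "https://www.ladigetto.it/",
]

# re.match anchors at the start of the string, so only URLs beginning with
# mailto/javascript/tel/fax are filtered out; normal http(s) links survive.
crawlable = [h for h in hrefs if re.match(IGNORE_REGEX, h) is None]
print(crawlable)  # -> ['https://www.ladigetto.it/']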
This should be in the defaults
added it :)
Describe your question: I am trying to crawl the URL "http://www.ladigetto.it" and I get the following error from scrapy.
Looking at the content of the initial page of that URL, I can actually see the href that generates the error:
<a href="mailto:redazione@ladigetto.it">redazione@ladigetto.it</a>
Is it possible to force news-please not to follow this kind of link? Can it be done through the configuration file?
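(This is what the ignore_regex fix above looks like in config.cfg; the [Crawler] section name is an assumption, so check where the existing ignore_regex entry sits in your copy and set it there:)

[Crawler]
# Skip links matching this regex, e.g. mailto:, javascript:, tel:, fax:
ignore_regex = "(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"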