nicolabertoldi closed this issue 2 years ago.
Could you provide a minimal working example so that I can reproduce the issue? E.g., a config file (if you were using CLI mode) or a few lines of code (for library mode).
I am using the CLI. I just run
news-please
with the following standard config file (I just modified the URL):
# This is an HJSON file, so comments and so on can be used! See https://hjson.org/
# Furthermore, this is first of all the actual config file, but by default it is just filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # Start crawling from ladigetto.it
      "url": "https://ladigetto.it",
      # Overwrite the default crawler and use the RecursiveCrawler instead
      "crawler": "RecursiveCrawler",
      # Because this site is weird, use the meta_contains_article_keyword
      # heuristic and disable all others, because overwrite will merge the
      # defaults from "newscrawler.cfg" with this
      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      },
      # Also state the pass condition; all heuristics used in the condition
      # have to be activated in "overwrite_heuristics" (or by default) as well.
      "pass_heuristics_condition": "meta_contains_article_keyword"
    }
  ]
}
Thanks. The stack trace you posted is pretty short; is it complete or just an excerpt? I.e., it's missing the news-please modules. Could you please post the full stack trace?
The full log:
$> news-please
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[scrapy.utils.log:146|INFO] Scrapy 2.1.0 started (bot: news-please)
[scrapy.utils.log:149|INFO] Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.8 (default, Dec 24 2018, 19:24:27) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-47-generic-x86_64-with-Ubuntu-16.04-xenial
[scrapy.crawler:60|INFO] Overridden settings:
{'BOT_NAME': 'news-please',
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'LOG_FORMAT': '[%(name)s:%(lineno)d|%(levelname)s] %(message)s',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'newsplease.crawler.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newsplease.crawler.spiders'],
'USER_AGENT': 'news-please (+http://www.example.com/)'}
[scrapy.extensions.telnet:55|INFO] Telnet Password: 961215d93e9de0ec
[scrapy.middleware:48|INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.spiderstate.SpiderState']
[scrapy.middleware:48|INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[scrapy.middleware:48|INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: newspaper_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: readability_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: date_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: lang_detect_extractor
[scrapy.middleware:48|INFO] Enabled item pipelines:
['newsplease.pipeline.pipelines.ArticleMasterExtractor',
'newsplease.pipeline.pipelines.HtmlFileStorage',
'newsplease.pipeline.pipelines.JsonFileStorage']
[scrapy.core.engine:268|INFO] Spider opened
[scrapy.extensions.logstats:48|INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet:69|INFO] Telnet console listening on 127.0.0.1:6023
[scrapy.core.downloader.tls:84|WARNING] Remote certificate is not valid for hostname "www.ladigetto.it"; '*.ladigetto.it'!='www.ladigetto.it'
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://www.ladigetto.it/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 117, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/newsplease/crawler/spiders/recursive_crawler.py", line 49, in parse
self.ignore_file_extensions):
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 116, in recursive_requests
for href in response.css("a::attr('href')").extract() if re.match(
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 120, in <listcomp>
and len(re.match(ignore_regex, response.urljoin(href)).group(0)) == 0
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 69, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: mailto:redazione@ladigetto.it
[scrapy.core.engine:306|INFO] Closing spider (finished)
[scrapy.statscollectors:47|INFO] Dumping Scrapy stats:
{'downloader/request_bytes': 1802,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 18169,
'downloader/response_count': 8,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 5,
'elapsed_time_seconds': 0.725113,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 31, 15, 40, 57, 72473),
'log_count/ERROR': 1,
'log_count/INFO': 14,
'log_count/WARNING': 1,
'memusage/max': 103735296,
'memusage/startup': 103735296,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/disk': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/disk': 3,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 5, 31, 15, 40, 56, 347360)}
[scrapy.core.engine:337|INFO] Spider closed (finished)
[newsplease.__main__:276|INFO] Graceful stop called manually. Shutting down.
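(Side note: the ValueError at the bottom of the trace comes directly from Scrapy's own URL validation, not from news-please itself. A minimal sketch with Scrapy 2.1, the version shown in the log, assuming only that scrapy is installed:)

from scrapy.http import Request

try:
    # Scrapy 2.x refuses to build a Request for a URL without an http(s)-style
    # scheme, which is exactly what happens to the mailto: link found on the page.
    Request("mailto:redazione@ladigetto.it")
except ValueError as e:
    print(e)  # -> Missing scheme in request url: mailto:redazione@ladigetto.it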
Okay, I will need to look into this. Not sure if it's related to this issue, but there are quite a few warnings and errors regarding both your env and the site, see below. If you fix your env, particularly the certificate issue, does the issue persist?
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.9) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
...
[scrapy.core.downloader.tls:84|WARNING] Remote certificate is not valid for hostname "www.ladigetto.it"; '*.ladigetto.it'!='www.ladigetto.it'
I will try soon, but it seems that the error is not due to certificates or the "service_identity" module. As you said, those are just warnings.
I solved the warnings, but the error is still there.
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[newsplease.config:161|INFO] Loading config-file (/home/nicola/news-please-repo/config/config.cfg)
[__main__:260|INFO] Removed /home/nicola/news-please-repo/.resume_jobdir/04451a9020ce2dd6898045f90a11cb3d since '--resume' was not passed to initial.py or this crawler was daemonized.
[scrapy.utils.log:146|INFO] Scrapy 2.1.0 started (bot: news-please)
[scrapy.utils.log:149|INFO] Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.8 (default, Dec 24 2018, 19:24:27) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-47-generic-x86_64-with-Ubuntu-16.04-xenial
[scrapy.crawler:60|INFO] Overridden settings:
{'BOT_NAME': 'news-please',
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'LOG_FORMAT': '[%(name)s:%(lineno)d|%(levelname)s] %(message)s',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'newsplease.crawler.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newsplease.crawler.spiders'],
'USER_AGENT': 'news-please (+http://www.example.com/)'}
[scrapy.extensions.telnet:55|INFO] Telnet Password: 5b97100858c721da
[scrapy.middleware:48|INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.spiderstate.SpiderState']
[scrapy.middleware:48|INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[scrapy.middleware:48|INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: newspaper_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: readability_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: date_extractor
[newsplease.pipeline.extractor.article_extractor:34|INFO] Extractor initialized: lang_detect_extractor
[scrapy.middleware:48|INFO] Enabled item pipelines:
['newsplease.pipeline.pipelines.ArticleMasterExtractor',
'newsplease.pipeline.pipelines.HtmlFileStorage',
'newsplease.pipeline.pipelines.JsonFileStorage']
[scrapy.core.engine:268|INFO] Spider opened
[scrapy.extensions.logstats:48|INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet:69|INFO] Telnet console listening on 127.0.0.1:6023
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://www.ladigetto.it/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 117, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.6/dist-packages/newsplease/crawler/spiders/recursive_crawler.py", line 49, in parse
self.ignore_file_extensions):
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 116, in recursive_requests
for href in response.css("a::attr('href')").extract() if re.match(
File "/usr/local/lib/python3.6/dist-packages/newsplease/helper_classes/parse_crawler.py", line 120, in <listcomp>
and len(re.match(ignore_regex, response.urljoin(href)).group(0)) == 0
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python3.6/dist-packages/scrapy/http/request/__init__.py", line 69, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: mailto:redazione@ladigetto.it
[scrapy.core.engine:306|INFO] Closing spider (finished)
[scrapy.statscollectors:47|INFO] Dumping Scrapy stats:
{'downloader/request_bytes': 1802,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 18408,
'downloader/response_count': 8,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 5,
'elapsed_time_seconds': 0.731866,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 31, 16, 26, 50, 257276),
'log_count/ERROR': 1,
'log_count/INFO': 14,
'memusage/max': 106803200,
'memusage/startup': 106803200,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/disk': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/disk': 3,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 5, 31, 16, 26, 49, 525410)}
[scrapy.core.engine:337|INFO] Spider closed (finished)
[newsplease.__main__:276|INFO] Graceful stop called manually. Shutting down.
For me, the problem was solved by defining an ignore_regex in config.cfg:
ignore_regex = "(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"
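To see what the pattern does, here is a simplified standalone sketch (not news-please's actual filter code): any href whose URL starts with one of the alternatives is dropped before a Scrapy Request is ever built for it.

import re

# The ignore_regex suggested above.
IGNORE_REGEX = r"(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"

hrefs = [
    "mailto:redazione@ladigetto.it",   # the link that triggered the ValueError
    "javascript:void(0)",
    "tel:+390461000000",               # hypothetical example number
    "https://www.ladigetto.it/",
]

# re.match anchors at the start of the string, so only URLs beginning with
# mailto/javascript/tel/fax are filtered out; normal http(s) links survive.
crawlable = [h for h in hrefs if re.match(IGNORE_REGEX, h) is None]
print(crawlable)  # -> ['https://www.ladigetto.it/']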
This should be in the defaults
added it :)
Describe your question: I am trying to crawl the URL "http://www.ladigetto.it" and I get the following error from scrapy.
Looking at the content of the initial page of that URL, I can actually see the href that generates the error:
<a href="mailto:redazione@ladigetto.it">redazione@ladigetto.it</a>
Is it possible to force news-please not to follow this kind of link? Can it be done through the configuration file?
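(This is what the ignore_regex fix above looks like in config.cfg; the [Crawler] section name is an assumption, so check where the existing ignore_regex entry sits in your copy and set it there:)

[Crawler]
# Skip links matching this regex, e.g. mailto:, javascript:, tel:, fax:
ignore_regex = "(mail[tT]o)|([jJ]avascript)|(tel)|(fax)"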