custom-components / feedparser

Feed sensor returns empty state for valid RSS #112

Closed: juliomatcom closed this issue 8 months ago

juliomatcom commented 9 months ago

Hi all, I'm using the following configuration, but no entries are retrieved:

sensor:
  - platform: feedparser
    name: elcomercio
    feed_url: 'https://www.elcomercio.es/rss/2.0/?section=gijon'
    show_topn: 20
    scan_interval:
      hours: 1

It does work with the example provided, so I guess the problem is related to the XML format or the response somehow.

ogajduse commented 9 months ago

The feed that you are using seems to be valid.

It looks like the upstream feedparser library does not handle the extra HTTP GET parameters that follow the `?`; the server answers the request with a 403:

$ python
Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> parsed_feed = feedparser.parse('https://www.elcomercio.es/rss/2.0/?section=gijon')
>>> parsed_feed
{'bozo': 1, 'entries': [], 'feed': {'summary': '<h1>Access Denied</h1>\n \nYou don\'t have permission to access "http://www.elcomercio.es/rss/2.0/?" on this server.<p>\nReference #18.bf361060.1702380172.cd132963'}, 'headers': {'server': 'AkamaiGHost', 'content-length': '292', 'content-type': 'text/html', 'mime-version': '1.0', 'vary': 'User-Agent,Cookie,Accept-Encoding', 'alt-svc': 'h3=":443"; ma=93600', 'expires': 'Tue, 12 Dec 2023 11:22:52 GMT', 'cache-control': 'max-age=0, no-cache', 'pragma': 'no-cache', 'date': 'Tue, 12 Dec 2023 11:22:52 GMT', 'connection': 'close'}, 'href': 'https://www.elcomercio.es/rss/2.0/?section=gijon', 'status': 403, 'encoding': 'us-ascii', 'bozo_exception': SAXParseException('mismatched tag'), 'version': '', 'namespaces': {}}
>>> parsed_feed.status
403
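
The checks done by hand in the session above can be wrapped in a small helper. This is only a minimal sketch and not part of the integration: `diagnose_feed` is a hypothetical name, and it relies solely on the `status`, `bozo`, and `entries` keys visible in the output above.

```python
def diagnose_feed(parsed):
    """Return a short reason why a parsed feed has no usable entries.

    `parsed` is any mapping shaped like the feedparser result shown above
    (hypothetical helper, for illustration only).
    """
    status = parsed.get("status")
    # An HTTP error status explains the empty result better than the parse error.
    if status is not None and status >= 400:
        return f"HTTP error {status}: the server rejected the request"
    # bozo is set when the response body could not be parsed cleanly.
    if parsed.get("bozo") and not parsed.get("entries"):
        return "response could not be parsed as a feed"
    if not parsed.get("entries"):
        return "feed parsed but contains no entries"
    return "ok"
```

Called on the result above, this would point at the 403 rather than at the `mismatched tag` parse error, which is only a consequence of the HTML error page being returned.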

https://github.com/kurtmckee/feedparser/issues/385 describes the same issue. We could use the requests library to add the extra headers that the elcomercio RSS feed requires to the HTTP request.

>>> import feedparser
>>> import requests
>>> response = requests.get("https://www.elcomercio.es/rss/2.0/?section=gijon", headers={"User-Agent": "someagent"})
>>> response.ok
True
>>> response.text
'<?xml version="1.0" encoding="UTF-8"?>\n<rss xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">\n  <channel>\n    <atom:link href="https://www.elcomercio.es/rss/2.0/?section=gijon"  ... output omitted ...'
>>> parsed_feed = feedparser.parse(response.text)
>>> len(parsed_feed.entries)
100
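
The same workaround can be sketched as a reusable function. This is not the code from #115; it is a rough illustration that swaps `requests` for the standard library's `urllib.request` so it has no third-party dependencies. The names `build_feed_request` and `fetch_feed_text` are hypothetical, and `"someagent"` is simply the placeholder User-Agent from the session above.

```python
import urllib.request


def build_feed_request(url, user_agent="someagent"):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_feed_text(url, user_agent="someagent", timeout=10):
    """Download the feed body as text, sending the custom User-Agent.

    Raises urllib.error.HTTPError if the server still answers with an
    error status such as 403.
    """
    request = build_feed_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

The returned text can then be handed to `feedparser.parse(...)` exactly as in the session above. Whether a given server accepts a particular User-Agent string is up to that server, so the value may need adjusting.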

I can take it and fix it. That should be a simple fix.

ogajduse commented 9 months ago

@juliomatcom #115 should fix the issue you are seeing. I would be glad if you could test it and confirm that it fixes the issue for you.

However, this specific feed does not provide the full URL to the image, so you will not be able to render an image in your Lovelace.

juliomatcom commented 9 months ago

Hi @ogajduse, thank you for taking a look. I cloned your repo and switched to the feat/add-http-headers-to-request branch, but the state still has no data from https://www.elcomercio.es/rss/2.0/?section=gijon. Is this how I should test it? I don't have any experience debugging HA or Python.

ogajduse commented 8 months ago

@juliomatcom I have merged #115 into master and released https://github.com/custom-components/feedparser/releases/tag/0.2.0b6. That should allow you to install the beta release directly from HACS. Check the HACS docs for how to install it. I am still interested in your feedback.