dshanske / parse-this

Parse This: Parsing Library for WordPress - Can Act as a Standalone Plugin
GNU General Public License v2.0

Add application/rss+xml detection for text/html content-type #67

Open bekopharm opened 4 years ago

bekopharm commented 4 years ago

Hej, I have a feed for my Yarn server that comes from a Typo3 system, which delivers the feed as text/html.

While this is wrong (in theory), the W3C validator only issues a warning and otherwise accepts the feed as valid: https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fwww.baden-wuerttemberg.de%2Fde%2Fservice%2Frss%2Fxml%2Frss-alle-meldungen%2F

Other readers I tried don't make a fuss either, so I traced the issue down to the fetch() function of the Parse_This_Discovery class, where the feed ends up in the text/html block and yields an empty result.

I ended up with a workaround, adding a proxy rule to my nginx config like this:

    location /bw-feed {
      proxy_pass https://www.baden-wuerttemberg.de/de/service/rss/xml/rss-alle-meldungen/;
      proxy_set_header Accept "application/rss+xml;q=0.9,image/webp,*/*;q=0.8";
      proxy_hide_header Content-Type;
      add_header Content-Type "application/rss+xml; charset=utf-8";
    }

This overrides the Content-Type sent by the remote server, which satisfies Parse-This and allows the feed to be parsed just fine. The rule also requests application/rss+xml via the Accept header, but the remote server doesn't care and always ships text/html.

While this is certainly not ideal, I can now read the feed on my Yarn MicroSub server.

Please add or improve application/rss+xml detection for responses served with Content-Type text/html. I have run into this with Typo3 systems before, and there are probably many sites out there advertising the wrong type.
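To make the request more concrete, here is a rough sketch of the kind of detection I have in mind: when the declared type is text/html, peek at the start of the body and check whether the first element is a feed root. This is Python rather than PHP, and it is not how Parse-This works today. The function name `sniff_feed`, the list of XML types, and the 2048-character peek window are all my own assumptions for illustration.

```python
import re

# Assumption: media types that are treated as a feed without inspecting the body.
XML_FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}

# Root element names that indicate a feed: RSS 2.0, Atom, RSS 1.0 (rdf:RDF).
FEED_ROOTS = {"rss", "feed", "rdf"}

# Skip an optional BOM, whitespace, the XML declaration, processing
# instructions, comments and a DOCTYPE, then capture the name of the first
# element (group 1 is an optional namespace prefix, group 2 the local name).
_FIRST_TAG = re.compile(
    r"^\ufeff?\s*(?:<\?.*?\?>\s*|<!--.*?-->\s*|<!DOCTYPE[^>]*>\s*)*"
    r"<([A-Za-z_][\w.\-]*:)?([A-Za-z_][\w.\-]*)",
    re.DOTALL | re.IGNORECASE,
)


def sniff_feed(content_type: str, body: str) -> bool:
    """Guess whether a response is a feed (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" from the header value.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_FEED_TYPES:
        return True
    if media_type != "text/html":
        return False
    # Mislabelled feed? Only look at the beginning of the document.
    match = _FIRST_TAG.match(body[:2048])
    return bool(match) and match.group(2).lower() in FEED_ROOTS


if __name__ == "__main__":
    feed = '<?xml version="1.0" encoding="utf-8"?>\n<rss version="2.0"><channel/></rss>'
    page = "<!DOCTYPE html>\n<html><head></head><body></body></html>"
    print(sniff_feed("text/html; charset=utf-8", feed))
    print(sniff_feed("text/html; charset=utf-8", page))
```

A real HTML page starts with `<html>` (or a DOCTYPE followed by it), so the check should only redirect responses whose body is actually RSS, Atom or RDF. Everything else labelled text/html would stay in the existing HTML path.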

dshanske commented 4 years ago

They are misrepresenting the feed, though.

dshanske commented 4 years ago

I'm getting it as text/xml, which I do interpret as a feed.

bekopharm commented 4 years ago

Yes, they fixed it because I notified them about the problem.

Kinda surprised that a government website fixed something like this within 5 days.

This is still a very common problem, usually with Typo3 backends.

If you need it served as text/html again for testing, I'd gladly reconfigure my proxy to do so :)