RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

URL not properly formed with diacritics/accents not encoded #3991

Open wrobelda opened 4 months ago

wrobelda commented 4 months ago

Describe the bug: If any of the feed query parameters contains diacritic (accented) characters, they are left as is and not percent-encoded, which causes some clients to reject the RSS feed with a "URL invalid" error. See: https://stackoverflow.com/questions/33211310/convert-french-accent-to-specific-encoding-in-php

To Reproduce: Steps to reproduce the behavior:

  1. For any chosen bridge which takes a text parameter, use a string containing diacritic/accented characters (or copy and paste this: ąśćż)
  2. Generate feed URL
  3. Copy that feed URL to an RSS client of choice (it fails here with TT-RSS at least)
  4. See error

Expected behavior: Diacritics/accents should be properly encoded

dvikan commented 4 months ago

i think this is a bug in TT-RSS or your browser. im not sure

wrobelda commented 4 months ago

> i think this is a bug in TT-RSS or your browser. im not sure

Sorry, what bug? Per RFC 3986, section 2.3, a URL may only comprise a specific character set, which does not contain non-ASCII characters, period. Any other characters need to be UTF-8 encoded (and then percent-encoded), per RFC 3987.

Meanwhile, RSS-Bridge allows those characters to make it into the URL. Sure, modern browsers and some clients will automatically UTF-8 encode such a query before they send it out to web servers, but RSS-Bridge should not rely on that and should instead generate a feed URL that conforms to the standards.

See also: https://www.w3schools.com/tags/ref_urlencode.ASP
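
For illustration, here is a minimal sketch of what "properly encoded" means for the test string from the report. It uses Python's standard library rather than RSS-Bridge's own code, and the helper name `encode_query_value` is hypothetical:

```python
from urllib.parse import quote

def encode_query_value(value: str) -> str:
    """Percent-encode a query parameter value from its UTF-8 bytes."""
    # safe="" also encodes "/" so the value cannot be mistaken for URL structure.
    return quote(value, safe="", encoding="utf-8")

encoded = encode_query_value("ąśćż")
print(encoded)            # %C4%85%C5%9B%C4%87%C5%BC
print(encoded.isascii())  # True
```

Each accented character becomes two UTF-8 bytes, and each byte becomes one `%XX` triplet, so the result contains ASCII only.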

dvikan commented 4 months ago

are you copy pasting url from browser?

are you talking about those urls that are produced inside <link> tags?

i was unable to reproduce. using firefox.

Bockiii commented 4 months ago

Reproducible.

Search and result on reddit with a German umlaut "ä". Similar problem to the accented French characters. (image)

RSS-Bridge config: (image)

Result on dvikan's public instance: (image)

dvikan commented 4 months ago

okay i get it.

it happens when parameters are used in http requests without url encoding them.

in the particular case of RedditBridge a solution is to manually url encode the user input parts.

related: https://github.com/RSS-Bridge/rss-bridge/issues/3091
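
A rough sketch of that per-bridge approach, written in Python for illustration only. The endpoint `https://example.org/search.rss`, the parameter names, and the helper `build_search_url` are all made up and are not taken from RedditBridge:

```python
from urllib.parse import urlencode

def build_search_url(base: str, user_query: str) -> str:
    # urlencode percent-encodes every value from its UTF-8 bytes
    # (and turns spaces into "+"), so user input cannot break the URL.
    return base + "?" + urlencode({"q": user_query, "sort": "new"})

url = build_search_url("https://example.org/search.rss", "Käse")
print(url)  # https://example.org/search.rss?q=K%C3%A4se&sort=new
```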

wrobelda commented 3 months ago

> in the particular case of RedditBridge a solution is to manually url encode the user input parts.

That means each and every bridge has to handle encoding itself for each of its arbitrary string inputs, whereas RSS-Bridge could do this once by encoding the complete feed URL it generates. There's no harm here: any characters needing encoding will get encoded, and all others will be left as is.

Not to mention that the bridge code should not be concerned with things like that. Its scope is to prepare articles and their content in UTF-8, not to handle the intricacies of HTTP communication between the RSS-Bridge server and an RSS client.
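
One way to picture such a single, central pass is the sketch below. It is Python, not RSS-Bridge code, and `encode_feed_url` and `URL_SAFE` are hypothetical names. It leaves URL delimiters and existing `%XX` escapes alone and encodes everything else:

```python
from urllib.parse import quote

# RFC 3986 reserved characters plus "%", so delimiters and existing
# escapes pass through untouched. Unreserved characters are always kept.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"

def encode_feed_url(url: str) -> str:
    """Percent-encode anything in a complete URL that is not URL-safe ASCII."""
    return quote(url, safe=URL_SAFE)

raw = "https://example.org/?action=display&filter=ąśćż"
print(encode_feed_url(raw))
# https://example.org/?action=display&filter=%C4%85%C5%9B%C4%87%C5%BC
```

Running the function twice gives the same result as running it once. The trade-off of working on the assembled URL is that a literal `%` or `&` inside a user value can no longer be told apart from URL syntax, so encoding each value before the URL is assembled is the more robust variant.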

No offense, but I think you downplay the seriousness of this issue for any non-ASCII languages.

dvikan commented 3 months ago

I like your arguments. Okay let me dwell a bit on it.

dvikan commented 3 months ago

@Bockiii fixed for reddit in https://github.com/RSS-Bridge/rss-bridge/pull/4010

dvikan commented 3 months ago

i have discovered that curl will automatically escape the url if needed.

but if curl detects an already escaped url, it will NOT escape.

so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge),
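
The following is a toy model of the behaviour described above, assuming it works as stated. It is not curl's actual implementation, and the function name is hypothetical. It only shows why a partially escaped URL slips through with raw non-ASCII characters:

```python
import re
from urllib.parse import quote

ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def escape_unless_already_escaped(url: str) -> str:
    # Assumed heuristic: any %XX triplet means "already escaped".
    if ESCAPE.search(url):
        return url
    return quote(url, safe=":/?#[]@!$&'()*+,;=")

print(escape_unless_already_escaped("https://example.org/?q=ä"))
# https://example.org/?q=%C3%A4
print(escape_unless_already_escaped("https://example.org/?q=ä&t=a%20b"))
# https://example.org/?q=ä&t=a%20b
```

In the second call the single `%20` is enough to skip escaping, so the `ä` stays raw.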

wrobelda commented 3 months ago

> i have discovered that curl will automatically escape the url if needed.
>
> but if curl detects an already escaped url, it will NOT escape.
>
> so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge).

The problem here is not with how RSS-Bridge handles that internally (i.e. the curl lib that it uses), but on the outside, i.e. with the RSS clients that you pass the unescaped RSS-Bridge URL to.

In other words, we need to make sure that the URL generated by RSS-Bridge and returned to the user (opened in a new browser tab) after you click "Generate Feed" is properly formed.

dvikan commented 3 months ago

im confused now. can you give an example?

dvikan commented 2 months ago

for the record i did some changes related to this issue in https://github.com/RSS-Bridge/rss-bridge/commit/545dc969d35bc8c94a8c15875562690ee2fd6605 but they are a refactor (should not be externally visible changes)

dvikan commented 1 week ago

here is a URL (manually copied from firefox url bar).

its HTML has properly encoded URLs (as you requested)

it has always been like this as far as I can tell.

https://rss-bridge.org/bridge01/?action=display&bridge=FilterBridge&url=https%3A%2F%2Florem-rss.herokuapp.com%2Ffeed%3Funit%3Dday&filter=%C4%85%C5%9B%C4%87%C5%BC&filter_type=permit&target_title=on&length_limit=-1&format=Html
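
As a quick check, the URL above can be decoded with Python's standard library. This is only a sanity check on the string as pasted, not something RSS-Bridge does itself:

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://rss-bridge.org/bridge01/?action=display&bridge=FilterBridge"
       "&url=https%3A%2F%2Florem-rss.herokuapp.com%2Ffeed%3Funit%3Dday"
       "&filter=%C4%85%C5%9B%C4%87%C5%BC&filter_type=permit"
       "&target_title=on&length_limit=-1&format=Html")

params = parse_qs(urlsplit(url).query)
print(params["filter"][0])  # ąśćż
print(params["url"][0])     # https://lorem-rss.herokuapp.com/feed?unit=day
print(url.isascii())        # True
```

The `filter` value decodes back to the test string from the original report, and the URL itself contains ASCII only.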

pls give example of a non-encoded url being produced

@Mynacol pls give feedback on this issue.