eafer / rdrview

Firefox Reader View as a command line tool
Apache License 2.0

Add support for custom user-agent string #9

Closed: sanel closed this 3 years ago

sanel commented 3 years ago

By default, rdrview uses its own user-agent, which can't be changed. However, some sites will refuse access if the user-agent is not a "usual" one, such as Firefox or Chromium.

Also, some sites alter (or even simplify) the content depending on the user-agent, which can make it easier for rdrview to render.

sanel commented 3 years ago

This PR addresses, in some way, https://github.com/eafer/rdrview/pull/7

I added rdrview as a rendering engine in declutter and, from previous experience, there are some sites that will not allow access unless the user-agent is a "known" one. Also, proxies like Cloudflare (which many sites use these days) can become very suspicious of unusual user-agent strings.

eafer commented 3 years ago

Thanks for the patch. I'm not crazy about the user agent option because I don't want to start bloating the interface. Once I add this, other people could reasonably request support for session cookies, for example. I think it would be best if those people just used curl or wget, and piped the output to rdrview.
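The fetch-it-yourself-and-pipe approach mentioned above could look roughly like the following in a script. This is only a sketch under assumptions: the `BROWSER_UA` value and the helper names `build_request` and `render` are made up for illustration, and the rdrview flags used (`-H` to print the article HTML, `-u` to set the base URL, reading the document from stdin when no path is given) should be checked against `man rdrview` for your version.

```python
import subprocess
import urllib.request

# Hypothetical example value; any browser-like string could be used here.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def render(url, user_agent=BROWSER_UA, reader_cmd=("rdrview", "-H")):
    """Fetch `url` ourselves, then pipe the raw HTML to the reader on stdin."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=30) as resp:
        html = resp.read()
    # ASSUMPTION: rdrview reads from stdin when no path is given, and
    # "-u" tells it the base URL so relative links can be resolved.
    result = subprocess.run(
        [*reader_cmd, "-u", url],
        input=html,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `render("https://example.com/")` would need network access and rdrview on the `PATH`. Session cookies or other headers could be added in `build_request` without touching rdrview itself, which is the point of keeping the download step separate.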

That said, this feature has been requested 5 times already, and you seem to have done all the work, so I guess I'll just give in and pick it up.

> there are some sites that will not allow access unless the user-agent is a "known" one. Also, proxies like Cloudflare (which many sites use these days) can become very suspicious of unusual user-agent strings.

Can you share any urls that are giving you trouble?

sanel commented 3 years ago

> Thanks for the patch. I'm not crazy about the user agent option because I don't want to start bloating the interface. Once I add this, other people could reasonably request support for session cookies, for example. I think it would be best if those people just used curl or wget, and piped the output to rdrview.

Makes sense. IMHO, adding cookie support would move rdrview into the browser realm :)

> That said, this feature has been requested 5 times already, and you seem to have done all the work, so I guess I'll just give in and pick it up.

:+1:

> there are some sites that will not allow access unless the user-agent is a "known" one. Also, proxies like Cloudflare (which many sites use these days) can become very suspicious of unusual user-agent strings.
>
> Can you share any urls that are giving you trouble?

https://datanami.com will reject anything that doesn't look like a usual browser, even curl. Luckily, it doesn't complain about rdrview, at least for now.

However, finance.yahoo.com will emit different HTML depending on the user-agent. If you open the page [1] with stock rdrview, you won't get the actual article intro with "Continue reading" (as you would with e.g. Chromium) but something unrelated. If you change the user-agent to e.g. Safari on iOS 3 [2], it works.

[1] https://finance.yahoo.com/m/cd129a99-625f-32df-9f4b-fdfd45e8705f/fastest-growing-stocks-.html
[2] https://developers.whatismybrowser.com/useragents/parse/98042-safari-ios-iphone-webkit

eafer commented 3 years ago

> However, finance.yahoo.com will emit different HTML depending on the user-agent. If you open the page [1] with stock rdrview, you won't get the actual article intro with "Continue reading" (as you would with e.g. Chromium) but something unrelated. If you change the user-agent to e.g. Safari on iOS 3 [2], it works.
>
> [1] https://finance.yahoo.com/m/cd129a99-625f-32df-9f4b-fdfd45e8705f/fastest-growing-stocks-.html
> [2] https://developers.whatismybrowser.com/useragents/parse/98042-safari-ios-iphone-webkit

Mmhhh, that's not very convincing. It seems that the only difference between the document you get with the regular user agent and the one you get when pretending to be Safari is that the iOS version doesn't include excerpts of other articles. Since the main article has almost no content, the excerpts are enough to confuse the readability algorithm.

So the user agent switch has an effect in this case, but it's just a coincidence, and it won't get you any real content.

I intend to pick this up anyway, at least as a workaround in case the default user agent starts getting rejected.

eafer commented 3 years ago

Your patch is applied now. Thanks again!