livecheck: add user_agent

razvanazamfirei commented 1 year ago

Verification

[X] This issue's title and/or description do not reference a single formula e.g. brew install wget. If they do, open an issue at https://github.com/Homebrew/homebrew-core/issues/new/choose instead.

Provide a detailed description of the proposed feature

Allow for user_agent to be specified for livecheck curl requests. In essence, it would reuse the existing implementation from the url stanza and apply it to the livecheck url if needed.

What is the motivation for the feature?

Increasing number of casks are failing livecheck audits due to CDNs blocking the requests. This would add the flexibility to use user_agent if needed. See https://github.com/Homebrew/homebrew-cask/pull/146350 https://github.com/Homebrew/homebrew-cask/pull/146223 https://github.com/Homebrew/homebrew-cask/pull/146330 for recent failures.

How will the feature be relevant to at least 90% of Homebrew users?

Maintainers will spend less time troubleshooting livechecks which will increase the time they spend on other user-facing projects.

What alternatives to the feature have been considered?

Not implementing it and just skipping CI when the issue comes up.

reitermarkus commented 1 year ago

I think this is likely due to https://github.com/Homebrew/brew/pull/15099 still being open.

reitermarkus commented 1 year ago

Nevermind, these are using :page_match, so doesn't seem related to the HEAD request issue.

Increasing number of casks are failing livecheck audits due to CDNs blocking the requests.

Does using a different user agent actually fix that though? Livecheck already uses two different user agents (Homebrew and fake Safari).

samford commented 1 year ago

This is a planned feature that I already have implemented locally but it depends on the "options" feature and I need to create a PR for that first (that work is also done but there may be some discussion about the shape of it). I'm working my way towards those PRs but I'm currently working on the GitHub strategy PRs and I also need to wrap up that HeaderMatch PR that Markus mentioned.

daeho-ro commented 3 months ago

When I try to add the livecheck url for the cask hookmark, it seems that the livecheck command is working good. But the audit is failure, since the livecheck URL is not reachable by error code 403.

After digging an hour, finally noticed that the reason why those two commands are working in differ is the existence of the user-agent header. https://github.com/Homebrew/brew/blob/master/Library/Homebrew/cask/audit.rb#L810

Is this feature development ongoing? or should I drop my changes?

daeho-ro commented 3 months ago

Oh... two of the examples are hookmark and this is the problematic cask! I visited again this package haha.

samford commented 3 months ago

Is this feature development ongoing? or should I drop my changes?

@daeho-ro Oddly enough, I was working on this again yesterday. I had a working implementation when I last left it but I'm reworking the changes to the livecheck block DSL before creating a PR. Depending on how I approach it, I may have to make some changes to Utils::Curl beforehand but I'm working on getting user agent support done soon (while also laying the groundwork for other needed curl-related features).

MikeMcQuaid commented 3 months ago

Allow for user_agent to be specified for livecheck curl requests. Livecheck already uses two different user agents (Homebrew and fake Safari).

The more I think about this: the more I think we shouldn't be changing/faking our user-agents at all, let alone let them be customised per-formula.

Ultimately our user agents being blocked is a response by the upstream project to our requests. We should respect that and do something different rather than just workaround it.

samford commented 1 month ago

The more I think about this: the more I think we shouldn't be changing/faking our user-agents at all, let alone let them be customised per-formula.

Ultimately our user agents being blocked is a response by the upstream project to our requests. We should respect that and do something different rather than just workaround it.

I agree that we shouldn't use a different user agent if a server specifically blocks requests because HOMEBREW_USER_AGENT_CURL includes "Homebrew", as that's a clear signal about upstream's wishes. However, if we're simply blocked because our user agent isn't something common/typical (e.g., a browser) or shares some characteristics with bots (e.g., including "curl" text), that may not be as strong of a signal in terms of upstream's intentions.

This only affects casks at the moment, likely because commercial/closed-source software websites can be more aggressive about blocking atypical requests. We've had some livecheck blocks that work locally but fail on CI because the combination of the server's IP address and a non-browser user agent is enough to trigger server-side anti-bot measures (this can also happen locally if you use a VPN). In some of those cases, using a :browser user agent allows the check to work fine on CI.

livecheck has had logic to fall back to the :browser user agent if :default fails since mid-2021, so specifying the user agent in a livecheck block is simply about efficiency (avoiding unnecessary requests that we know will fail) and clarity. [My plan has always been to remove the automatic user agent retry logic in the Livecheck::Strategy methods once we can specify a user agent in livecheck blocks, as the former was just a stop-gap measure until the latter is available.]

In my current implementation, user agent support is simply part of being able to specify #curl_args arguments in a livecheck block, as that would allow us to handle other weird situations (though if that's too much, I could take a more limited approach). If we're against swapping user agents in general, we would have to remove existing user agent-switching logic throughout brew, remove packages as soon as they're unreachable with our user agent (there are currently 40 casks like this), and disallow new packages based on this criterion. As mentioned, some requests work locally but fail on CI, so it seems like a shame to drop working packages for that reason.

365d install counts for existing casks specifying a user_agent for the cask url are as follows:

adobe-acrobat-pro: 2479 adrive: 2943 attachecase: 89 chalk: 109 clickcharts: 37 coinomi-wallet: 53 dehelper: 57 disk-inventory-x: 3230 elgato-wave-link: 383 eudic: 1143 font-andagii: 18 font-constructium: 16 font-fairfax: 14 frhelper: 33 hookmark: 280 hopper-disassembler: 36 ithoughtsx: 85 jt-bridge: 8 jump: 228 keycue: 214 latexit: 1837 ludwig: 28 neteasemusic: 2649 outguess: 395 pages-data-merge: 55 paraview: 1854 photozoom-pro: 10 pikopixel: 52 popchar: 54 subsurface: 90 tableau: 1169 tableau-prep: 142 tableau-public: 265 tableau-reader: 52 todesk: 2029 typinator: 172 unite-phone: 5 yandex: 1483 youdaonote: 281

MikeMcQuaid commented 1 month ago

However, if we're simply blocked because our user agent isn't something common/typical (e.g., a browser) or shares some characteristics with bots (e.g., including "curl" text), that may not be as strong of a signal in terms of upstream's intentions.

Agreed 👍🏻

remove packages as soon as they're unreachable with our user agent (there are currently 40 casks like this), and disallow new packages based on this criterion

I'd agree with this 👍🏻

As mentioned, some requests work locally but fail on CI, so it seems like a shame to drop working packages for that reason.

I could see us allowing these to remain if so.

Homebrew / brew