meme-lord closed this issue 4 years ago
This is an intentional design decision, and is working as intended. When Discord crawls a URL we perform that action as a bot. However, when we proxy images we are acting in response to a user loading that image. Because of this distinction we provide a user agent of a user, not a bot.
However, surely the user agent for the crawl should be updated from Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com) to Mozilla/5.0 (compatible; Discordbot/2.0; +https://discord.com)?
The user-agent is supposed to show what application is requesting the page. It has nothing to do with whether or not it is "acting in response to a user". The current user-agent does NOT accurately represent the application making the request, as I highly doubt your servers are running macOS.
Also, because the two user-agents are different, it causes bugs when generating previews: some websites serve an image to bots so they can create previews, but serve a webpage if a browser is making the request. So the crawler thinks there's an image, the follow-up request gets an HTML page instead, and the preview can't be generated from that HTML page.
Other services with similar preview functionality, such as Facebook, Twitter, and Telegram, send a "bot" user-agent rather than an arbitrary browser user-agent.
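To make that failure mode concrete, here is a minimal sketch of the kind of user-agent sniffing such websites do (hypothetical: Python standard library only; the bot-detection substrings, port, and photo.png file are made up, not taken from any real site). The crawl request matches the bot branch and gets an image, while the follow-up proxy request, arriving with a browser user-agent, gets HTML:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Discordbot" in ua or "facebookexternalhit" in ua:
            # Request looks like a crawler: serve the raw image so the bot
            # decides the link can be embedded as an image preview.
            self.send_response(200)
            self.send_header("Content-Type", "image/png")
            self.end_headers()
            with open("photo.png", "rb") as f:  # assumes photo.png exists
                self.wfile.write(f.read())
        else:
            # Request looks like a browser -- which includes Discord's image
            # proxy, since it sends a Firefox/38 user-agent -- so serve HTML.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body><img src='/photo.png'></body></html>")

# The crawl gets image/png and plans an image embed, but the follow-up proxy
# request gets text/html, so no preview image can be generated from it.
HTTPServer(("", 8000), Handler).serve_forever()
```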
This is indeed some strange reasoning, especially considering the initial metadata request does use a Discord user-agent.
Every other bot and application that I am aware of uses a clearly identifiable user-agent when requesting metadata and images.
Agreed. It's bad enough that, for my little "no cloud service needed" image pastebin tool for the privacy-conscious (UPnP and "what is my external IP address?" support planned before I make my first release), I feel the need to set up:

1. robots.txt (Which you ignore)
2. X-Robots-Tag, for a second chance at getting bots to understand that they're not welcome if they go after a direct image link.
3. <meta name="robots">, for bots written by people who don't know about X-Robots-Tag.
4. A blacklist for Discord, Slack, and various other bots that have either decided "I'm not a crawler, so I'll ignore robots.txt" or don't clearly say whether they're supposed to obey it.
5. A --random-auth option, which switches on HTTP Basic authentication and prints a ready-to-share http://user:pass@ip/ URL to the console, so that anything not on the blacklist has one final chance to run into the oEmbed-specified "401 Unauthorized means 'private (non-public) resource. Do not embed'".

EDIT: To clarify, that last one currently works because, as far as I can tell, at least when running in-browser, Discord will display the URL to the user with the user:pass portion stripped and attempt to retrieve it without HTTP Basic auth, but still include the user:pass portion in the href attribute... which makes --random-auth a great Just Works™ concept for sharing a folder of images: even if the URL stays in the chatlog permanently and the ip:port portion remains long-term stable (e.g. thanks to IPv6), the user:pass portion becomes invalid as soon as that session ends, similar to those Reddit userscripts that overwrite the contents of posts on the theory that Reddit stores the last contents of deleted posts but not revision histories for them.
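For readers unfamiliar with the last two layers described above, here is a hypothetical sketch of the idea (Python standard library only; the credentials, port, and file name are stand-ins, and this is not the commenter's actual tool): throwaway HTTP Basic credentials plus an X-Robots-Tag header, so any fetch that drops the user:pass portion runs into the oEmbed-specified 401 behaviour.

```python
import base64
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway credentials regenerated each run -- the idea behind --random-auth.
USER, PASSWORD = "share", secrets.token_urlsafe(8)
EXPECTED = "Basic " + base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # Per oEmbed, 401 Unauthorized means "private resource, do not embed".
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        # Second chance for well-behaved bots that fetch direct image links.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        with open("photo.png", "rb") as f:  # stand-in for the shared image
            self.wfile.write(f.read())

print(f"Share this URL: http://{USER}:{PASSWORD}@YOUR_IP:8000/photo.png")
HTTPServer(("", 8000), Handler).serve_forever()
```

Once the process exits, the printed user:pass pair is gone for good, so old copies of the URL stop working even if the host and port stay the same.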
Steps to Reproduce: Post a link to a page on a server you control in a Discord channel and inspect the incoming requests.
Expected Result: All the requests made to generate a preview use the Discordbot user-agent.
Actual Result: Only the first request uses the Discordbot user-agent, Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com); the requests after that use Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0, which seems very arbitrary. The second user-agent should either be the same or include Discordbot in it, so that you can tell it's Discord and not a web browser.
I also think it shouldn't need more than one request to generate a preview; a lot of bandwidth is being wasted 👀
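For anyone who wants to observe the reported behaviour from the server side, here is a minimal sketch (hypothetical; Python standard library only, with the port and file name made up): post a link to this server in a Discord channel and compare the logged User-Agent values of the initial crawl against the follow-up image fetch.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b'<html><head><meta property="og:image" content="/photo.png"></head></html>'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Print the User-Agent of every request so the initial crawl and the
        # follow-up image fetch can be compared side by side.
        print(self.path, "->", self.headers.get("User-Agent"))
        if self.path == "/photo.png":
            self.send_response(200)
            self.send_header("Content-Type", "image/png")
            self.end_headers()
            self.wfile.write(bytes.fromhex("89504e470d0a1a0a"))  # PNG magic only, not a real image
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(PAGE)

HTTPServer(("", 8000), Handler).serve_forever()
```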