meme-lord closed this issue 4 years ago
This is an intentional design decision, and is working as intended. When Discord crawls a URL we perform that action as a bot. However, when we proxy images we are acting in response to a user loading that image. Because of this distinction we provide a user agent of a user, not a bot.
However, surely the user agent for the crawl should be updated from Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com) to Mozilla/5.0 (compatible; Discordbot/2.0; +https://discord.com)?
The user-agent is supposed to show what application is requesting the page. It has nothing to do with whether or not it is "acting in response to a user". The current user-agent does NOT accurately represent the application making the request, as I highly doubt your servers are running macOS.
Also, because the two user-agents are different, it causes bugs when generating previews: some websites serve an image to bots so they can create previews, but serve a webpage if a browser is making the request. So the crawler thinks there's an image, the follow-up request gets an HTML page instead, and the preview can't be generated from that HTML page.
Other services with similar preview functionality, such as Facebook, Twitter, and Telegram, send a "bot" user-agent rather than an arbitrary browser user-agent.
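To make that failure mode concrete, here is a minimal sketch of the kind of user-agent sniffing such websites do (hypothetical: Python standard library only; the bot-detection substrings, port, and photo.png file are made up, not taken from any real site). The crawl request matches the bot branch and gets an image, while the follow-up proxy request, arriving with a browser user-agent, gets HTML:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Discordbot" in ua or "facebookexternalhit" in ua:
            # Request looks like a crawler: serve the raw image so the bot
            # decides the link can be embedded as an image preview.
            self.send_response(200)
            self.send_header("Content-Type", "image/png")
            self.end_headers()
            with open("photo.png", "rb") as f:  # assumes photo.png exists
                self.wfile.write(f.read())
        else:
            # Request looks like a browser -- which includes Discord's image
            # proxy, since it sends a Firefox/38 user-agent -- so serve HTML.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body><img src='/photo.png'></body></html>")

# The crawl gets image/png and plans an image embed, but the follow-up proxy
# request gets text/html, so no preview image can be generated from it.
HTTPServer(("", 8000), Handler).serve_forever()
```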
This is indeed some strange reasoning, especially considering the initial metadata request does use a Discord user-agent.
Every other bot and application that I am aware of uses a clearly identifiable user-agent when requesting metadata and images.
Agreed. It's bad enough that, for my little "no cloud service needed" image pastebin tool for the privacy-conscious (UPnP and "what is my external IP address?" support planned before I make my first release), I feel the need to set up:

1. robots.txt (Which you ignore)
2. X-Robots-Tag, for a second chance at getting bots to understand that they're not welcome if they go after a direct image link.
3. <meta name="robots">, for bots written by people who don't know about X-Robots-Tag.
4. A blacklist for Discord, Slack, and various other bots that have either decided "I'm not a crawler, so I'll ignore robots.txt" or don't clearly say whether they're supposed to obey it.
5. A --random-auth option, which switches on HTTP Basic authentication and prints a ready-to-share http://user:pass@ip/ URL to the console, so that anything not on the blacklist has one final chance to run into the oEmbed-specified "401 Unauthorized means 'private (non-public) resource. Do not embed'".

EDIT: To clarify, that last one currently works because, as far as I can tell, at least when running in-browser, Discord will display the URL to the user with the user:pass portion stripped and attempt to retrieve it without HTTP Basic auth, but still include the user:pass portion in the href attribute... which makes --random-auth a great Just Works™ concept for sharing a folder of images: even if the URL stays in the chatlog permanently and the ip:port portion remains long-term stable (e.g. thanks to IPv6), the user:pass portion becomes invalid as soon as that session ends, similar to those Reddit userscripts that overwrite the contents of posts on the theory that Reddit stores the last contents of deleted posts but not revision histories for them.
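For readers unfamiliar with the last two layers described above, here is a hypothetical sketch of the idea (Python standard library only; the credentials, port, and file name are stand-ins, and this is not the commenter's actual tool): throwaway HTTP Basic credentials plus an X-Robots-Tag header, so any fetch that drops the user:pass portion runs into the oEmbed-specified 401 behaviour.

```python
import base64
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway credentials regenerated each run -- the idea behind --random-auth.
USER, PASSWORD = "share", secrets.token_urlsafe(8)
EXPECTED = "Basic " + base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # Per oEmbed, 401 Unauthorized means "private resource, do not embed".
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        # Second chance for well-behaved bots that fetch direct image links.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        with open("photo.png", "rb") as f:  # stand-in for the shared image
            self.wfile.write(f.read())

print(f"Share this URL: http://{USER}:{PASSWORD}@YOUR_IP:8000/photo.png")
HTTPServer(("", 8000), Handler).serve_forever()
```

Once the process exits, the printed user:pass pair is gone for good, so old copies of the URL stop working even if the host and port stay the same.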
Steps to Reproduce: Post a link to a page on a server you control in a Discord channel and inspect the incoming requests.
Expected Result: All the requests made to generate a preview use the Discordbot user-agent.
Actual Result: Only the first request uses the Discordbot user-agent, Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com); the requests after that use Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0, which seems very arbitrary. The second user-agent should either be the same or include Discordbot in it, so that you can tell it's Discord and not a web browser.
I also think it shouldn't need more than one request to generate a preview; a lot of bandwidth is being wasted 👀
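For anyone who wants to observe the reported behaviour from the server side, here is a minimal sketch (hypothetical; Python standard library only, with the port and file name made up): post a link to this server in a Discord channel and compare the logged User-Agent values of the initial crawl against the follow-up image fetch.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b'<html><head><meta property="og:image" content="/photo.png"></head></html>'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Print the User-Agent of every request so the initial crawl and the
        # follow-up image fetch can be compared side by side.
        print(self.path, "->", self.headers.get("User-Agent"))
        if self.path == "/photo.png":
            self.send_response(200)
            self.send_header("Content-Type", "image/png")
            self.end_headers()
            self.wfile.write(bytes.fromhex("89504e470d0a1a0a"))  # PNG magic only, not a real image
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(PAGE)

HTTPServer(("", 8000), Handler).serve_forever()
```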