njt1982 opened this issue 1 month ago
It will, per Meta's docs:
The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.
I'm ok blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic that you're seeing from that crawler.
Facebook sends that bot to fetch the Open Graph data for things like the title and image of the post.
Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously with a user action such as creating a post. Have you experimented by monitoring access to your website by such a crawler, then posting on Facebook, and checking whether a "crawl" occurs before the post is available?
I've noticed that blocking facebookexternalhit prevents rich links (cards) from displaying in Apple Messages and Apple Mail (iOS and macOS). The change takes effect immediately: blocking facebookexternalhit immediately prevents adding a rich link, and allowing it immediately permits adding one again.
facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?
Is this not it? https://github.com/ai-robots-txt/ai.robots.txt/blob/b1491d269460ca57581c2df7cf14b3f3fc4749f3/robots.txt#L36
(if not, what's that list for?)
Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.
It's probably worth alphabetising them; as the list grows, duplicates are more likely.
Could a GitHub pre-commit hook sort and de-duplicate the list? 🤷🏻‍♂️
EDIT: I realise this is massively off topic, though.
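The pre-commit idea could look something like this (a minimal sketch; `sort_user_agents` is a hypothetical helper, not part of this repo):

```python
# Sketch of a pre-commit step that keeps the User-agent entries of a
# robots.txt-style block list alphabetised and free of duplicates.
def sort_user_agents(lines):
    """Return the User-agent entries sorted case-insensitively, de-duplicated."""
    agents = [
        line.split(":", 1)[1].strip()
        for line in lines
        if line.lower().startswith("user-agent:")
    ]
    unique = sorted(set(agents), key=str.lower)
    return [f"User-agent: {a}" for a in unique]

example = [
    "User-agent: GPTBot",
    "User-agent: facebookexternalhit",
    "User-agent: GPTBot",   # duplicate, dropped
    "Disallow: /",          # not a User-agent line, ignored by this helper
]
print(sort_user_agents(example))
# → ['User-agent: facebookexternalhit', 'User-agent: GPTBot']
```

A real hook would need to keep each `User-agent` paired with its `Disallow` rules, but the sort/unique core is the same.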
Back to the thread point... isn't recommending blocking this bot a risky move, as it could cause websites to lose rich social-media embeds (e.g. images and other Open Graph data)?
I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't, either. I would personally vote to keep it in the list. The possible downsides seem negligible from my perspective, and any websites that really need not to block that user agent don't have to.
This particular one is tricky because it is a very aggressive crawler... but, unlike some of the others, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards.
Maybe the solution here is a comment above it describing what it does and what the risks are?
Some of these will have a much lower risk profile... but, on the flip side, entries like this one (and potentially the Google ones, too) might have unintended impacts on SEO and social exposure for people who simply copy-paste the list into their site to stop these bots from taking down the server.
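The annotated-entry idea could look like this (a sketch only; the comment wording is illustrative, and the `Disallow: /` rule follows the list's existing convention):

```
# facebookexternalhit: Meta's link-preview crawler. Blocking it can break
# rich link cards on Facebook/Instagram/Messenger (and reportedly Apple
# Messages/Mail), but it is known to crawl aggressively.
User-agent: facebookexternalhit
Disallow: /
```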
Maybe the solution here is a comment above it describing what it does and what the risks are?
I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary?
(I submitted a PR to extend the FAQ to take into account your "taking down the server" point - thanks!)
https://github.com/ai-robots-txt/ai.robots.txt/blob/6b8d7f5890d6bed722a95297996c054c210bd3b8/robots.txt#L33
If we block this in robots.txt, will it affect the functionality when URLs from the site are shared to Facebook and Facebook sends that bot to fetch the Open Graph data (e.g. title and image) for the post?
Ideally I'd like to block the bot from crawling / DoS'ing the site but still allow on-demand/cached page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)
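One way to get "allow on-demand OG fetches, stop sustained crawling" is to rate-limit the bot at the web server instead of blocking it in robots.txt. A rough nginx sketch (the zone name `fbcrawl` and the 1 req/s rate are illustrative assumptions, not recommendations):

```
# Key requests from facebookexternalhit by client IP; all other
# user agents get an empty key and are not rate-limited.
map $http_user_agent $fb_limit_key {
    default               "";
    ~*facebookexternalhit $binary_remote_addr;
}

limit_req_zone $fb_limit_key zone=fbcrawl:10m rate=1r/s;

server {
    location / {
        # A single share triggers only a few requests and passes through;
        # a full-site crawl gets throttled to the configured rate.
        limit_req zone=fbcrawl burst=5 nodelay;
        # ... normal handling ...
    }
}
```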