ai-robots-txt / ai.robots.txt

A list of AI agents and robots to block.
MIT License

Does blocking facebookexternalhit also break sharing to social media? #40

Open njt1982 opened 1 month ago

njt1982 commented 1 month ago

https://github.com/ai-robots-txt/ai.robots.txt/blob/6b8d7f5890d6bed722a95297996c054c210bd3b8/robots.txt#L33

If we block this in robots.txt, will it break what happens when URLs from the site are shared on Facebook, where Facebook sends that bot to fetch the Open Graph data (title, image, and so on) for the post?

Ideally I'd like to block the bot from crawling / DoS'ing the site but still allow on-demand/cached page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)

cdransf commented 1 month ago

It will, per Meta's docs:

> The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.

I'm ok blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic that you're seeing from that crawler.
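
To make the effect concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt fragment and the `example.com` URL are illustrative assumptions, not the repository's actual file, and the result only shows what a crawler that honours robots.txt would conclude; it says nothing about how Meta's fetcher actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fragment in the style of a block list (not the real file).
ROBOTS_TXT = """\
User-agent: facebookexternalhit
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant client may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    for agent in ("facebookexternalhit", "GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Under these rules a compliant fetch of the shared page by `facebookexternalhit` is refused, which is the same request that would supply the Open Graph data.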

glyn commented 1 month ago

> Facebook sends that bot to get the Open Graph data for things like title and image for the post

Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously to a user action such as creating a post. Have you experimented by monitoring access of your website by such a crawler and then posting on Facebook and seeing if a "crawl" occurs before the post is available?
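
One way to run that experiment is to filter the web server's access log for the crawler's user agent and compare the timestamps with the moment the link was posted. The sketch below assumes the common "combined" log format used by Apache and nginx; the `crawler_hits` helper and the sample log lines are hypothetical.

```python
import re
from datetime import datetime

# Assumes the "combined" access log format; adjust for other formats.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines, needle="facebookexternalhit"):
    """Return (timestamp, path) for each request whose user agent contains `needle`."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle.lower() in match.group("agent").lower():
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((when, match.group("path")))
    return hits


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [16/Oct/2024:11:30:02 +0000] "GET /post/1 HTTP/1.1" 200 5120 '
    '"-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
    '198.51.100.7 - - [16/Oct/2024:11:30:05 +0000] "GET /post/1 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for when, path in crawler_hits(SAMPLE_LOG):
        print(when.isoformat(), path)
```

If a hit for the shared URL appears within seconds of posting, the fetch is on demand rather than part of a scheduled crawl.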

paulrudy commented 1 month ago

I've noticed that blocking facebookexternalhit prevents rich links (cards) from displaying in Apple Messages and Apple Mail (iOS and macOS). The change takes effect immediately: blocking facebookexternalhit immediately prevents adding a rich link, and allowing it immediately permits adding one.

glyn commented 1 month ago

facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

njt1982 commented 1 month ago

> facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

Is this not it? https://github.com/ai-robots-txt/ai.robots.txt/blob/b1491d269460ca57581c2df7cf14b3f3fc4749f3/robots.txt#L36

(if not, what's that list for?)

glyn commented 1 month ago

Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

njt1982 commented 1 month ago

> Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

It's probably worth alphabetising them; as the list grows, duplicates are more likely.

Perhaps a GitHub pre-commit check could sort and de-duplicate the list? 🤷🏻‍♂️
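
A minimal sketch of such a sort-and-de-duplicate step, assuming the agent names have already been read into a list; `sort_unique_agents` is a hypothetical helper and not part of the repository:

```python
def sort_unique_agents(agents):
    """Return agent names sorted case-insensitively with duplicates removed.

    Names that differ only in case or surrounding whitespace count as
    duplicates; the first spelling seen is the one kept.
    """
    seen = {}
    for name in agents:
        name = name.strip()
        if name:
            seen.setdefault(name.casefold(), name)
    return sorted(seen.values(), key=str.casefold)


if __name__ == "__main__":
    sample = ["GPTBot", "facebookexternalhit", "CCBot", "gptbot", " CCBot "]
    print(sort_unique_agents(sample))
```

Comparing case-insensitively is a design choice here: it catches duplicates such as `GPTBot` and `gptbot`, which a plain `sorted(set(...))` would keep as two entries.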

EDIT: I realise this is massively off topic, though.

Back to the thread point... isn't recommending blocking this bot a risky move, as it could cause websites to lose rich social media embedding (e.g. image and other Open Graph data)?

glyn commented 1 month ago

I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't, either. I would personally vote to keep this in the list. The possible downsides seem negligible from my perspective. Any websites that really need to allow that user agent don't have to block it.

njt1982 commented 1 month ago

This particular one is a tricky one as it is a very aggressive crawler... but, unlike some of them, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards.

Maybe the solution here is a comment above it describing what it does and what the risks are?

Some of these will have a much lower risk profile... But on the flip side, items like this (and potentially the Google ones, too) might have an unintended impact on site SEO and social exposure for those simply copy-pasting a list into their site to try to stop these bots from taking down the server.
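
As a rough illustration of the comment idea, the sketch below renders a robots.txt block in which risky entries carry a `#` comment line above them. The `render_robots` helper, the agent selection, and the wording of the note are assumptions for illustration; the note only paraphrases the risk described in this thread.

```python
def render_robots(agents):
    """Render a robots.txt block from a mapping of agent name -> optional risk note.

    Entries with a note get a '#' comment line directly above their
    User-agent line so that people copy-pasting the list can see the trade-off.
    """
    lines = []
    for name, note in agents.items():
        if note:
            lines.append(f"# {name}: {note}")
        lines.append(f"User-agent: {name}")
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# Illustrative entries only; the note text is an assumption, not project wording.
EXAMPLE_AGENTS = {
    "CCBot": None,
    "facebookexternalhit": "also fetches Open Graph data; blocking may remove link preview cards",
    "GPTBot": None,
}

if __name__ == "__main__":
    print(render_robots(EXAMPLE_AGENTS), end="")
```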

glyn commented 1 month ago

> Maybe the solution here is a comment above it describing what it does and what the risks are?

I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary?

(I submitted a PR to extend the FAQ to take into account your "taking down the server" point - thanks!)