ai-robots-txt / ai.robots.txt

A list of AI agents and robots to block.
https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
MIT License

Order entries alphanumerically and case-insensitively #44

Open glyn opened 2 hours ago

glyn commented 2 hours ago
> Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

It's probably worth alphabetising them; as the list grows, duplicates are more likely.

Could be a GitHub pre-commit command that sorts / uniques the list? 🤷🏻‍♂️

EDIT: I realise this is massively off topic, though.

Back to the thread point... isn't recommending blocking this bot a risky move, as it could cause websites to lose rich social media embedding (e.g. image and other OpenGraph data)?

Originally posted by @njt1982 in https://github.com/ai-robots-txt/ai.robots.txt/issues/40#issuecomment-2417342818

glyn commented 2 hours ago

I think robots.json should be ordered alphanumerically by user agent, as should the list of user agents in robots.txt. This reduces the risk of duplication as the list grows, and it would also avoid slips like the one I made above (I didn't see an entry because I assumed the list was already in alphanumeric order).

A slight nit here is that the list may need something other than a straight alphanumeric ordering, so that uppercase and lowercase user agents are placed alongside each other.

For example, rather than the order:

...
FacebookBot
FriendlyCrawler
GPTBot
...
facebookexternalhit
...

it may be better to ignore case and order these as:

...
FacebookBot
facebookexternalhit
FriendlyCrawler
GPTBot
...
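
As a rough sketch of the ordering described above, the following Python snippet sorts a list of user agents while ignoring case and drops exact duplicates. The function name `sort_user_agents` and the sample list are hypothetical, for illustration only; nothing here reflects how the repository actually generates its files.

```python
def sort_user_agents(agents):
    """Return the unique user agents, ordered case-insensitively.

    Hypothetical helper for illustration; not part of the repository.
    """
    unique = set(agents)  # drop exact duplicates
    # Compare on the casefolded name so "facebookexternalhit" sorts next to
    # "FacebookBot"; fall back to the original string so the result is
    # deterministic if two names differ only by case.
    return sorted(unique, key=lambda name: (name.casefold(), name))


# Sample input: the order from the first example, plus one duplicate.
agents = ["FacebookBot", "FriendlyCrawler", "GPTBot", "facebookexternalhit", "GPTBot"]
print(sort_user_agents(agents))
# ['FacebookBot', 'facebookexternalhit', 'FriendlyCrawler', 'GPTBot']
```

A plain `sorted(agents)` would compare by code point, which puts every uppercase-initial name before any lowercase-initial one and reproduces the first ordering shown above.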

njt1982 commented 27 minutes ago

Agreed RE case insensitive sorting.