Open coreynorthcutt opened 3 hours ago
Sample code from ChatGPT:
```python
import ahocorasick


def create_bot_automaton(bot_list):
    # Create an Aho-Corasick automaton
    automaton = ahocorasick.Automaton()
    # Add each bot string from the list to the automaton
    for bot in bot_list:
        automaton.add_word(bot.lower(), bot)
    # Finalize the automaton for searching
    automaton.make_automaton()
    return automaton


def is_bot(user_agent, automaton):
    # Return True as soon as any bot pattern matches in the user-agent
    for _, _ in automaton.iter(user_agent.lower()):
        return True
    return False


# Example usage:
bot_list = ["googlebot", "bingbot", "baiduspider", "yandexbot"]
automaton = create_bot_automaton(bot_list)
user_agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_bot(user_agent, automaton))  # Output: True
```
When processing 20GB+ of logs, several of these scripts can take a long time: between 4 and 8 hours on a modern MacBook Pro.
It's been suggested that the Aho-Corasick algorithm (via pyahocorasick) could cut that time down by as much as 10x. I probably won't get around to trying it until the next time I need one of these scripts.
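
For whoever picks this up, here is a rough sketch of how the automaton might be wired into a line-by-line pass over a log file. It has not been tried against the real scripts. The names `build_matcher` and `count_bot_lines` and the sample log lines are made up for illustration, and the fallback to a plain substring scan when pyahocorasick isn't installed is my own assumption, not something the scripts do today.

```python
import io

try:
    import ahocorasick  # pyahocorasick; treated as optional in this sketch
except ImportError:
    ahocorasick = None


def build_matcher(bot_list):
    """Return a function that reports whether a line contains any bot pattern."""
    patterns = [bot.lower() for bot in bot_list if bot]

    if ahocorasick is None or not patterns:
        # Fallback (assumption): naive scan, one substring test per pattern.
        return lambda line: any(p in line.lower() for p in patterns)

    automaton = ahocorasick.Automaton()
    for pattern in patterns:
        automaton.add_word(pattern, pattern)
    automaton.make_automaton()

    def matches(line):
        # Stop at the first hit; we only need a yes/no answer.
        for _ in automaton.iter(line.lower()):
            return True
        return False

    return matches


def count_bot_lines(lines, matcher):
    """Stream over an iterable of lines (e.g. an open file) and count matches."""
    bots = total = 0
    for line in lines:
        total += 1
        if matcher(line):
            bots += 1
    return bots, total


# Hypothetical sample standing in for a real access log.
sample_log = io.StringIO(
    '1.2.3.4 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1)"\n'
    '5.6.7.8 - - "GET /a HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X)"\n'
    '9.9.9.9 - - "GET /b HTTP/1.1" 200 "Mozilla/5.0 (compatible; bingbot/2.0)"\n'
)
matcher = build_matcher(["googlebot", "bingbot", "baiduspider", "yandexbot"])
print(count_bot_lines(sample_log, matcher))  # (2, 3)
```

Building the matcher once and reusing it for every line matters here, since constructing the automaton per line would throw away any benefit. Passing an open file object as `lines` keeps memory flat on large logs. Whether this gets anywhere near 10x would need to be measured on the actual data.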