Orbit-Media-Studios / wo-scripts

A collection of utility scripts used by Orbit Media's Website Optimization team.

performance improvement opportunity #1

Open coreynorthcutt opened 3 hours ago

coreynorthcutt commented 3 hours ago

When processing 20GB+ of logs, several of these scripts can take a long time: between 4 and 8 hours on a modern MacBook Pro.

It's been suggested that the Aho-Corasick algorithm (via the pyahocorasick library), which checks a string against every pattern in a single pass, could cut that time down by as much as 10x. I probably won't get around to trying it until the next time I need one of these scripts.
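For anyone unfamiliar with the algorithm, here is a minimal pure-Python sketch of what an Aho-Corasick automaton does: build a trie of all patterns, add failure links, then scan the text once. It is for illustration only and is not how pyahocorasick is implemented (that library is a C extension and will be far faster). The names `build_automaton` and `find_matches` are made up for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build a tiny Aho-Corasick automaton (illustrative, not optimized)."""
    goto = [{}]  # goto[node][char] -> child node
    fail = [0]   # fail[node] -> node to fall back to on a mismatch
    out = [[]]   # out[node] -> patterns that end at this node

    # Phase 1: insert every non-empty pattern into a trie.
    for pattern in patterns:
        if not pattern:
            continue
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 nodes fall back to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the fallback node (suffix matches).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_matches(text, automaton):
    """Yield (start_index, pattern) for every match, reading text once."""
    goto, fail, out = automaton
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            yield i - len(pattern) + 1, pattern


automaton = build_automaton(["googlebot", "bingbot", "bot"])
ua = "Mozilla/5.0 (compatible; Googlebot/2.1)".lower()
print(list(find_matches(ua, automaton)))
```

The point is that the scan cost depends on the length of the user-agent string plus the number of matches, not on how many bot names are in the list.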

coreynorthcutt commented 3 hours ago

Sample code from ChatGPT:


```python
import ahocorasick

def create_bot_automaton(bot_list):
    # Create an Aho-Corasick automaton
    automaton = ahocorasick.Automaton()

    # Add each bot string from the list to the automaton.
    # Keys are lowercased so matching is case-insensitive.
    for bot in bot_list:
        automaton.add_word(bot.lower(), bot)

    # Finalize the automaton for searching
    automaton.make_automaton()

    return automaton

def is_bot(user_agent, automaton):
    # Return True as soon as any bot pattern matches in the user-agent
    for _ in automaton.iter(user_agent.lower()):
        return True
    return False

# Example usage:
bot_list = ["googlebot", "bingbot", "baiduspider", "yandexbot"]
automaton = create_bot_automaton(bot_list)

user_agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_bot(user_agent, automaton))  # Output: True
```
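Before swapping anything in, it would be worth measuring whether the 10x figure holds for our data. Below is a rough timing sketch using only the standard library. The `naive_is_bot` baseline is an assumption about how matching might be done today (one substring test per bot name), not a copy of what the scripts actually do, and `time_matcher` is a hypothetical helper.

```python
import time


def naive_is_bot(user_agent, bot_list):
    # Assumed baseline: one substring test per bot name.
    ua = user_agent.lower()
    return any(bot in ua for bot in bot_list)


def time_matcher(matcher, user_agents):
    """Return (elapsed_seconds, bot_count) for one pass over user_agents."""
    start = time.perf_counter()
    hits = sum(1 for ua in user_agents if matcher(ua))
    return time.perf_counter() - start, hits


bot_list = ["googlebot", "bingbot", "baiduspider", "yandexbot"]

# Small synthetic sample; real user agents from the logs would be better.
sample = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
] * 1000

elapsed, hits = time_matcher(lambda ua: naive_is_bot(ua, bot_list), sample)
print(f"naive: {hits} bot hits in {elapsed:.4f}s")
```

To time the automaton version, pass `lambda ua: is_bot(ua, automaton)` as the matcher with the same sample. With only four bot names the gap may be small; the benefit should grow with the length of the bot list.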