srschaecher commented 10 months ago

In the BD_PAX_Miner.py script: Replace the user_id with the user_name in the actual backblast text. When Slack sends the full message via the API, any pax names tagged in the PAX: line (and anywhere else in the backblast text) are sent as a user id like instead of a human readable name. This results in an ugly backblast text that gets shown via the backblast view (a lot of regions are pulling from this view and showing the backblasts in their websites). Wanted to investigate whether the view could dynamically replace the user_id's with user_names, but I suspect that will have a big performance hit having to replace the values every time the view is accessed. It is probably much more efficient to just replace the text in the stored backblast upon initial processing and import.

farrellw commented 9 months ago

I agree that the dynamically replacing the user_id with the name would be unnecessarily expensive operation and could even timeout on the large beatdowns.

I feel like the second option makes sense to format it on the fly, however I see where we'd want to keep the raw backblast information for re-processing ( e.g. disaster recovery ).

What about having multiple columns? One for human readable version and one for the raw backblast.

We'd be duplicating the biggest column ( maybe 80-90% larger table size ).
IO would be more expensive on inserts, but retrievals from the database would rarely return both the raw backblast and human readable backblast.
If we are worried about insert size, then the same concept could happen with a column but maybe a time based SQL function that creates the human readable backblast from the raw backblast.

The backblast view then selects coalesce(human_readable, original_backblast). And then we could add a ticket to run that process over the existing tables and backfill the human readable.

evanpetzoldt commented 9 months ago

This code is almost instantaneous. I ran it on 10,000 backblasts and it took about a second:

users_dict = df_users[["user_id", "user_name"]].set_index("user_id").to_dict()["user_name"]

def parse_users(backblast: str, users_dict) -> str:
    """Replaces user IDs in backblast with user names

    Args:
        backblast (str): backblast text
        users_dict (Dict[str, str]): user ID to user name mapping

    Returns:
        str: formatted backblast
    """

    # Build user ID list
    pattern = "<@([A-Za-z0-9-_']+)>"
    user_ids = re.findall(pattern, backblast)

    # Replace user IDs with user names
    user_names = [users_dict.get(x, x) for x in user_ids]
    user_names = [x.replace("{", "").replace("}", "") for x in user_names]
    backblast = backblast.replace("{", "").replace("}", "")
    backblast2 = re.sub(pattern, "{}", backblast).format(*user_names)
    return backblast2

 df["backblast2"] = df["backblast"].apply(parse_users, args=(users_dict,))

evanpetzoldt commented 9 months ago

Part of me leans towards leaving the existing column as is (I think there are some regions that are parsing it for different things) and adding a column for the formatted backblast text. To @farrellw 's point it would pretty much double the space taken, but I'm not sure if that's a concern or not.

evanpetzoldt commented 9 months ago

Oh, and I forgot... we probably would get a request to do the same thing to the AO / channel tags. I added that to my code:


users_dict = df_users[["user_id", "user_name"]].set_index("user_id").to_dict()["user_name"]
aos_dict = df_aos[["channel_id", "ao"]].set_index("channel_id").to_dict()["ao"]

def parse_backblast(backblast: str, users_dict, aos_dict) -> str:
    """Replaces user and channel IDs in backblast with readable names

    Args:
        backblast (str): backblast text
        users_dict (Dict[str, str]): user ID to user name mapping
        aos_dict (Dict[str, str]): channel ID to ao name mapping

    Returns:
        str: formatted backblast
    """

    # Build user ID list
    pattern = "<@([A-Za-z0-9-_']+)>"
    user_ids = re.findall(pattern, backblast)

    # Replace user IDs with user names
    user_names = [users_dict.get(x, x) for x in user_ids]
    user_names = [x.replace("{", "").replace("}", "") for x in user_names]
    backblast = backblast.replace("{", "").replace("}", "")
    backblast2 = re.sub(pattern, "{}", backblast).format(*user_names)

    # remove channel id tags and <> from channel names
    # ex: <#C01HU2D6S77|ao-alamo-cranks-comz> -> ao-alamo-cranks-comz
    pattern = "<#([A-Za-z0-9-_'|]+)>"
    channel_ids = re.findall(pattern, backblast2)
    if len(channel_ids) > 0:
        try:
            channel_names = [x.split("|")[1].replace(">", "") for x in channel_ids]
        except IndexError:
            channel_names = [aos_dict.get(x, x) for x in channel_ids]
            channel_names = [x.replace("<", "").replace(">", "") for x in channel_names]
        backblast2 = re.sub(pattern, "{}", backblast2).format(*channel_names)

    return backblast2

df["backblast2"] = df["backblast"].apply(parse_backblast, args=(users_dict, aos_dict,))

farrellw commented 9 months ago

@evanpetzoldt When I mentioned it would be expensive and could potentially timeout, I was referring to including it on the attendance_view.

To my knowledge, MySQL doesn’t have a ‘regex_extract_all’ type function. It allows only a single regex extract at a time. So it would be a funky loop of regex and holding the pointer in memory or starting from the beginning That + running on the MySQL server rather than running on slack’s server.

Do websites query the DB directly, or through an API? If through an API, I’d say the API should do this replacing on the fly. If querying the DB directly, then I also like the idea of adding a column more than any other option. The largest negative is that the PAX names and AO names are locked in time at the moment the human readable form was created. The channel changes in slack, but the human readable form doesn’t change. To mitigate we could update the table once a period ( month? ), re-running this for all backblasts.

I agree it’s doable in Python with the code you provided, and a nice touch to replace the AO as well.

evanpetzoldt commented 9 months ago

Gotcha. Yeah, I've messed around with trying to do this in a SQL view, and it's messy and extremely slow like you say. Regions do pull straight from the db, so trying to work it into a SQL view doesn't seem like a good idea even if we could.

I agree on creating an extra column. Honestly, I don't think the "locked in" aspect would be a problem 99.9% of the time.

mdsimoneau commented 9 months ago

+1 for adding the column. We query the DB directly from our website.

farrellw commented 9 months ago

Who pays for the AWS servers? I imagine, while the cost isn't too high, that person probably gets final say as the table size roughly doubles.

I'm happy to implement it ( prolly next week ) if we all agree.

farrellw commented 9 months ago

Btw, took advantage of a buddy's MySQL knowledge at work today. If we switch it up and go the view route, this should do it ( more database transactions, but more flexible, less migrating of the existing stuff ).

create function readable_backblast (bblast LONGTEXT) returns LONGTEXT
BEGIN
    DECLARE updated_text LONGTEXT default "";
    DECLARE count int default 1;
    DECLARE userid varchar(15) default "";
    DECLARE username varchar(100) default "";
    DECLARE useridSubstring varchar(15) default "";
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    BEGIN 
        select count + 1 into count;
    END;

    select bblast INTO updated_text;

    loop1: LOOP     
        select REGEXP_SUBSTR(updated_text, '<@U.{10}>', 1, count) into userid;

        if userid is null then 
            leave loop1;
        end if;

        select SUBSTRING(userid, 3, 11) into useridSubstring;

        select ANY_VALUE(user_name) into username FROM users u WHERE u.user_id  = useridSubstring;

        if username is not null then
            select REGEXP_REPLACE(updated_text, userid, username) into updated_text;
        else
            select count + 1 into count;
        end if;     
    end LOOP loop1;

    return updated_text;
END;

farrellw commented 9 months ago

This issue has been open for awhile, so I’m going to summarize here to help us decide.

Let’s entirely throw out the idea of forcing people to do it on their own client / api side, and only consider the two options that allow people to query the backblast view and just get a better result back.

Add a column to the backblast table
- Steps:
  - Update each DB to add the column.
  - Adjust the Paxminer/Slackblast code to substitute the username in place of the userid.
  - Adjust each view to coalesce(readable_bb, backblast) in each database.
  - Backfill the old data.
- Open Questions: How are the table schemas managed? Just by looking, I assume they aren’t looking at a shared definition file but kind of independently copied over? No idea though someone can clue me in.
- Pros: Reads remain the same as far as database operations are concerned.
- Cons: Data is locked in a point in time, so e.g. name changes won’t reflect accurately. Any update to the script would mean the function acting on backlists is changing respective to when the backlists are ingested.
Add a function call to the attendance view to substitute the user_ids on the fly.
- Pros:
  - Function is written already, pretty much can exactly take the comment above me, so probably maybe half hour of work to test it out. Half hour of work to implement it across all regions.
- Cons: Probably slower performance? Even though it’s one database call from the website still and the data being returned to the website is the same, there are more internal database operations. Do we have data on how many calls we actually receive? While there will be more database queries inside the function, I'm not sure if its that impactful.

I don’t really have a strong opinion. Even though I originally suggested #1, #2 has its benefits too. It’s super easy to implement and iterate on, and it doesn’t make any drastic changes to the DB structure and doesn’t preclude just moving onto option #1 if it’s not working out. Because of that'd I'd vote option number two. But if we want to do option #1 as a group we can, it’s just more moving pieces so gonna be more coordination and take time to implement.

evanpetzoldt commented 9 months ago

My personal opinion would be to go with #1. The function you posted above is super cool, but I would worry about runtimes and resource pull on the database. I don't think the Con you listed for #1 is that big of a deal, IMHO.

On your question about schemas, yes they are essentially copied over. @srschaecher uses a template + SQL management studio to do it, for the few I've done I just have a SQL script.

I'd let @srschaecher cast the deciding vote :)

evanpetzoldt commented 9 months ago

I mean, another option would be to just overwrite the existing backblast field with readable names, which you had mentioned at the top. This would be cleaner, but would cause issues for any regions scraping user_ids in the backblast text (of which I don't think there's many)

srschaecher commented 8 months ago

Sorry guys, I've been out of pocket for a bit - super swamped so just was ignoring all things F3. not good form, I know. My first thought was a simple design like @evanpetzoldt suggested right at the end. Simply doing a dynamic replace of the Backblast text AFTER parsing out all of the user IDs. In the PAX_DB_Miner.py script, this would be the 'backblast' variable in bd_df dataframe. Basically, at ~row 374 could add a function to update the backblast text with a user name lookup.

farrellw commented 8 months ago

I always like retaining a copy of the raw source itself. We don't have any use case for keeping it right now, but it closes doors if we don't retain that raw source holding the user id.

Here's a ( contrived ) example where we don't want to use someone's slack username straight up in a replace.

We expand to a nation dashboard, and we want to search for usernames with their home region prefixes concatted on the end to avoid the constant confusion of the 10 Beakers in 10 different regions. Two in California, two in St. Louis, etc. Someone selecting Beaker from a dropdown list is getting way more backblasts texts that they attended than they want. (edited)

farrellw commented 3 months ago

Did we ever figure a path on this? @srschaecher @evanpetzoldt . I've been pretty disconnected from this project for a couple months but happy to jump in and implement if we want to.

It sounds like most people want to modify the backblast itself in place, saving only the modified copy. Aye?

We can do that, I'll withdraw my reservations about keeping a raw copy of the data :). I would partially suggest that instead of fully replacing the user id with the username, maybe we store it like username[userid] or something like that with an idea that it can be reversible.