Closed srschaecher closed 2 months ago
I agree that the dynamically replacing the user_id with the name would be unnecessarily expensive operation and could even timeout on the large beatdowns.
I feel like the second option makes sense to format it on the fly, however I see where we'd want to keep the raw backblast information for re-processing ( e.g. disaster recovery ).
What about having multiple columns? One for human readable version and one for the raw backblast.
The backblast view then selects coalesce(human_readable, original_backblast). And then we could add a ticket to run that process over the existing tables and backfill the human readable.
This code is almost instantaneous. I ran it on 10,000 backblasts and it took about a second:
users_dict = df_users[["user_id", "user_name"]].set_index("user_id").to_dict()["user_name"]
def parse_users(backblast: str, users_dict) -> str:
"""Replaces user IDs in backblast with user names
Args:
backblast (str): backblast text
users_dict (Dict[str, str]): user ID to user name mapping
Returns:
str: formatted backblast
"""
# Build user ID list
pattern = "<@([A-Za-z0-9-_']+)>"
user_ids = re.findall(pattern, backblast)
# Replace user IDs with user names
user_names = [users_dict.get(x, x) for x in user_ids]
user_names = [x.replace("{", "").replace("}", "") for x in user_names]
backblast = backblast.replace("{", "").replace("}", "")
backblast2 = re.sub(pattern, "{}", backblast).format(*user_names)
return backblast2
df["backblast2"] = df["backblast"].apply(parse_users, args=(users_dict,))
Part of me leans towards leaving the existing column as is (I think there are some regions that are parsing it for different things) and adding a column for the formatted backblast text. To @farrellw 's point it would pretty much double the space taken, but I'm not sure if that's a concern or not.
Oh, and I forgot... we probably would get a request to do the same thing to the AO / channel tags. I added that to my code:
users_dict = df_users[["user_id", "user_name"]].set_index("user_id").to_dict()["user_name"]
aos_dict = df_aos[["channel_id", "ao"]].set_index("channel_id").to_dict()["ao"]
def parse_backblast(backblast: str, users_dict, aos_dict) -> str:
"""Replaces user and channel IDs in backblast with readable names
Args:
backblast (str): backblast text
users_dict (Dict[str, str]): user ID to user name mapping
aos_dict (Dict[str, str]): channel ID to ao name mapping
Returns:
str: formatted backblast
"""
# Build user ID list
pattern = "<@([A-Za-z0-9-_']+)>"
user_ids = re.findall(pattern, backblast)
# Replace user IDs with user names
user_names = [users_dict.get(x, x) for x in user_ids]
user_names = [x.replace("{", "").replace("}", "") for x in user_names]
backblast = backblast.replace("{", "").replace("}", "")
backblast2 = re.sub(pattern, "{}", backblast).format(*user_names)
# remove channel id tags and <> from channel names
# ex: <#C01HU2D6S77|ao-alamo-cranks-comz> -> ao-alamo-cranks-comz
pattern = "<#([A-Za-z0-9-_'|]+)>"
channel_ids = re.findall(pattern, backblast2)
if len(channel_ids) > 0:
try:
channel_names = [x.split("|")[1].replace(">", "") for x in channel_ids]
except IndexError:
channel_names = [aos_dict.get(x, x) for x in channel_ids]
channel_names = [x.replace("<", "").replace(">", "") for x in channel_names]
backblast2 = re.sub(pattern, "{}", backblast2).format(*channel_names)
return backblast2
df["backblast2"] = df["backblast"].apply(parse_backblast, args=(users_dict, aos_dict,))
@evanpetzoldt When I mentioned it would be expensive and could potentially timeout, I was referring to including it on the attendance_view.
To my knowledge, MySQL doesn’t have a ‘regex_extract_all’ type function. It allows only a single regex extract at a time. So it would be a funky loop of regex and holding the pointer in memory or starting from the beginning That + running on the MySQL server rather than running on slack’s server.
Do websites query the DB directly, or through an API? If through an API, I’d say the API should do this replacing on the fly. If querying the DB directly, then I also like the idea of adding a column more than any other option. The largest negative is that the PAX names and AO names are locked in time at the moment the human readable form was created. The channel changes in slack, but the human readable form doesn’t change. To mitigate we could update the table once a period ( month? ), re-running this for all backblasts.
I agree it’s doable in Python with the code you provided, and a nice touch to replace the AO as well.
Gotcha. Yeah, I've messed around with trying to do this in a SQL view, and it's messy and extremely slow like you say. Regions do pull straight from the db, so trying to work it into a SQL view doesn't seem like a good idea even if we could.
I agree on creating an extra column. Honestly, I don't think the "locked in" aspect would be a problem 99.9% of the time.
+1 for adding the column. We query the DB directly from our website.
Who pays for the AWS servers? I imagine, while the cost isn't too high, that person probably gets final say as the table size roughly doubles.
I'm happy to implement it ( prolly next week ) if we all agree.
Btw, took advantage of a buddy's MySQL knowledge at work today. If we switch it up and go the view route, this should do it ( more database transactions, but more flexible, less migrating of the existing stuff ).
create function readable_backblast (bblast LONGTEXT) returns LONGTEXT
BEGIN
DECLARE updated_text LONGTEXT default "";
DECLARE count int default 1;
DECLARE userid varchar(15) default "";
DECLARE username varchar(100) default "";
DECLARE useridSubstring varchar(15) default "";
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
BEGIN
select count + 1 into count;
END;
select bblast INTO updated_text;
loop1: LOOP
select REGEXP_SUBSTR(updated_text, '<@U.{10}>', 1, count) into userid;
if userid is null then
leave loop1;
end if;
select SUBSTRING(userid, 3, 11) into useridSubstring;
select ANY_VALUE(user_name) into username FROM users u WHERE u.user_id = useridSubstring;
if username is not null then
select REGEXP_REPLACE(updated_text, userid, username) into updated_text;
else
select count + 1 into count;
end if;
end LOOP loop1;
return updated_text;
END;
This issue has been open for awhile, so I’m going to summarize here to help us decide.
Let’s entirely throw out the idea of forcing people to do it on their own client / api side, and only consider the two options that allow people to query the backblast view and just get a better result back.
Add a column to the backblast table
Add a function call to the attendance view to substitute the user_ids on the fly.
I don’t really have a strong opinion. Even though I originally suggested #1, #2 has its benefits too. It’s super easy to implement and iterate on, and it doesn’t make any drastic changes to the DB structure and doesn’t preclude just moving onto option #1 if it’s not working out. Because of that'd I'd vote option number two. But if we want to do option #1 as a group we can, it’s just more moving pieces so gonna be more coordination and take time to implement.
My personal opinion would be to go with #1. The function you posted above is super cool, but I would worry about runtimes and resource pull on the database. I don't think the Con you listed for #1 is that big of a deal, IMHO.
On your question about schemas, yes they are essentially copied over. @srschaecher uses a template + SQL management studio to do it, for the few I've done I just have a SQL script.
I'd let @srschaecher cast the deciding vote :)
I mean, another option would be to just overwrite the existing backblast field with readable names, which you had mentioned at the top. This would be cleaner, but would cause issues for any regions scraping user_ids in the backblast text (of which I don't think there's many)
Sorry guys, I've been out of pocket for a bit - super swamped so just was ignoring all things F3. not good form, I know. My first thought was a simple design like @evanpetzoldt suggested right at the end. Simply doing a dynamic replace of the Backblast text AFTER parsing out all of the user IDs. In the PAX_DB_Miner.py script, this would be the 'backblast' variable in bd_df dataframe. Basically, at ~row 374 could add a function to update the backblast text with a user name lookup.
I always like retaining a copy of the raw source itself. We don't have any use case for keeping it right now, but it closes doors if we don't retain that raw source holding the user id.
Here's a ( contrived ) example where we don't want to use someone's slack username straight up in a replace.
We expand to a nation dashboard, and we want to search for usernames with their home region prefixes concatted on the end to avoid the constant confusion of the 10 Beakers in 10 different regions. Two in California, two in St. Louis, etc. Someone selecting Beaker from a dropdown list is getting way more backblasts texts that they attended than they want. (edited)
Did we ever figure a path on this? @srschaecher @evanpetzoldt . I've been pretty disconnected from this project for a couple months but happy to jump in and implement if we want to.
It sounds like most people want to modify the backblast itself in place, saving only the modified copy. Aye?
We can do that, I'll withdraw my reservations about keeping a raw copy of the data :). I would partially suggest that instead of fully replacing the user id with the username, maybe we store it like username[userid] or something like that with an idea that it can be reversible.
@farrellw we did not decide on this, but I'm definitely cool making a call on it. My current vote is to create an extra column for it, but would be just as happy replacing the current column
Oh sorry, just saw the second part of your comment... Like I said, I'm also happy with the replacement approach.
Also just FYI, I've been running paxminer the last week or so as Beaker's environment seems to be down and we haven't heard from him in a bit
Got it, oh yeah if we are good with adding a column I still think thats the better option. Retaining a sort of medallion architecture ( raw -> curated ) within the same record.
Sounds good, let's go with that approach. I will get a new column added to the schemas soon
I'm thinking backblast_names
, backblast_parsed
or something for the column name
@farrellw and done. New column is backblast_parsed
Okay sounds good @evanpetzoldt. I'll write it into Paxminer this afternoon. Once its running for a few days, we can run a backfill over the existing dataset so it works for historical data as well.
cool. I will work it into Slackblast as well. Are you also planning to insert the AO / channel names as well? It's not quite as big of deal since it typically has the name in the tag (something like <#C01HU2D6S77|ao-alamo-cranks-comz>
)
@farrellw also for consistency purposes, are you leaving the @ and # before the user / channel names? Ie would it show up as @Moneyball
or just Moneyball
?
@evanpetzoldt Let's remove the @ and # symbol. And yeah lets remove the channel AO in favor of the name.
Here's an MR for this on Paxminer. Works for inserts and updates. To keep it consistent between Slackbot and Paxminer, I used 99% of the same syntax you posted in this chat @evanpetzoldt back in January.
@evanpetzoldt Alright. Looks like both slackblast and paxminer are correctly inserting the parsed backblast for new backblasts.
Most people who use the website connect to the backblast view, correct?
This is the current definition for that view:
select
`B`.`bd_date` AS `Date`,
(
select
`f3stl`.`aos`.`ao`
from
`f3stl`.`aos`
where
(`f3stl`.`aos`.`channel_id` = `B`.`ao_id`)) AS `AO`,
`U1`.`user_name` AS `Q`,
(
select
`f3stl`.`users`.`user_name`
from
`f3stl`.`users`
where
(`f3stl`.`users`.`user_id` = `B`.`coq_user_id`)) AS `CoQ`,
`B`.`pax_count` AS `pax_count`,
`B`.`fngs` AS `fngs`,
`B`.`fng_count` AS `fng_count`,
`B`.`backblast` AS `backblast`
from
(`f3stl`.`beatdowns` `B`
join `f3stl`.`users` `U1` on
((`U1`.`user_id` = `B`.`q_user_id`)))
order by
`B`.`bd_date`,
(
select
`f3stl`.`aos`.`ao`
from
`f3stl`.`aos`
where
(`f3stl`.`aos`.`channel_id` = `B`.`ao_id`));
We can adjust the last line of the select statement like so, so it takes the parsed backblast if its not null, otherwise returns the normal backblast.
This would be a proposed new definition:
select
`B`.`bd_date` AS `Date`,
(
select
`f3stl`.`aos`.`ao`
from
`f3stl`.`aos`
where
(`f3stl`.`aos`.`channel_id` = `B`.`ao_id`)) AS `AO`,
`U1`.`user_name` AS `Q`,
(
select
`f3stl`.`users`.`user_name`
from
`f3stl`.`users`
where
(`f3stl`.`users`.`user_id` = `B`.`coq_user_id`)) AS `CoQ`,
`B`.`pax_count` AS `pax_count`,
`B`.`fngs` AS `fngs`,
`B`.`fng_count` AS `fng_count`,
coalesce(`B`.backblast_parsed , `B`.`backblast`) AS `backblast` -- this line was changed
from
(`f3stl`.`beatdowns` `B`
join `f3stl`.`users` `U1` on
((`U1`.`user_id` = `B`.`q_user_id`)))
order by
`B`.`bd_date`,
(
select
`f3stl`.`aos`.`ao`
from
`f3stl`.`aos`
where
(`f3stl`.`aos`.`channel_id` = `B`.`ao_id`));
View updated, backfill complete. Feels good to close this one @srschaecher , @evanpetzoldt.
In the BD_PAX_Miner.py script: Replace the user_id with the user_name in the actual backblast text. When Slack sends the full message via the API, any pax names tagged in the PAX: line (and anywhere else in the backblast text) are sent as a user id like instead of a human readable name. This results in an ugly backblast text that gets shown via the backblast view (a lot of regions are pulling from this view and showing the backblasts in their websites). Wanted to investigate whether the view could dynamically replace the user_id's with user_names, but I suspect that will have a big performance hit having to replace the values every time the view is accessed. It is probably much more efficient to just replace the text in the stored backblast upon initial processing and import.