Expanded ideas on what analysis to do + time estimates for each:
NGram viewer / keywords over time
Specifically for the paper: pick out keywords from the past year that have been "hot" and compare their timelines between the councils. Can we find specific events and meeting outcomes related to those keywords and add points to the plot? I.e. the plot for "police budget" has a general line for all the mentions of that ngram, but also specific points for legislation / votes related to that topic (see the first sketch after this list). -- 3 days
We can pull Google search trends data for the same ngram, specific to a single city or multiple cities (maybe the county). How do search trends compare to the council meeting discussion trends (see the second sketch after this list)? -- +3 days
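A minimal sketch of what the ngram timeline + event-point plot could look like, assuming we already have a sentence-level table with meeting dates. The file name, column names, and example events are hypothetical placeholders, not the real CDP schema:

```python
# Rough sketch: monthly ngram mentions per council, with event markers overlaid.
# "sentences.csv", its columns, and the events list are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

NGRAM = "police budget"

# Hypothetical input: one row per sentence, with meeting date and council name.
sentences = pd.read_csv("sentences.csv", parse_dates=["meeting_date"])

# Count monthly mentions of the ngram, split by council.
sentences["mentions"] = sentences["text"].str.lower().str.count(NGRAM)
monthly = (
    sentences
    .groupby([pd.Grouper(key="meeting_date", freq="M"), "council"])["mentions"]
    .sum()
    .unstack("council")
)

fig, ax = plt.subplots()
monthly.plot(ax=ax)

# Hand-collected legislation / vote events related to the ngram (made-up examples).
events = [
    ("2020-07-01", "Budget amendment vote"),
    ("2021-11-01", "Final budget adoption"),
]
for date, label in events:
    ax.axvline(pd.Timestamp(date), linestyle="--", alpha=0.5)
    ax.annotate(label, (pd.Timestamp(date), ax.get_ylim()[1]), rotation=90, va="top")

ax.set_ylabel(f'Monthly mentions of "{NGRAM}"')
plt.show()
```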
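For the search-trends side, the unofficial pytrends package can pull Google Trends interest over time. A minimal sketch with state-level geo (city / metro geo codes exist but would need verifying):

```python
# Rough sketch: pull Google Trends data for the same ngram via pytrends
# (an unofficial Google Trends wrapper), to line up against the mention counts above.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    ["police budget"],
    timeframe="2021-01-01 2021-12-31",
    geo="US-WA",  # state-level; metro-level codes exist but I'd need to check them
)
search_interest = pytrends.interest_over_time()

# Resample to monthly so it lines up with the monthly council-mention counts
# from the previous sketch, then eyeball / correlate the two series.
monthly_search = search_interest["police budget"].resample("M").mean()
print(monthly_search)
```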
Sentiment comparison
Pull out a keyword from the last year ("policing", "police budget", "police reform", etc.) and see if we can find general trends in sentiment over time for a single council (see the sketch after this list). -- 3 days
Compare sentiment between Seattle and King County, because one encapsulates the other. I.e. does Seattle say "we will build more housing" while King County says "we cannot build more housing..."? Basically, how does sentiment scale from a city up to the county that contains it? -- +3 days
Similar to the ngram viewer, can we find specific points where legislation / votes mentioned the ngram and add them to the plot, but here specifically to find events where sentiment changed? E.g. Seattle City Council pledged in 2020 to critically evaluate the police budget, and by 2021 they increased the budget; how did they discuss police budgeting across those two events? -- +7 days
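A minimal sketch of the keyword-filtered sentiment trend, using NLTK's VADER as a stand-in scorer (we might swap in a transformer model later). Same hypothetical sentence table as in the ngram sketch, so event markers can be overlaid the same way:

```python
# Rough sketch: mean sentiment per month for sentences mentioning a keyword,
# split by council so Seattle and King County can be plotted side by side.
# Input schema is the same hypothetical placeholder as the ngram sketch.
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

KEYWORD = "police budget"

sentences = pd.read_csv("sentences.csv", parse_dates=["meeting_date"])
hits = sentences[sentences["text"].str.lower().str.contains(KEYWORD, regex=False)].copy()
hits["compound"] = hits["text"].map(lambda t: analyzer.polarity_scores(t)["compound"])

# Mean compound sentiment per month, per council.
monthly_sentiment = (
    hits
    .groupby([pd.Grouper(key="meeting_date", freq="M"), "council"])["compound"]
    .mean()
    .unstack("council")
)
monthly_sentiment.plot()
plt.show()
```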
Word error rate and word substitution
Use closed caption files as the ground truth for Seattle and build a system that generates transcripts with GCP, then compare and find the differences between the ground-truth closed captions and the generated transcripts: specifically word error rate (WER) and substitution tables. The substitution tables are what I care about more: when a word is incorrectly replaced, what words is GCP choosing in place of the correct word (is it rude or harmful language)? -- Would need to position this as "transcription for administrative and political language" (see the sketch below). -- 5 days
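A minimal sketch of the comparison, using the jiwer package for WER and Python's difflib for the word-level alignment behind the substitution table; file names are placeholders:

```python
# Rough sketch: WER via jiwer, plus a substitution table built from difflib's
# aligned "replace" spans (which words did GCP put in place of the caption words?).
from collections import Counter
from difflib import SequenceMatcher

import jiwer

reference = open("closed_captions.txt").read().lower().split()   # ground truth
hypothesis = open("gcp_transcript.txt").read().lower().split()   # GCP output

print("WER:", jiwer.wer(" ".join(reference), " ".join(hypothesis)))

# Substitution table: count (truth, generated) pairs inside each replaced span.
# zip() truncates unequal-length spans; good enough for a first pass.
substitutions = Counter()
matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
for op, a0, a1, b0, b1 in matcher.get_opcodes():
    if op == "replace":
        for truth, generated in zip(reference[a0:a1], hypothesis[b0:b1]):
            substitutions[(truth, generated)] += 1

for (truth, generated), n in substitutions.most_common(20):
    print(f"{truth} -> {generated}: {n}")
```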
Speakerbox and who is speaking
Show off Speakerbox as a general tool for training speaker classification models.
Use Speakerbox across all of the Seattle data for 2021 and calculate who has the most speaking time, for which committees, when voting vs. non-voting, etc. (see the sketch below). -- 7 days
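A minimal sketch of the speaking-time aggregation, assuming Speakerbox has already labeled each transcript segment with a speaker; the segment file and its schema (speaker / committee / start / end columns) are hypothetical placeholders, not Speakerbox's actual output format:

```python
# Rough sketch: total speaking time per speaker, overall and per committee,
# from Speakerbox-labeled segments. Input file and columns are placeholders.
import pandas as pd

segments = pd.read_csv("labeled_segments_2021.csv")
segments["duration_s"] = segments["end_time"] - segments["start_time"]

# Total speaking time per speaker, sorted most to least.
by_speaker = segments.groupby("speaker")["duration_s"].sum().sort_values(ascending=False)

# Same, broken down per committee (a voting vs. non-voting flag column
# could be added to the groupby the same way).
by_committee = (
    segments
    .groupby(["committee", "speaker"])["duration_s"]
    .sum()
    .sort_values(ascending=False)
)

print(by_speaker.head(10))
print(by_committee.head(10))
```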
Outline from last quarter: https://www.overleaf.com/project/61f9910b0b519759f8b1feb2