Open evamaxfield opened 3 years ago
This could also be rolled in with LVN topic clustering on timeline: live example.
And related is some very old work on "events over time": https://github.com/CouncilDataProject/seattle_v1/blob/master/projects/quick_analysis.ipynb
More ideas copied from Slack:
- timestamped minutes items / "the YouTube chapters idea" -- YouTube introduced a "chapters" feature: if the author attaches timestamps and text in the video description, it creates chapters that can be hovered over on the play bar. I don't have an example on me, but it's basically a similar idea to "timestamping the minutes items" of our meetings. To achieve this, we would need a function that runs during event ingestion and tries to match minutes items to sentences in the transcript, i.e. "these 50 sentences relate to minutes item 1 and the next 70 to minutes item 2". Because all of our sentences are timestamped, we can get timestamped event minutes items with this method. (To store this value, we can store the time in seconds of the minutes item on the EventMinutesItem model.)
- during event ingestion, we could also run a sentiment analysis over the whole event and over each "chapter" (minutes item transcript section), so that we have an overall sentiment for the meeting and for each minutes item. This could be stored in a new table (and maybe in the transcript as an annotation). These two ideas culminate in giving the legislation page a bit more functionality. My thinking: "Let's make it easy for someone to enter the site and search for 'legislation about upzoning the city', 'legislation for increasing parks funding for social programs', etc."
- The user flow would then be: search for legislation, e.g. "parks funding for social programs", and be taken to relevant pieces of legislation. Clicking on one shows the previously discussed page of "title, abstract, status, etc.", but in the tree viz of the matter history, each linked event can point to either (or both!) the whole event OR the start point of the discussion on that matter. Additionally, we can show how positive or negative the discussion of the bill was in that specific meeting.
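Assuming chapter boundaries have already been found, the per-chapter sentiment idea above could be sketched roughly like this. The lexicon-based scorer is a hypothetical stand-in for whatever sentiment model we actually pick, and `Sentence` here only mimics the timestamped-sentence shape of our transcripts:

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    start_time: float  # seconds into the meeting
    text: str

# Hypothetical word lists standing in for a real sentiment model.
POSITIVE = {"support", "great", "thank", "approve"}
NEGATIVE = {"oppose", "concern", "against", "deny"}

def sentence_sentiment(text: str) -> float:
    """Crude per-sentence score in roughly [-1, 1]."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)

def chapter_sentiments(sentences, boundaries):
    """Mean sentiment for each [start, end) slice of sentence indices,
    i.e. one score per chapter / minutes item section."""
    results = []
    for start, end in boundaries:
        chunk = sentences[start:end]
        results.append(
            sum(sentence_sentiment(s.text) for s in chunk) / max(len(chunk), 1)
        )
    return results
```

The event-level sentiment would just be the same computation over one boundary spanning the whole transcript.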
I have a rough start on this: create an embedding for each minutes item and each sentence in the transcript, then run a moving-window distance comparison, finding the collection of sentences that minimizes each window-to-minutes-item distance.
From there we can find the "strict boundaries" of the windows by looking for trigger phrases, e.g. "moving on to...", "next up...", etc.
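A minimal sketch of that moving-window distance comparison, assuming sentence and minutes-item embeddings already exist as numpy arrays (`best_window` is a placeholder name, not existing prototype code):

```python
import numpy as np

def best_window(sentence_embeddings: np.ndarray, item_embedding: np.ndarray, window: int):
    """Return (start_index, distance) of the window of `window` consecutive
    sentences whose mean embedding is closest (cosine distance) to the
    minutes-item embedding."""
    best_start, best_dist = 0, np.inf
    for start in range(len(sentence_embeddings) - window + 1):
        mean_vec = sentence_embeddings[start:start + window].mean(axis=0)
        cos = np.dot(mean_vec, item_embedding) / (
            np.linalg.norm(mean_vec) * np.linalg.norm(item_embedding)
        )
        dist = 1.0 - cos
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist
```

A real version would also need to keep the matched windows in minutes-item order and non-overlapping, which this sketch ignores.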
For word embeddings we have one of:
Starting prototype work here: https://github.com/JacksonMaxfield/cue-queue
Use Case
Please provide a use case to help us understand your request in context
YouTube has a "Video Chapters" feature that splits the timeline bar into chapters based on timestamps found in the video description. Example:
Similarly, it would be incredibly useful to jump around a meeting video / transcript based on the minutes items of the meeting.
Solution
Please describe your ideal solution
This is going to take a lot of work on the backend side and a bit of work on the frontend.
We could be fancy and train a topic model or use some sort of seeded clustering, and we likely will at some point, but as a first-pass implementation it may be interesting to see how far the following gets us:
Look for common phrases ("Moving on to...", "Call the roll", "Attendance", etc.) and apply breakpoints there. Additionally, parse all the attachments (docs, presentations, etc.) for every minutes item of an event and store the list of words UNIQUE to a specific minutes item. Then scan the transcript for those words: find the breakpoints by taking a moving-window sum of the counts of each minutes item's unique words against the transcript.
That is, the moving-window word count would be able to see that at some point we switch from using specific words found in minutes item 1 to using specific words found in minutes item 2. If we combine that with looking for the "section splitter sequences" ("moving on", "call the roll", etc.), I think it may be a good first-pass, fast and cheap chapter identifier.
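The unique-word moving-window idea can be sketched as follows (both function names are placeholders, not existing pipeline code):

```python
from collections import Counter

def unique_item_words(item_word_lists):
    """Given one word list per minutes item (e.g. parsed from its
    attachments), return the set of words unique to each item."""
    counts = Counter(w for words in item_word_lists for w in set(words))
    return [{w for w in words if counts[w] == 1} for words in item_word_lists]

def window_scores(transcript_words, vocab, window=50):
    """Moving-window count of `vocab` hits across the transcript.
    A high score means the window is likely discussing that item."""
    hits = [w in vocab for w in transcript_words]
    running = sum(hits[:window])
    scores = [running]
    for i in range(window, len(hits)):
        running += hits[i] - hits[i - window]
        scores.append(running)
    return scores
```

Breakpoints would then fall where the score for item 1's vocabulary drops off and item 2's rises, snapped to the nearest "section splitter sequence".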
Then store chapter identifiers as annotations in the transcript for the frontend to parse.
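For the frontend handoff, the annotation could look something like the shape below. This is purely illustrative; the actual transcript schema and field names are up to the backend, and the minutes item names shown are made up:

```python
# Hypothetical chapter annotations attached to a transcript object.
# Times are seconds from the start of the video, taken from the
# timestamps of the first/last sentence matched to each minutes item.
transcript = {
    "sentences": [],  # the usual timestamped sentences would live here
    "annotations": {},
}

transcript["annotations"]["chapters"] = [
    {"minutes_item": "CB 120001", "start_time": 132.5, "end_time": 918.0},
    {"minutes_item": "CB 120002", "start_time": 918.0, "end_time": 1504.2},
]
```

The frontend would only need to read `annotations["chapters"]` to draw chapter markers on the play bar.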
Alternatives
Please describe any alternatives you've considered, even if you've dismissed them
Topic modeling? Clustering?
Additionally, we should give whatever pipeline we create the ability to skip this step if chapter starts are provided by the user, as Seattle Channel event descriptions include them in most cases nowadays.
Stakeholders
Please add any individual people or teams that should be brought in for discussion on the project
Frontend to actually make the video chapters viewer. Backend for both pipeline and transcript mutation.
Major Components
Please add any major components that need to be done for this project
Dependencies
Please add any other major or minor project dependencies here
Other Notes
Please add any extra notes here
My one concern is how to handle multi-session events. We only store minutes items at the event level, not the session level, so we will need to find a way to handle this gracefully.