CouncilDataProject / cdp-roadmap

The public roadmap for CDP.
MIT License

Sequential Topic Segmentation / Session Chapters #9

Open evamaxfield opened 3 years ago

evamaxfield commented 3 years ago

Use Case

Please provide a use case to help us understand your request in context

YouTube has a "Video Chapters" feature that splits the timeline bar into chapters based off of timestamps found in the video description. Example:

[image: example of YouTube video chapters on the timeline bar]

Similarly, it would be incredibly useful to jump around a meeting video / transcript based off of the minutes items of the meeting.

Solution

Please describe your ideal solution

This is going to take a lot of work on the backend and a bit of work on the frontend.

We could get fancy and train a topic model or use some sort of seeded clustering, and we likely will at some point, but as a first-pass implementation it may be interesting to see how far the following gets us:

Look for common phrases ("Moving on to...", "Call the roll", "Attendance", etc.) and apply breakpoints there. Additionally, parse all the minutes item attachments (docs, presentations, etc.) for every minutes item in an event and store the list of words UNIQUE to each specific minutes item. Then compare the transcript against those words: find the breakpoints by taking a moving-window sum of the counts of each minutes item's unique words across the transcript.

I.e.

```
"minutes_item_1": ["municipal", "broadband", "light"],
"minutes_item_2": ["it", "department"],

lets talk about the municipal broadband bill that would enable seattle city light to serve customers with broadband...
...
moving on to funding the seattle IT department...
...
```

The moving-window word count would be able to see that at some point we switch from using the words specific to minutes item 1 to the words specific to minutes item 2. If we combine that with looking for the "section splitter sequences" ("moving on", "call the roll", etc.), I think it could make a good first-pass, fast and cheap chapter identifier.
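The moving-window count above can be sketched roughly as follows. The unique-word sets and transcript are the toy stand-ins from the example; the window size and tie-breaking rule are arbitrary choices for illustration, not part of any real implementation:

```python
# Toy sketch of the moving-window unique-word count idea. The vocabularies
# and transcript come from the example above; window size and tie-breaking
# are arbitrary illustrative choices.
UNIQUE_WORDS = {
    "minutes_item_1": {"municipal", "broadband", "light"},
    "minutes_item_2": {"it", "department"},
}

def window_labels(tokens, window=5):
    """Label each token window with the minutes item whose unique words
    appear most often inside it (ties go to the first item)."""
    labels = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        scores = {
            item: sum(1 for t in chunk if t in words)
            for item, words in UNIQUE_WORDS.items()
        }
        labels.append(max(scores, key=scores.get))
    return labels

def breakpoints(labels):
    """Window start indices where the winning minutes item changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

transcript = (
    "lets talk about the municipal broadband bill that would enable "
    "seattle city light to serve customers with broadband "
    "moving on to funding the seattle it department"
).split()

print(breakpoints(window_labels(transcript)))
```

A real version would need stemming/normalization and a smarter tie-break, but this shows the core mechanic: the winning label flips once the window slides over the second item's vocabulary.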

Then store chapter identifiers as annotations in the transcript for the frontend to parse.
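For illustration, a chapter annotation on the transcript might look something like this; every field name here is hypothetical, not CDP's actual transcript schema:

```python
# Hypothetical shape for a chapter annotation stored alongside the
# transcript. Field names are illustrative stand-ins, not CDP's schema.
chapter_annotation = {
    "annotation_type": "chapter",
    "minutes_item_ref": "minutes_item_1",
    "start_sentence_index": 0,   # first sentence of the chapter
    "end_sentence_index": 49,    # last sentence of the chapter
}
```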

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Topic modeling? Clustering?

Additionally, we should give whatever pipeline we create the ability to skip this step when chapter starts are provided by the user, since Seattle Channel event descriptions include them in most cases nowadays.

Stakeholders

Please add any individual people or teams that should be brought in for discussion on the project

Frontend to actually make the video chapters viewer. Backend for both pipeline and transcript mutation.

Major Components

Please add any major components that need to be done for this project

Dependencies

Please add any other major or minor project dependencies here

Other Notes

Please add any extra notes here

My one concern is how to handle multi-session events. We only store minutes items at the event level and not at the session level, but we will need to find a way to handle this gracefully.

evamaxfield commented 3 years ago

This could also be rolled in with LVN topic clustering on timeline: live example.


evamaxfield commented 3 years ago

And related is some very old work on "events over time": https://github.com/CouncilDataProject/seattle_v1/blob/master/projects/quick_analysis.ipynb

evamaxfield commented 3 years ago

More ideas copied from Slack:

  1. Timestamped minutes items / "the YouTube chapters idea": YouTube introduced a "chapters" feature where, if the author attaches timestamps and text in the video description, it creates chapters that can be hovered over on the play bar. I don't have an example on me, but it's basically the same idea as "timestamping the minutes items" of our meetings. To achieve this, we would need a function that runs during event ingestion and tries to match minutes items to sentences in the transcript, i.e. "these 50 sentences relate to minutes item 1 and these next 70 to minutes item 2". Because all of our sentences are timestamped, we can get timestamped event minutes items using this method. (To store this value, we can store the time in seconds of the minutes item on the EventMinutesItem model.)
  2. During event ingestion, also run a sentiment analysis over the whole event plus each "chapter" / "minutes item transcript section", so that we have an overall sentiment for the meeting and for each minutes item. This can be stored in a new table (and maybe in the transcript as an annotation). These both culminate in giving the legislation page a bit more functionality. My thinking: "Let's make it easy for someone to enter the site and search for 'legislation about upzoning the city', 'legislation for increasing parks funding for social programs', etc."
  3. The user flow would then be: search (for legislation) for "parks funding for social programs" and be taken to relevant pieces of legislation. When they click on one, they see the previously discussed page of "title, abstract, status, etc.", but in the tree viz of the matter history, when we link to each event we can link directly to either (or both!) the whole event OR the start point of the discussion on that matter. Additionally, we can show how positive or negative the discussion about the bill was in that specific meeting.
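Idea 1 above can be sketched as follows; the `Sentence` and `EventMinutesItem` shapes here are simplified stand-ins for illustration, not CDP's real database models:

```python
# Sketch of "timestamped minutes items": given sentences already matched
# to each minutes item, record the item's start time in seconds. The
# Sentence / EventMinutesItem shapes are simplified stand-ins.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sentence:
    text: str
    start_time: float  # seconds from the start of the session video

@dataclass
class EventMinutesItem:
    name: str
    start_time: Optional[float] = None  # filled in during ingestion

def timestamp_minutes_items(assignments, items):
    """assignments maps minutes item name -> list of matched Sentences;
    each item's timestamp becomes its earliest sentence's start time."""
    for item in items:
        sentences = assignments.get(item.name, [])
        if sentences:
            item.start_time = min(s.start_time for s in sentences)
    return items

items = [EventMinutesItem("minutes_item_1"), EventMinutesItem("minutes_item_2")]
assignments = {
    "minutes_item_1": [Sentence("municipal broadband bill...", 12.0)],
    "minutes_item_2": [Sentence("funding the IT department...", 845.5)],
}
timestamp_minutes_items(assignments, items)
```

Since every sentence already carries a timestamp, the matching step is the only hard part; deriving the chapter start times afterward is just a min over each item's matched sentences.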
evamaxfield commented 3 years ago

I have a rough start on this: create an embedding for each minutes item and each sentence in the transcript, then run a moving-window distance comparison, finding the collection of sentences that minimizes each moving window's distance to its minutes item.

From there we can find the "strict boundaries" of the windows by looking for trigger words, i.e. "moving on to...", "next up...", etc.
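That moving-window distance comparison could look something like the sketch below. The 2-d vectors are toy stand-ins; real sentence and minutes item embeddings would come from a language model:

```python
import math

# Toy sketch of the moving-window embedding comparison. The 2-d vectors
# stand in for real sentence / minutes item embeddings.
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_window(sentence_embeddings, item_embedding, window=3):
    """Return the start index of the sentence window whose mean embedding
    is closest (highest cosine similarity) to the minutes item embedding."""
    best_start, best_score = 0, -1.0
    for start in range(len(sentence_embeddings) - window + 1):
        chunk = sentence_embeddings[start:start + window]
        mean = [sum(col) / window for col in zip(*chunk)]
        score = cosine(mean, item_embedding)
        if score > best_score:
            best_start, best_score = start, score
    return best_start

# Three "broadband" sentences followed by three "IT department" sentences.
sentences = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
print(best_window(sentences, [0.0, 1.0]))
```

A fixed window size is the obvious weakness here; the trigger-word boundaries mentioned above would let the windows snap to natural section breaks instead.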

For word embeddings we have one of:

evamaxfield commented 3 years ago

Starting prototype work here: https://github.com/JacksonMaxfield/cue-queue