Accept timestamps and only process a subset of a video as an event

smai-f commented 2 years ago

Feature Description

The ability to transcribe only a portion of a video and have it be considered an event.

Use Case

We are scraping a state legislature site for bill hearings and many of the videos are ~5-6 hours long with discussions about several bills in one video. The timestamps of when the bill discussions happen are programmatically available. We want to be able to use the same video multiple times, but with different timestamp ranges, so the result is several Events each only with a slice of the same mp4, and no duplicate processing happens.

Solution

Add something like a start_timestamp / stop_timestamp to the EventIngestionModel. When these values are present for the event, the backend should only process and transcribe the given video & audio range.

The event frontend should also only render this segment of the whole video (potentially covered by recent hackathon efforts?)

Alternatives

whargrove commented 2 years ago

@smai-f Would we add the timestamps to Session? Closer to the video_uri seems clearer to me.

Furthermore, what will the type/format of the timestamps be? Something like FFmpeg's Time Duration Syntax seems reasonable here.
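For illustration, here is a minimal sketch of parsing such timestamps into seconds. It only handles the colon-separated `[HH:]MM:SS[.m...]` style plus bare seconds, not the full FFmpeg syntax (no unit suffixes like `ms`), and the function name `timestamp_to_seconds` is hypothetical rather than anything in cdp-backend.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert "HH:MM:SS", "MM:SS", or "SS" (fractions allowed) to seconds."""
    parts = timestamp.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timestamp: {timestamp!r}")

    seconds = 0.0
    for part in parts:
        # Each colon shifts the running total up by a factor of 60.
        seconds = seconds * 60 + float(part)
    return seconds
```

Storing the parsed value as a float would also line up with keeping start and end times in seconds on the database side.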

evamaxfield commented 2 years ago

I agree that this would be valuable. This is a large change and would require some frontend work too.

From my "big picture perspective":

Accept transcription_start_time and transcription_end_time parameters, each an Optional[str] in HH:MM:SS format, on the Session object.
If one or both of those params are present, chunk the audio to that range and send just that chunk to transcription.
Fortunately we already have the concept of "multiple transcripts per session" (for the case where we want to retranscribe something with a better model), so this behavior would be allowable without too much database model change. BUT, similar to the session change, the DATABASE Transcript model would likely need to be updated to have an in_session_start_time and in_session_end_time or similar (preferably as a float in seconds) to make it easier to stitch multiple transcripts together.
On the frontend, we also have a concept of multiple transcripts per event, so this is just fetching all the transcripts for the same session that are chunks and stitching them together in order.
In the transcript format, we have a concept of "annotations". If it is of use to you all, it may be nice to store an annotation of what the section is (a SectionAnnotation) if there is a nice section name for it. That would probably add another parameter to the Session object.

sidenote on the analysis side: downstream I should probably also write a function to stitch transcripts together prior to analysis.
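A rough sketch of what that stitching could look like, assuming each chunk transcript carries an in_session_start_time in seconds and that sentence times inside a chunk are relative to the start of that chunk. The `Sentence` and `TranscriptChunk` classes here are simplified, hypothetical stand-ins, not the real cdp-backend transcript model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    start_time: float  # seconds, relative to the start of its chunk
    end_time: float


@dataclass
class TranscriptChunk:
    in_session_start_time: float  # seconds into the full session video
    sentences: List[Sentence]


def stitch_transcripts(chunks: List[TranscriptChunk]) -> List[Sentence]:
    """Merge chunk transcripts into one list with session-relative times."""
    stitched: List[Sentence] = []
    # Order chunks by where they begin in the full video.
    for chunk in sorted(chunks, key=lambda c: c.in_session_start_time):
        offset = chunk.in_session_start_time
        for sentence in chunk.sentences:
            stitched.append(
                Sentence(
                    text=sentence.text,
                    start_time=sentence.start_time + offset,
                    end_time=sentence.end_time + offset,
                )
            )
    return stitched
```

If chunk transcripts already store session-relative times, the offset step would be dropped and only the sort is needed.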

evamaxfield commented 2 years ago

cc @tohuynh @BrianL3 if you have thoughts on this / let me know if my comments are wrong.

I think the query to get the session transcripts would be difficult but possible

tohuynh commented 2 years ago

I don't think you'd need to change anything on the front end. If I understand correctly, this is a request on the backend to create several Events from one council meeting. (With each Event having its own video, with the correct time slice and transcript.)

evamaxfield commented 2 years ago

Hmmmm doing all of this on the Event may indeed make it easier to implement. It's not as "archivally" clean, because technically all of these chunks are from the same event, but I see your point.

evamaxfield commented 2 years ago

I was saying to make each chunk its own Session not Event.

evamaxfield commented 2 years ago

Thought about this overnight. I think To is right regardless of Session or Event being used as the place to put data IF we do the session handling correctly. Nothing else on the front end or backend would need to change.

Iirc, you basically want to transcribe just portions of certain bill discussions. If you make the session index the same as the minutes item index then over time the sessions will automatically be put into correct order.

I.e. we may start out with a "session 20" as the only session for the event but then it may be "session 14" and "session 20".

evamaxfield commented 2 years ago

One minor thing -- we need to include the session's start and end times in its hash generation function only if they are present.
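A sketch of the idea, assuming the id is a hash over a few session fields. The function name, the fields hashed, and the 12-character truncation are all hypothetical and do not reflect the real cdp-backend id generation. The point is only that omitted times contribute nothing to the hash input, so previously processed sessions keep their ids.

```python
import hashlib
from typing import Optional


def generate_session_id(
    video_uri: str,
    session_datetime_iso: str,
    transcription_start_time: Optional[str] = None,
    transcription_end_time: Optional[str] = None,
) -> str:
    """Hash session fields, adding clip times only when they are set."""
    parts = [video_uri, session_datetime_iso]
    # Skipping absent times keeps the hash input identical to the old one.
    if transcription_start_time is not None:
        parts.append(f"start={transcription_start_time}")
    if transcription_end_time is not None:
        parts.append(f"end={transcription_end_time}")
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return digest[:12]
```

Prefixing the values with `start=` and `end=` keeps a session with only a start time from colliding with one that has only an end time of the same value.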

tohuynh commented 2 years ago

I think To is right regardless of Session or Event being used as the place to put data IF we do the session handling correctly. Nothing else on the front end or backend would need to change.

I can see arguments for Session or Event. With Session, all sessions are together on one page. But at the same time, if a user searches "Bill 123", they would want to go straight to the page for Bill 123, without having to navigate the different Sessions of an Event page.

The other way is to navigate to the first Session whose transcript contains the query "Bill 123" when the user clicks on a search result.

evamaxfield commented 2 years ago

Again. I agree. However I think I would want to just implement a different search to make that possible in our current infrastructure too. That sounds useful in general.

Iirc the original reason for this request was also to link directly to the event, which is still possible with the "share at time point" feature because the shared link includes the session.

smai-f commented 2 years ago

One minor nice to have would be if an end time isn't provided, process from the start time until the end of the video so one needn't find the end time in the scraper.
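Something like the following could cover that defaulting. It is a sketch that assumes times are already in seconds and the video duration is known; `resolve_clip_range` is a hypothetical helper, not an existing pipeline function.

```python
from typing import Optional, Tuple


def resolve_clip_range(
    video_duration: float,
    start: Optional[float] = None,
    end: Optional[float] = None,
) -> Tuple[float, float]:
    """Fill in missing bounds: start defaults to 0, end to the video end."""
    clip_start = 0.0 if start is None else start
    # An end past the video is clamped rather than treated as an error.
    clip_end = video_duration if end is None else min(end, video_duration)
    if not 0 <= clip_start < clip_end:
        raise ValueError(
            f"Invalid clip range: start={clip_start}, end={clip_end}"
        )
    return clip_start, clip_end
```

With this shape the scraper only needs to supply whichever bound it actually knows.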

chrisjkhan commented 2 years ago

If one or both of those params are present, chunk the audio to that length and send just that chunk to transcription.

@smai-f I think that was the intention

evamaxfield commented 2 years ago

In short:

There are no technical limitations to adding this functionality on the backend, the frontend, or in downstream data analysis.

I think we agree that we could ingest this data on the Session object like so:

from typing import Optional

class Session:
    # prior attrs ...
    # Optional HH:MM:SS strings; None means no clipping on that side.
    transcription_start_time: Optional[str] = None
    transcription_end_time: Optional[str] = None

Additionally, the hashing function / id generation function for the Session should maybe be updated (this is a larger concern -- we want to keep the same session ids for previously processed sessions if possible but allow this behavior to store unique sessions with different start and end times for the same video -- if we can't do this safely then no worries, we have shipped hashing function fixes before and could do it again).

Finally the actual clipping of the video/audio should be added to the pipeline.
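One possible way to do the clipping is to shell out to the ffmpeg CLI. This is only a sketch: the thread does not say how the pipeline should clip, and `build_clip_command` / `clip_video` are hypothetical names. It assumes ffmpeg is installed and uses `-ss` / `-to` as input options with stream copy, which is fast but cuts on keyframes, so the clip boundaries may be slightly off.

```python
import subprocess
from typing import List, Optional


def build_clip_command(
    src: str,
    dst: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
) -> List[str]:
    """Build an ffmpeg command that copies only the requested time range."""
    cmd = ["ffmpeg", "-y"]
    # Placed before -i, both options refer to timestamps in the input file.
    if start is not None:
        cmd += ["-ss", start]
    if end is not None:
        cmd += ["-to", end]
    # "-c copy" avoids re-encoding the video and audio streams.
    cmd += ["-i", src, "-c", "copy", dst]
    return cmd


def clip_video(
    src: str,
    dst: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
) -> None:
    """Run the clip command, raising CalledProcessError if ffmpeg fails."""
    subprocess.run(build_clip_command(src, dst, start, end), check=True)
```

If frame-accurate boundaries matter, re-encoding instead of stream copying would be the tradeoff to consider.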

If that work is done, merged, and released then great. The scraper (and what you return) is ultimately up to you all.

If you want to keep the sessions linked together into the same event then use the same body and event datetime.

Pros:

"nicer provenance" / sessions are linked
it's the standard CDP experience

If you want them to be different events then the body name can be the same or different but you must use a different event datetime.

Pros:

likely better search because more specificity in indexing
fewer minutes items per event (since I would assume you would really just have a single minutes item for the section in question)
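To make the two options concrete, here is a toy sketch of the grouping rule described above: sessions that share a body and event datetime land in the same event, and a different event datetime produces a separate event. The tuple layout and function name are hypothetical; this is not how cdp-backend is actually implemented.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

EventKey = Tuple[str, datetime]


def group_sessions_into_events(
    records: List[Tuple[str, datetime, str]],
) -> Dict[EventKey, List[str]]:
    """Group (body, event_datetime, session_label) records by event key."""
    events: Dict[EventKey, List[str]] = defaultdict(list)
    for body, event_datetime, session_label in records:
        events[(body, event_datetime)].append(session_label)
    return dict(events)
```

So a scraper that wants one event per bill discussion could, for example, use each discussion's own start time as the event datetime.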

I see pros to both sides and I think I have been talked down from my hard-line position by @tohuynh hahaha. It really depends on what you are interested in.

Regardless of the decision on this, I think we should implement a search page that indexes each sentence of a transcript and, on query, returns an event link with a timestamp pointing to the sentence's time point. As @tohuynh pointed out, that would be useful.
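Very roughly, a per-sentence search could look like the sketch below. Everything here is hypothetical: the `IndexedSentence` fields, the plain substring match, the placeholder base URL, and the `s` / `t` query parameter names are assumptions for illustration, not the real CDP index or link format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexedSentence:
    event_id: str
    session_index: int
    start_time: float  # seconds into the session video
    text: str


def search_sentences(
    index: List[IndexedSentence],
    query: str,
    base_url: str = "https://example.org/events",
) -> List[str]:
    """Return event links that jump to each sentence containing the query."""
    needle = query.lower()
    links = []
    for sentence in index:
        if needle in sentence.text.lower():
            links.append(
                f"{base_url}/{sentence.event_id}"
                f"?s={sentence.session_index}&t={int(sentence.start_time)}"
            )
    return links
```

A real implementation would want a proper text index rather than a linear scan, but the returned link shape is the important part.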

CouncilDataProject / cdp-backend