Closed smai-f closed 1 year ago
@smai-f Would we add the timestamps to Session? Closer to the video_uri seems clearer to me.
Furthermore, what will the type/format of the timestamps be? Something like FFmpeg's Time Duration Syntax seems reasonable here.
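To make the format question concrete, here is a minimal sketch of parsing such a string into seconds. It covers only a subset of FFmpeg's time duration syntax ([HH:]MM:SS[.m...] and plain seconds), and the name parse_timestamp is hypothetical, not something that exists in the backend.

```python
def parse_timestamp(value: str) -> float:
    """Convert "[HH:]MM:SS[.m...]" or plain seconds ("90", "90.5") to seconds.

    Hypothetical helper: only a subset of FFmpeg's time duration syntax is
    handled (no negative values, no "ms"/"us" unit suffixes).
    """
    text = value.strip()
    if text.startswith("-"):
        raise ValueError(f"Negative timestamps are not supported: {value!r}")
    parts = text.split(":")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"Unrecognized timestamp: {value!r}")
    seconds = 0.0
    # Each additional field to the right is worth 1/60 of the one before it.
    for part in parts:
        seconds = seconds * 60 + float(part)
    return seconds
```

Storing the parsed value as a float keeps later arithmetic (offsets, durations) simple, while the scraper can still hand over a human-readable string.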
I agree that this would be valuable. This is a large change and would require some frontend work too.
From my "big picture perspective":
Add transcription_start_time and transcription_end_time, each accepting Optional[str] in HH:MM:SS, to the Session object.
With the session change, the database Transcript model would likely need to be updated to have an in_session_start_time and in_session_end_time or similar (preferably as a float in seconds or something) to make it easier to stitch multiple transcripts together.
Session object probably)
Side note on the analysis side: downstream I should probably also write a function to stitch transcripts together prior to analysis.
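A rough sketch of what stitching could look like, assuming each transcript chunk stores its in_session_start_time as float seconds and that sentence times are relative to the start of the chunk. Sentence and TranscriptChunk are stand-in types for illustration, not the real database or transcript models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    # Stand-in type; times are assumed to be relative to the chunk start.
    start_time: float
    end_time: float
    text: str


@dataclass
class TranscriptChunk:
    # Stand-in type; offset of this chunk within the full session video.
    in_session_start_time: float
    sentences: List[Sentence]


def stitch_transcripts(chunks: List[TranscriptChunk]) -> List[Sentence]:
    """Merge chunks into one sentence list with session-relative times."""
    stitched: List[Sentence] = []
    for chunk in sorted(chunks, key=lambda c: c.in_session_start_time):
        offset = chunk.in_session_start_time
        for sentence in chunk.sentences:
            stitched.append(
                Sentence(
                    start_time=sentence.start_time + offset,
                    end_time=sentence.end_time + offset,
                    text=sentence.text,
                )
            )
    return stitched
```

If the stored sentence times were already relative to the full video, the offset addition would be dropped and only the sort would remain.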
cc @tohuynh @BrianL3 if you have thoughts on this / let me know if my comments are wrong.
I think the query to get the session transcripts would be difficult but possible
I don't think you'd need to change anything on the front end. If I understand correctly, this is a request on the backend to create several Events from one council meeting (with each Event having its own video, with the correct time slice and transcript).
Hmmmm, doing all of this on the Event may indeed make it easier to implement. It's not as "archivally" clean, because technically all of these chunks are from the same event, but I see your point.
I was saying to make each chunk its own Session, not Event.
Thought about this overnight. I think To is right regardless of whether Session or Event is used as the place to put data, IF we do the session handling correctly. Nothing else on the front end or backend would need to change.
Iirc, you basically want to transcribe just portions of certain bill discussions. If you make the session index the same as the minutes item index then over time the sessions will automatically be put into correct order.
I.e. we may start out with a "session 20" as the only session for the event, but later it may be "session 14" and "session 20".
One minor thing -- we need to include the session's start and end times in its hash generation function only if they are present.
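A minimal sketch of that "only if present" idea. The field choice, the separator, and the name generate_session_id are all hypothetical; the backend's real id generation is not reproduced here. The point is only that a session without start/end times hashes exactly as it did before.

```python
import hashlib
from typing import Optional


def generate_session_id(
    video_uri: str,
    session_datetime_iso: str,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> str:
    """Hash the legacy fields, appending the time range only when given."""
    parts = [video_uri, session_datetime_iso]
    # Labels keep "start only" distinct from "end only" with the same value.
    if start_time is not None:
        parts.append(f"start={start_time}")
    if end_time is not None:
        parts.append(f"end={end_time}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

With this shape, previously processed sessions keep their ids, while two slices of the same video get distinct ids.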
I can see arguments for Session or Event. With Session, all sessions are together on one page. But at the same time, if a user searches "Bill 123", they would want to go straight to the page for Bill 123, without having to navigate the different Sessions of an Event page.
The other way is to navigate to the first Session whose transcript contains the query "Bill 123" when the user clicks on a search result.
Again. I agree. However I think I would want to just implement a different search to make that possible in our current infrastructure too. That sounds useful in general.
Iirc the original reason for this request was also to link directly to the event, which is still possible with the share-at-time-point link because it includes the session.
One minor nice to have would be if an end time isn't provided, process from the start time until the end of the video so one needn't find the end time in the scraper.
If one or both of those params are present, chunk the audio to that length and send just that chunk to transcription.
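As a sketch of that behavior, assuming the clipping were done with the ffmpeg command line (the thread does not say which tool the pipeline would use), the arguments could be assembled like this. build_clip_command is a hypothetical name; -ss, -to, -i, -vn, and -y are standard ffmpeg options, used here as input options so both positions refer to timestamps in the source video.

```python
from typing import List, Optional


def build_clip_command(
    video_path: str,
    audio_path: str,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> List[str]:
    """Build ffmpeg arguments that extract audio for the requested range.

    A missing start means "from the beginning"; a missing end means
    "until the end of the video", so neither flag is emitted in that case.
    """
    cmd = ["ffmpeg", "-y"]
    if start_time is not None:
        cmd += ["-ss", start_time]  # seek to the start position in the input
    if end_time is not None:
        cmd += ["-to", end_time]  # stop reading the input at this position
    cmd += ["-i", video_path, "-vn", audio_path]  # -vn drops the video stream
    return cmd
```

The list can be passed to subprocess.run(cmd, check=True); when neither time is given the command falls back to processing the whole file, which matches the current behavior.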
@smai-f I think that was the intention
In short:
There are no technical limitations to adding this functionality on the backend, the frontend, or in downstream data analysis.
I think we agree that we could ingest this data on the Session object like so:
from typing import Optional

class Session:
    # prior attrs ...
    transcription_start_time: Optional[str] = None
    transcription_end_time: Optional[str] = None
Additionally, the hashing function / id generation function for the Session should maybe be updated (this is a larger concern -- we want to keep the same session ids for previously processed sessions if possible, but allow this behavior to store unique sessions with different start and end times for the same video -- if we can't do this, or can't do it safely, then no worries: we have shipped hashing function fixes before and could do it again).
Finally the actual clipping of the video/audio should be added to the pipeline.
If that work is done, merged, and released then great. The scraper (and what you return) is ultimately up to you all.
If you want to keep the sessions linked together into the same event then use the same body and event datetime.
Pros:
If you want them to be different events then the body name can be the same or different but you must use a different event datetime.
Pros:
I see pros to both sides and I think I have been talked down from my hard-line position by @tohuynh hahaha. It really depends on what you are interested in.
Regardless of the decision on this, I think we should implement a search page that indexes each sentence of a transcript and, on query, returns an event link with a timestamp to the sentence time point, which as @tohuynh pointed out would be useful.
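A toy sketch of that per-sentence search, using a plain case-insensitive substring match in place of a real index. IndexedSentence, search_sentences, and the link format (/events/<id>?s=<session>&t=<seconds>) are all assumptions for illustration and may not match the actual frontend routes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexedSentence:
    # Stand-in record: one row per transcript sentence.
    event_id: str
    session_index: int
    start_time: float  # seconds from the start of the session video
    text: str


def search_sentences(sentences: List[IndexedSentence], query: str) -> List[str]:
    """Return event links that jump to each sentence containing the query."""
    needle = query.lower()
    return [
        # Hypothetical URL shape: session index plus whole-second offset.
        f"/events/{s.event_id}?s={s.session_index}&t={int(s.start_time)}"
        for s in sentences
        if needle in s.text.lower()
    ]
```

Because each hit carries its own session index and time point, the same search works whether the slices end up as separate Sessions or separate Events.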
Feature Description
The ability to transcribe only a portion of a video and have it be considered an event.
Use Case
We are scraping a state legislature site for bill hearings and many of the videos are ~5-6 hours long with discussions about several bills in one video. The timestamps of when the bill discussions happen are programmatically available. We want to be able to use the same video multiple times, but with different timestamp ranges, so the result is several Events each only with a slice of the same mp4, and no duplicate processing happens.
Solution
Add something like a start_timestamp / stop_timestamp to the EventIngestionModel. When these values are present for the event, the backend should only process and transcribe the given video & audio range. The event frontend should also only render this segment of the whole video (potentially covered by recent hackathon efforts?)
Alternatives