CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0
22 stars 26 forks source link

Add transcription range fields to database and ingestion models, add … #221

Closed chrisjkhan closed 1 year ago

chrisjkhan commented 1 year ago

Link to Relevant Issue

This pull request resolves #211

Description of Changes

-Add optional transcription range fields to database model -Add optional transcription range fields to ingestion model -Add validator for time duration -Add range filter to ffmpeg audio split (similar to clip functionality) -Update tests for valid inputs

codecov[bot] commented 1 year ago

Codecov Report

Merging #221 (46aff14) into main (3fb35a2) will increase coverage by 0.07%. The diff coverage is 84.05%.

@@            Coverage Diff             @@
##             main     #221      +/-   ##
==========================================
+ Coverage   72.33%   72.40%   +0.07%     
==========================================
  Files          64       64              
  Lines        3492     3541      +49     
==========================================
+ Hits         2526     2564      +38     
- Misses        966      977      +11     
Impacted Files Coverage Δ
cdp_backend/pipeline/mock_get_events.py 100.00% <ø> (ø)
cdp_backend/utils/file_utils.py 87.26% <57.14%> (-2.64%) :arrow_down:
cdp_backend/pipeline/event_gather_pipeline.py 84.70% <63.63%> (-0.76%) :arrow_down:
cdp_backend/pipeline/ingestion_models.py 99.01% <91.66%> (-0.99%) :arrow_down:
cdp_backend/database/models.py 100.00% <100.00%> (ø)
cdp_backend/database/validators.py 89.04% <100.00%> (+0.98%) :arrow_up:
cdp_backend/tests/database/test_validators.py 100.00% <100.00%> (ø)
...ckend/tests/pipeline/test_event_gather_pipeline.py 100.00% <100.00%> (ø)
cdp_backend/tests/utils/test_file_utils.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

chrisjkhan commented 1 year ago

Codewise, this all looks great. Would be curious as to how you found or contributing notes and all of that. Any places for improvement?

I made it this far! If there's anything that I'm doing that makes it appear I missed the documentation, let me know and I might have some feedback.

Implementation wise... did you have a chance to test this on a dev deployment?

No, that was the last thing I mentioned to @smai-f. Hopefully, we can get a test process going as we integrate the scraper.

I think this would transcribe the portion selected but the web UI / session interface may fail because the video doesn't match up with the audio... the best way to do this interestingly may be to split the video by timestamp and then do all the normal downstream processing rather than split the audio by timestamp...

My ideal was to do as you mentioned, and probably integrate with clip functionality. The main reason I didn't was because there seems to be a reliance on existing video hosting, and I thought that would be worth preserving.

I didn't realize how the front end functionality worked (oops!), but I think there are some easy ways to preserve both the existing video hosting and accurate timestamps. The simplest might be to send the offset to the transcribe function to alter the timestamps in the transcript: https://github.com/CouncilDataProject/cdp-backend/blob/1e5b80743c4c0374f5a7d7df9731346d61f02f55/cdp_backend/sr_models/google_cloud_sr_model.py#L199

In addition, we can update the front end so it seeks to the appropriate place on page load: https://github.com/CouncilDataProject/cdp-frontend/blob/77944572b816ed07ef0dcffd772fc19b52aea12e/src/components/Details/EventVideo/EventVideo.tsx#L133

Are you on the call this Thursday? If so it might be good to talk about this feature then. / How to dev-deploy

I should be available. Can you add me to the event?

dphoria commented 1 year ago

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

Thank you very much @chrisjkhan for this!

chrisjkhan commented 1 year ago

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Thank you very much @chrisjkhan for this!

You're welcome

dphoria commented 1 year ago

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Sounds great. Just let me know / hit the "re-request review" button whenever.

chrisjkhan commented 1 year ago

I may have gone a little overboard with changes trying to slightly increase consistency across method signatures. The only thing I changed that might be controversial is making the hash calculation part of an already large task rather than keeping it separate. If you really want the hash to be performant, I would suggest modifying it to use CRC/MD5/SHA-1 and do a combination of the first x bytes/blocks and file size.

I tested negative duration pairs (end_time < start_time) causing a event processing failure, as desired. Positive duration pairs cause the file to be trimmed and CDP hosted, and considered unique/distinct from full length videos. Transcription works as expected.

smai-f commented 1 year ago

Thanks everybody! Here are some manual tests using Chris' latest commit in the Montana legislature instance:

evamaxfield commented 1 year ago

Also, thank you for such great work on this! Really nicely done and thank you @smai-f for helping test!

evamaxfield commented 1 year ago

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

cc @BrianL3 @dphoria @smai-f this is a question i would love all of your opinions on

chrisjkhan commented 1 year ago

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

dphoria commented 1 year ago

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

First, great question Eva. :clap: I'm in line with Chris' thoughts; (I hope I'm understanding correctly) At the core, there are 2 fields aded to the model, the video start and end times (right?). At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural. I may at first be slightly confused, but even without any external explanation, I think I would realize that this means, in this case, the video begins 1 hour into the session.

If I'm looking at an example of what's probably the vast majority of the models - 'session start time' == 'video start time', that again wouldn't throw me off, especially if I have at that point, seen an example like above when they are different.

chrisjkhan commented 1 year ago

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

smai-f commented 1 year ago

@evamaxfield From my part consuming this PR in the Montana legislative scraper (and speaking the time/duration language Chris is pointing out of video_start_time, not considering it a datetime):

I think the way Chris has it now makes the most sense from my narrow consuming perspective. Have the consumer decide whether or not they want to augment the session datetime to include the video start time offset or not. I can imagine reasons to support both behaviors related to the start time of the meeting aligning with a schedule written in another place.

dphoria commented 1 year ago

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

Thank you for pointing that out. I missed it!

evamaxfield commented 1 year ago

Thanks for all the discussion and clarification. Yes the only think I think should be changed prior to merge is adding the tiniest bit of documentation to the ingestion model that states the video time is relative to the session time.

I would just put it in the class docstring as we still need to find a better overall project documentation structure haha

chrisjkhan commented 1 year ago

I would just put it in the class docstring...

@evamaxfield Added. Let me know how it looks. I followed the Notes model in the rest of the file. It's a bit verbose.