Evaluate Google Speech-to-Text
Tasks:
Create dataset of a subset of Seattle sessions. Only sessions with transcripts created from closed-caption conversion are included (min: ~1 hour, max: ~2 hours)
Write a function to compare a "ground truth" transcript against a "generated" transcript for WER and word selection / replacement (min: ~8 hours, max: ~16 hours) -- see the comparison sketch after the time summary
Generate speech-to-text versions of all sessions in the dataset (active time: ~4 hours) -- see the transcription sketch after the time summary
Dataset should now have: "session_id", "ground_truth_transcript", "speech_to_text_transcript"
Run the analysis function across the dataset (min: ~2 hours, max: ~8 hours) -- see the batch-analysis sketch after the time summary
Dataset should now have: "session_id", "ground_truth_transcript", "wer", "replacement_counts"
Analyze these results: what's the overall WER, and are there common trends in word replacement? (min: ~16 hours, max: ~40 hours)
Create a system for automated benchmarking of these results as part of continuous integration (min: ~16 hours, max: ~24 hours) -- see the CI test sketch after the time summary
Sum Task Time:
min: 1 + 8 + 4 + 2 + 16 + 16 = 47 hours
max: 2 + 16 + 4 + 8 + 40 + 24 = 94 hours
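
A minimal sketch of the comparison function, assuming both transcripts are plain strings: a word-level edit-distance alignment whose backtrace counts substitutions, so WER and replacement pairs fall out of the same pass. Function and variable names here are illustrative, not existing project code.

    from collections import Counter

    def compare_transcripts(ground_truth: str, generated: str):
        ref = ground_truth.lower().split()
        hyp = generated.lower().split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # match / substitution
        # Backtrace to collect (reference word, generated word) replacements.
        replacements = Counter()
        i, j = len(ref), len(hyp)
        while i > 0 and j > 0:
            if ref[i - 1] == hyp[j - 1]:
                i, j = i - 1, j - 1
            elif dp[i][j] == dp[i - 1][j - 1] + 1:
                replacements[(ref[i - 1], hyp[j - 1])] += 1
                i, j = i - 1, j - 1
            elif dp[i][j] == dp[i - 1][j] + 1:
                i -= 1
            else:
                j -= 1
        wer = dp[len(ref)][len(hyp)] / max(len(ref), 1)
        return wer, replacements

An off-the-shelf package like jiwer could replace the hand-rolled alignment; the backtrace is only needed because we also want the replacement counts.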
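For the speech-to-text generation step, a sketch using the Google Cloud Speech-to-Text v1 Python client. The GCS bucket path, audio encoding, and sample rate are assumptions; session audio would have to be extracted and uploaded ahead of time.

    from google.cloud import speech

    def transcribe_session(gcs_uri: str) -> str:
        client = speech.SpeechClient()
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        )
        audio = speech.RecognitionAudio(uri=gcs_uri)
        # Long-running recognition is required for audio over ~1 minute.
        operation = client.long_running_recognize(config=config, audio=audio)
        response = operation.result(timeout=3600)
        return " ".join(r.alternatives[0].transcript for r in response.results)

e.g. transcribe_session("gs://example-bucket/session_123.wav") -- the bucket and filename here are hypothetical.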
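Running the analysis across the dataset is then one call per row; a sketch assuming the dataset lives in a pandas DataFrame with the columns listed above, reusing the compare_transcripts sketch from earlier.

    import pandas as pd

    def analyze_dataset(df: pd.DataFrame) -> pd.DataFrame:
        pairs = df.apply(
            lambda row: compare_transcripts(
                row["ground_truth_transcript"],
                row["speech_to_text_transcript"],
            ),
            axis=1,
        )
        df["wer"] = [wer for wer, _ in pairs]
        df["replacement_counts"] = [counts for _, counts in pairs]
        return df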
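For the CI piece, one lightweight option is a pytest-style regression test that fails the build when mean WER drifts above a threshold. The results path and threshold below are placeholders to be tuned once the analysis step produces real numbers.

    import pandas as pd

    WER_THRESHOLD = 0.25  # placeholder ceiling, not a measured target

    def test_overall_wer_within_threshold():
        # "benchmark_results.csv" is a hypothetical artifact written by the
        # analysis step; adjust to wherever CI stores the run output.
        results = pd.read_csv("benchmark_results.csv")
        mean_wer = results["wer"].mean()
        assert mean_wer <= WER_THRESHOLD, (
            f"Mean WER {mean_wer:.3f} exceeds threshold {WER_THRESHOLD}"
        )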
Web Scrapers
Tasks:
Create web scraper for Boston (min: ~4 hours, max: ~40 hours) -- Legistar
In general: min ~4 hours for any scraper; max ~40 hours for a Legistar scraper that is missing the video link; max ~80 hours for a non-Legistar scraper -- see the Legistar sketch below
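
A sketch of the Legistar path, using the public Legistar Web API (webapi.legistar.com); the "boston" client slug and the field names are assumptions to verify against the live API. Note the events payload generally does not carry the meeting video link, which is exactly what pushes a scraper toward the higher estimate.

    import requests

    def fetch_events(client: str = "boston", top: int = 100) -> list[dict]:
        # The Legistar Web API accepts OData query parameters such as $top.
        url = f"https://webapi.legistar.com/v1/{client}/events"
        resp = requests.get(url, params={"$top": top})
        resp.raise_for_status()
        return [
            {
                "body": ev.get("EventBodyName"),
                "date": ev.get("EventDate"),
                "detail_url": ev.get("EventInSiteURL"),
            }
            for ev in resp.json()
        ]

Recovering the video link typically means scraping each event's InSite detail page on top of the API call, which is where the extra hours go.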
Annotation and Speakerbox
Tasks:
THEN