ProbablePrime / office-hours

Office Hours hosting keeps vanishing, let's make our own.
https://officehours.probableprime.co.uk/
GNU General Public License v3.0

Automate transcription #2

Open ProbablePrime opened 1 year ago

ProbablePrime commented 1 year ago

Ideally if I PR a new audio file to this repo, it transcribes it, updates the files, re-builds and re-deploys.

I dunno how.

ProbablePrime commented 1 year ago

https://www.linkedin.com/pulse/how-whisper-azure-machine-learning-jake-wang/
https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
https://lightning.ai/pages/community/tutorial/deploy-openai-whisper/

ProbablePrime commented 1 year ago

https://docs.aws.amazon.com/transcribe/latest/dg/subtitles.html seems really cool actually.

I'd upload the audio to the office hours bucket, Transcribe would then transcribe it and populate the bucket with the VTT files etc., and done.

Once a new data set is there we need to re-build.

Might be an idea to reconsider JS-based routing again. yay

ProbablePrime commented 1 year ago

Looking more at this, we'd be looking at:

  1. Prime uploads an ogg to an S3 Bucket
  2. S3 Bucket triggers a Lambda function
  3. Lambda triggers a transcription workflow
  4. Transcription workflow writes files
  5. Somehow re-trigger a build here
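Steps 2 to 4 above could be sketched roughly as the Lambda below. This is only a sketch under assumptions, not what the repo does: the `subtitles/` output prefix, the `en-GB` language code, and the helper name `build_job_request` are all made up for illustration. It uses boto3's `start_transcription_job` with the `Subtitles` option so Transcribe produces a VTT file alongside the JSON transcript.

```python
import re
from urllib.parse import unquote_plus

# Assumptions for illustration only: output prefix and language code.
SUBTITLE_PREFIX = "subtitles/"
LANGUAGE_CODE = "en-GB"


def build_job_request(bucket: str, key: str) -> dict:
    """Build the keyword arguments for Transcribe's StartTranscriptionJob."""
    stem = key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    # Job names may only contain letters, digits, '.', '_' and '-',
    # and must be unique within the AWS account.
    job_name = re.sub(r"[^0-9A-Za-z._-]", "_", stem)
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": "ogg",
        "LanguageCode": LANGUAGE_CODE,
        "OutputBucketName": bucket,
        "OutputKey": f"{SUBTITLE_PREFIX}{job_name}.json",
        # Ask Transcribe to emit WebVTT subtitles as well as the transcript.
        "Subtitles": {"Formats": ["vtt"]},
    }


def handler(event, context):
    """Lambda entry point for S3 ObjectCreated events."""
    import boto3  # imported lazily so build_job_request works without it

    transcribe = boto3.client("transcribe")
    started = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".ogg"):
            continue  # skip Transcribe's own output landing in the bucket
        request = build_job_request(bucket, key)
        transcribe.start_transcription_job(**request)
        started.append(request["TranscriptionJobName"])
    return {"started": started}
```

The `.ogg` check matters if the output goes back into the same bucket, otherwise the written files would trigger the function again.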

For 5, I might have to move the build to use something AWS'ey as I'm not sure if we can trigger an action externally here.
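One option that may avoid moving the build: GitHub's REST API has a `repository_dispatch` endpoint (`POST /repos/{owner}/{repo}/dispatches`) that starts any workflow declaring `on: repository_dispatch`. A minimal sketch, assuming a token with write access to the repo; the event type `transcription-complete` and both function names are hypothetical:

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, token, event_type="transcription-complete"):
    """Build the POST request that fires a repository_dispatch event."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/dispatches"
    body = json.dumps({"event_type": event_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_rebuild(owner, repo, token):
    """Send the dispatch; GitHub answers 204 No Content on success."""
    request = build_dispatch_request(owner, repo, token)
    with urllib.request.urlopen(request) as response:
        return response.status == 204
```

A Lambda at the end of the transcription workflow could call `trigger_rebuild("ProbablePrime", "office-hours", token)`, with the token pulled from wherever secrets are kept.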

ProbablePrime commented 1 year ago

I kicked off a test transcription manually via the AWS portal.

ProbablePrime commented 1 year ago

I made this simpler in #7

So now it's:

  1. On new file in audio/
  2. Run transcription
  3. Output to subtitles/
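The "new file in audio/" check can be sketched as below. This is not the code from #7, just an illustration: the function name `pending_transcriptions`, the suffix list, and the `.vtt` naming are assumptions, and the transcription step itself is left to whatever tool gets used.

```python
from pathlib import Path

# Assumption: which extensions count as audio.
AUDIO_SUFFIXES = {".ogg", ".mp3", ".wav"}


def pending_transcriptions(audio_dir, subtitles_dir):
    """Return (audio file, subtitle target) pairs that still need transcribing."""
    audio_dir, subtitles_dir = Path(audio_dir), Path(subtitles_dir)
    pending = []
    for audio in sorted(audio_dir.iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        # A file counts as done once a matching .vtt exists in subtitles/.
        target = subtitles_dir / f"{audio.stem}.vtt"
        if not target.exists():
            pending.append((audio, target))
    return pending
```

Running `pending_transcriptions("audio", "subtitles")` from the repo root would list only the files with no subtitles yet, so re-running the job is cheap.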