this is really amazing, thanks for all this!! Perhaps dreams do come true.....
This is really cool!
Also, I don't know if it helps at all, but I also listen to this podcast: https://talkpython.fm/friends-of-the-show/sponsors
And they talk about how they use: https://talkpython.fm/assemblyai
Also, for a transcript format @thechangelog has a really cool way they visualize it (but theirs is all done by hand (from my understanding)): https://changelog.com/backstage/24#transcript
And they have all their stuff on GH as well: https://github.com/thechangelog/transcripts/blob/master/backstage/backstage-24.md
They also had an episode about a tool that one of their members made to help with replacing words in bulk: https://changelog.com/backstage/21
@elreydetoda

> And they talk about how they use: https://talkpython.fm/assemblyai
I'll give it a try :) Pricing looks reasonable. I gave it a test drive and documented the results in https://gist.github.com/pagdot/0a5f3037c6ee2d9ba5d5088af8f9a67d
On first glance it looks better than YT + Watson (and has proper punctuation). Taking a deeper look, both have issues in different places. This could maybe be improved by combining the results of multiple engines somehow? That would need a lot of work though.
If punctuation is important, AssemblyAI is the way to go, since YT and Watson don't do it at all; otherwise YT + Watson seem to be a bit better.
> Also, for a transcript format https://github.com/thechangelog has a really cool way they visualize it (but theirs is all done by hand (from my understanding)): https://changelog.com/backstage/24#transcript
Looks nice, although IMO it is very stretched out/takes a lot of space. Personally I'd prefer something more compact :)
Ya, I could see that. I'm curious if they do that to make sure that when you share a link, the right part of the transcript shows up when someone navigates to it. 🤔
Like this link you can get by clicking on one of their names: https://changelog.com/backstage/24#transcript-22
I could imagine it being IRC/chat-style, with the speaker's name on the left of the transcript.
Also, changelog has a Practical AI podcast which might have some more solutions for this. I haven't listened to a lot, but it is pretty good and they talk about some pretty cool technical details. They might have an alternative we could use in their back catalog: https://changelog.com/search?q=practical+ai+text
Whisper by OpenAI — Robust Speech Recognition via Large-Scale Weak Supervision
Hello y'all! I've been looking into #26 and wanted to test the search using transcriptions. I wanted to post my findings here.
I've previously worked with DeepSpeech from Mozilla, but I found out that the project has been deprecated. Currently, the go-to solution seems to be Coqui STT. I've managed to get a full transcription of the example episode (Linux Action News 264) using a Rust program; you can read the resulting transcript here: output.txt
My conclusions:
Please let me know if you are interested in investigating it further. Maybe some deep-learning nerd would love to join?
quickly: this sounds amazing @FlakM and yes - let's chase this further!
Over the weekend I've added VAD splitting and instructions on how to run the inference. I've also created a couple of issues in the repository if someone wants to help.
The biggest problem now is building a dedicated language model (scorer) better suited to the domain. If you know of some good resources with a large amount of easily downloadable text about Linux and technology, please leave a comment in the issue :+1:
BTW the instructions are relatively straightforward for someone who has dealt with Python. If someone wants to pick this issue up I would be happy to help! :100:
EDIT:
I've found an amazing library, a C/C++ port of Whisper: https://github.com/ggerganov/whisper.cpp. Somehow I had missed that OpenAI published the model itself, not just a web API. It handles the VAD, language model and punctuation in one go...
There are a couple of versions of models:
The idea, for now, is to run the base model on a larger set (all?) of the files and see how it works with search ;)
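As a reference point, the same idea through the Python `openai-whisper` package rather than whisper.cpp (the file name is just a placeholder):

```python
# Minimal sketch using the Python `openai-whisper` package instead of whisper.cpp.
# A single call gives segmentation, punctuation and timestamps in one go.
import whisper

model = whisper.load_model("base")  # or "tiny", "small", "medium", "medium.en", ...
result = model.transcribe("linux-action-news-264.mp3")  # placeholder file name

for seg in result["segments"]:
    print(f'[{seg["start"]:8.2f} -> {seg["end"]:8.2f}]{seg["text"]}')
```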
These transcriptions are impressively accurate! Nice work! We'll see what we can do to drum up a little more interest in helping here.... very promising!!
Just to clarify, sounds like if using Whisper we won't need to tackle the arduous task of "building a dedicated language model" - that sounds like great news!
Correct?? From what I can tell, even the base transcription is impressively accurate.
Just wow on the medium model. If we could add speaker diarization to it, it would be pretty great. This could even be done by a different model and merged in afterwards. Maybe I'll take a look at it in the near future :)
@gerbrent @pagdot precisely! I've got to admit that I was left speechless by the transcription quality (pun intended).
I have even more good news! I've managed to compile the code on my wife's M1 Mac and it is actually super quick! Just 12s for 60s of audio using the medium model. With some corners cut I'm pretty close to preparing a full-fledged solution.
Do you have any idea how would you like to consume/present this data?
that is a great question, worthy of much thought and intention.
I believe @noblepayne might be a great resource here.
let us get back to you..
@FlakM I'll have to give it a try. I could imagine using a second framework for speaker diarisation and merging the results, similar to how I did it in my first experiment. Finding time and motivation to work on it isn't that easy for me right now :(
@pagdot this sounds like a great plan. Also, no pressure on the timeline, it is a perfectly separate feature :+1: Working without motivation and free time sucks... Take good care of yourself! :four_leaf_clover:
@FlakM I'm fine :) It's just that a combination of work, a few hobbies and spending time having fun with friends takes most of my time and energy right now :) I'm just a bit drained in the remaining spare time. Thanks for caring about my wellbeing :heart: I'm doing well, it's just not easy right now to invest time and energy here too.
Working as a software developer, I tend to prioritize non-dev work in my spare time :)
That's good to hear, you never know on the Internet.
Returning to the topic: I just found a very interesting discussion for Whisper describing how to combine Whisper with pyannote.audio. I'll try to get it working in the evening. I'll probably do this in Python, using a script or a notebook on Colab.
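A rough sketch of how that combination could look (the model names, HF token and the overlap heuristic are assumptions on my part, not a tested pipeline):

```python
# Rough sketch: transcribe with whisper, diarize with pyannote.audio, then attach
# the speaker label with the largest time overlap to each whisper segment.
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("medium.en")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="HF_TOKEN")  # placeholder token

audio_file = "coder-radio-331.wav"  # placeholder file name
transcript = asr.transcribe(audio_file)
diarization = diarizer(audio_file)

def speaker_for(start, end):
    """Pick the diarization label overlapping [start, end] the most."""
    best, best_overlap = "UNKNOWN", 0.0
    for turn, _, label in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

for seg in transcript["segments"]:
    print(speaker_for(seg["start"], seg["end"]), seg["text"])
```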
@FlakM With your continued contributions and research you've really helped me to get motivated again to work on this subject ❤️
@pagdot you can find the output that `jupiter-search` is currently able to generate (tiny model on a laptop) in BlueIsTheNewRedCoderRadio331.log.
Please review it and let me know if you think I should add something. Otherwise, I'll be working over the weekend to run the whole back catalog on some beefy machine on Linode :+1:
"stop": 3300,
"text": " Hi everybody and welcome to Coder Radio.[_TT_1300]"
},
{
"start": 3300,
"stop": 4000,
"text": "[_BEG_] I'm Roberta Broadcasting's Weekly Talk Show taking a pragmatic look at the art and...
Roberta Broadcasting ; )
Does it make any sense @FlakM to run the base/medium quality on the back catalogue? And if so, can we help provide some Linode goodness for the task?
Just gave a sample Colab notebook containing Whisper (medium.en model) and pyannote.audio a try and adapted it: https://colab.research.google.com/drive/1WoStXlztP3Lv0jQy5w-zupK2VtlKvXuZ?usp=sharing
I've uploaded the result to https://pag.one/jb/coderRadio331.html for everyone to take a look :)
It seems to have issues with Jar Jar and with Chris's and Mike's speech overlapping or almost overlapping. Pretty sure there are a few ways to improve the result, but something like this would at least be a good starting point and good enough for now/better than nothing.
I'd need to take a look at whether pyannote supports a global speaker state, so you would only need to label new speakers per episode, but not Chris, Mike, Wes, Brent and all the other regular speakers.
Instead of the HTML I'd generate a JSON file per episode, which would be easy to digest by further tools, so the visualization could be adapted without having to rerun everything.
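Roughly what I have in mind for that file (field names are just a suggestion, shown here as the Python that would write it):

```python
# Rough idea of a per-episode JSON file (field names are only a suggestion).
import json

episode = {
    "show": "Coder Radio",
    "episode": 331,
    "segments": [
        {"start": 0.0, "end": 3.3, "speaker": "SPEAKER_00",
         "text": "Hi everybody and welcome to Coder Radio."},
        # ... one entry per whisper segment, with the diarization label attached
    ],
}

with open("coder-radio-331.json", "w") as f:
    json.dump(episode, f, indent=2, ensure_ascii=False)
```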
@pagdot this looks very cool. Did I understand correctly that the model is not able to classify specific speakers using some kind of transfer learning and downstream task tuning? So we have no way to map classes (`speaker_1`/`speaker_2`) to the hosts without manual intervention, and the classes can be different on each inference (probably assigned in chronological order)?
Followup question, how comfortable are you with pytorch and deep learning? :P
@gerbrent I have a first unpolished version of the code that is able to utilize whatever we're able to throw at it. But maybe it's worth considering whether we'd rather wait for a version that includes speaker recognition. If the cost of running it through the entire back catalog again isn't a problem, we can do it twice :shrug:
Rough estimates:
There are 1204 episodes (could have sworn it was 2k yesterday) in the all-shows RSS feed. Let's assume they are 1h each and that we are using the medium model.

- total audio length: 1204h
- speedup: x2 (medium model, according to the website)
- assume a single audio inference needs 6 threads and the medium model takes 5GB of RAM
- take a dedicated 48-core CPU with 96GB of RAM ($1.08 per hour)
- so we get a further x8 speedup (we can run up to 8 workers) with plenty of RAM to spare

That gives about 75h of processing.
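Spelled out, the back-of-the-envelope arithmetic looks roughly like this:

```python
# Back-of-the-envelope estimate from the numbers above.
episodes = 1204            # all-shows RSS feed
hours_per_episode = 1.0    # assumed average length
realtime_speedup = 2.0     # medium model, ~2x realtime per worker (6 threads)
workers = 48 // 6          # 48-core box, 6 threads per inference -> 8 parallel workers
price_per_hour = 1.08      # dedicated 48 core / 96 GB machine

total_audio_h = episodes * hours_per_episode                 # 1204 h of audio
wall_clock_h = total_audio_h / (realtime_speedup * workers)  # ~75 h
print(f"~{wall_clock_h:.0f} h wall clock, ~${wall_clock_h * price_per_hour:.0f} total")
```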
And this is all provided that the Fireside hosting works perfectly and I didn't make some dumb mistake; also, the speedup of the model might be bigger or smaller depending on the CPU itself (it was much lower on my old Dell). The exact time could be estimated better by running on the actual machine.
One last thing: once I polish the code there is still the online side of things (new episodes). Did you get a chance to discuss how we're planning to mix the transcriptions into the site/ecosystem? I can suggest two solutions:
PS: I've listened to office hours and felt like a star, thank you! I'm glad to give back a little <3
@FlakM I would expect there to be a way to detect speakers across multiple inferences, I just haven't invested the time yet to find out how it would work. I haven't worked with PyTorch on my own yet. I've tried working with machine learning/deep learning before: I'm able to copy-paste code pieces, but I'm not interested in (or any good at) doing the real work myself. I have a hard time grasping it and prefer to use it from a user perspective, or integrate it into other code as a developer, rather than digging into how it works internally.
@pagdot same here... Using the same data between different inferences would involve synchronized access to some shared state, so it would definitely limit the parallelism. But maybe it would only be a short critical section.
There is https://github.com/pyannote/pyannote-audio/issues/391 on the topic of a unique speaker id
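One possible alternative to sharing state between runs (untested, model name and threshold are assumptions): extract a speaker embedding per diarization turn and compare it against reference embeddings of the known hosts, something like:

```python
# Untested sketch: map anonymous diarization turns to known hosts via speaker embeddings
# instead of sharing state between inference runs. Model name, reference files and the
# 0.5 threshold are assumptions.
from pyannote.audio import Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine

embed = Inference("pyannote/embedding", window="whole")

# Reference embeddings, extracted once from clips where a host speaks alone.
hosts = {
    "chris": embed.crop("chris_reference.wav", Segment(0, 10)),
    "mike":  embed.crop("mike_reference.wav",  Segment(0, 10)),
}

def identify(episode_wav, start, end, threshold=0.5):
    """Return the closest host by cosine distance, or 'unknown' if nothing is close."""
    turn = embed.crop(episode_wav, Segment(start, end))
    name, dist = min(((n, cosine(turn, ref)) for n, ref in hosts.items()),
                     key=lambda pair: pair[1])
    return name if dist < threshold else "unknown"
```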
I want to drop https://github.com/m-bain/whisperX here. It uses a different STT model for accurate timestamps and Whisper for the actual STT.
I've got 2 weeks off after Christmas and plan to work a bit on this topic :)
@pagdot I'm building a new PC next week and will probably also come back to this topic to test the new hardware. Could you please tell me what the use case for more precise timings would be? Recently folks reported a significant speedup of CPU inference using some fancy post-training optimisations: https://github.com/openai/whisper/discussions/454
Timings would just be a nice-to-have so that the timestamps are better, no advantage besides that. I still have to take a look at the CPU inference optimisation. I haven't had the time yet to work on this; I had more tech issues to deal with than expected.
@pagdot I know the pain... I've refreshed my project and it seems there is ongoing work to improve timestamps in the Whisper model itself, which points to a draft that might solve the underlying issue. There is also work to incorporate GPU support using nvblas - it seems like a super strong library :+1:
The only thing missing is speaker recognition. It would definitely be quite cool to show a UI like you did previously, but maybe an SRT-style output (timestamps and text) would be enough?
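If SRT turns out to be enough, the Whisper segments map onto it almost directly; a quick sketch (file names are placeholders):

```python
# Quick sketch: write whisper segments out as an .srt file (timestamps + text only).
import whisper

def srt_time(t: float) -> str:
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int((t - int(t)) * 1000):03}"

result = whisper.load_model("medium.en").transcribe("self-hosted-87.mp3")  # placeholder
with open("self-hosted-87.srt", "w") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
                  f"{seg['text'].strip()}\n\n")
```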
> It would definitely be quite cool to show a UI like you did previously, but maybe an SRT-style output (timestamps and text) would be enough?
I didn't do the UI, I just copied it. I'm not that good with (web) UI design, so I'd start by creating an intermediate JSON file or something, which could then be used in a further step.
Ok so I finally got my PC built up :partying_face: and here is a package of the latest episodes' transcriptions using the medium.en model: example_transcriptions.tar.gz. A single file is also attached for convenient reading on mobile: Jellyfin January | Self-Hosted 87.txt. Let's just say that it has been a decent load test for my hardware:
I thought of mimicking the transcription page of Embedded.fm. They use the paid service rev.com ($1.50 per minute! :dagger:), but our presentation layer could be very similar (for now without speaker identification for the initial version).
We could either include it on the main episode page below the `Episode links` section, or at a separate URL, i.e. https://www.jupiterbroadcasting.com/show/self-hosted/88/transcripts. In general, I'd have it as a static resource generated by Hugo from the JSON in the repo, to enable open source contributions that fix errors in the Whisper transcriptions (we could even have an edit button that launches straight into the GitHub online editor).
Just like embedded.fm, if the transcript is not yet ready we can show a message that it will be available soon. As a bonus we can serve the same transcriptions in the dedicated transcript tag from the Podcasting 2.0 RSS spec, and as a cherry on top link the timestamps to the specific moment in the web player, if that's possible <3
Also notice that the transcription files contain additional metadata that I can scrape from the mp3 itself (episode images, duration, chapters, authors, etc.). I parse it to enable smarter search in the future, but maybe it can also be used somehow to enhance the website? Since I need to download the files for transcription anyway, I can do some extra processing now if you have ideas.
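For illustration, scraping that metadata could look something like this in Python with mutagen (just a sketch; the actual tooling lives in the Rust jupiter-search code, and the file name is a placeholder):

```python
# Illustration only: reading episode metadata straight from the mp3 with mutagen.
from mutagen.mp3 import MP3
from mutagen.id3 import ID3

path = "self-hosted-87.mp3"  # placeholder file name
info = MP3(path).info
tags = ID3(path)

print("duration (s):", round(info.length))
print("title:", tags.get("TIT2"))
print("author:", tags.get("TPE1"))
print("chapters:", [frame.element_id for frame in tags.getall("CHAP")])
```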
@gerbrent @pagdot I could use some guidance on how to proceed. For now, I plan to finish running transcription on the rest of the episodes using the `medium.en` model. Then I can take it further and build some POC - but honestly, I'd prefer to help someone else with it and proceed with the search based on the transcriptions. Maybe you have someone in mind eager to ship an awesome feature?
I've got stuck because I'm no UI/UX engineer and not that good with HTML/CSS in general. The sample I posted above was copied/adapted from an example. The other snag is that I'd prefer to extract a speaker ID which could be used by someone more familiar with AI to cluster the speaker IDs from multiple episodes, so you wouldn't have to map them on each episode.
pyannote doesn't expose that in its current form, afaik, though. I may take a look at it and add episode-specific speaker IDs at least. I'm pretty sure that if the community works together, it should be possible to map them in a reasonable amount of time.
Recently I dug into another project of mine, but maybe I'll find the motivation again to finish things up to some degree here :)
My next steps would be:
I think, given how little time I have to spend on this, we should leave the speaker ID as a separate issue. I feel like it's super valuable, but between my day job and two kids I have very little time to investigate it right now. What do you think about opening a new separate issue?
As for UI/UX, I might prepare a rough first-draft MR; maybe someone will pick it up.
Ok so I've created some initial work to include transcripts on the website in my branch: https://github.com/FlakM/jupiterbroadcasting.com/tree/transcripts - it should definitely be cleaned up by someone, but it's a starting point :)
Some convo around transcriptions (don't have time/competency to summarize :grin:):
Hi everyone, I got a chance to briefly speak with @gerbrent at the Mt. Vernon meet-up and I wanted to offer whatever I can to help with the transcriptions. I am very hard-of-hearing and wear hearing aids in both ears. I catch about 90% of what Chris says, 80% of Wes, and about 60% of Brent (sorry, Brent). I have degenerative hearing loss, so I will continue to lose my hearing until deafness at some point. I've been listening to LUP since about episode 50 and used to hear everything clear as day.
I am not that technical (former marketing, so I know my way around HTML/CSS and can write killer Facebook posts). I know I can help with feedback, proofreading, and UI/UX.
Here is what is important for me as a hearing disabled person:
I've investigated doing transcriptions for JB shows in the past without reaching a satisfying conclusion yet.
What I'd expect from a decent transcription would be:
A few services I took a look at:
pyannote.audio
I tested a combination of YouTube and IBM Watson (free tier) in the past: https://gist.github.com/pagdot/3b39187c6e0ca18dedd1f1108338855f
The result was... ok. Not great, but better than nothing.
In my Google Colab there is also a test with DeepSpeech by Mozilla.
If anyone is interested in also taking a look, Google Colab is a great way to test on a big GPU offered by Google, and there are often example Colab projects, either by the projects themselves or by the community.
Either way, a platform to run the transcription on in production would be required, and maybe even a way to contribute to the transcripts' quality. I could imagine pushing the results to this or another git repository, so that the community can make PRs with fixes.
Edit:
2022-08-18: Fixed YouTube entry in table (sadly it has no punctuation); added entry for AssemblyAI