this is really amazing, thanks for all this!! Perhaps dreams do come true.....
This is really cool!
Also, I don't know if it helps at all, but I also listen to this podcast: https://talkpython.fm/friends-of-the-show/sponsors
And they talk about how they use: https://talkpython.fm/assemblyai
Also, for a transcript format @thechangelog has a really cool way they visualize it (but theirs is all done by hand (from my understanding)): https://changelog.com/backstage/24#transcript
And they have all their stuff on GH as well: https://github.com/thechangelog/transcripts/blob/master/backstage/backstage-24.md
They also had an episode about a tool that one of their members made to help with replacing words in bulk: https://changelog.com/backstage/21
@elreydetoda

> And they talk about how they use: https://talkpython.fm/assemblyai
I'll give it a try :) Pricing looks reasonable. I gave it a test drive and documented the results in https://gist.github.com/pagdot/0a5f3037c6ee2d9ba5d5088af8f9a67d
On first glance it looks better than YT + Watson (and has proper punctuation). Taking a deeper look, both have issues in different places. This could maybe be improved by combining the results of multiple engines somehow? That would need a lot of work though.
If punctuation is important, AssemblyAI is the way to go, since YT and Watson don't do it at all; otherwise YT + Watson seem to be a bit better.
> Also, for a transcript format https://github.com/thechangelog has a really cool way they visualize it (but theirs is all done by hand (from my understanding)): https://changelog.com/backstage/24#transcript
Looks nice, although IMO it is very stretched out/takes a lot of space. Personally I'd prefer something more compact :)
Ya, I could see that. I'm curious if they do that to make sure that when you share a link, the right part of the transcript shows up when someone navigates to it. 🤔
Like this link you can get by clicking on one of their names: https://changelog.com/backstage/24#transcript-22
I could imagine it being IRC/chat-style, with the speaker's name on the left of the transcript.
Also, changelog has a Practical AI podcast which might have some more solutions for this. I haven't listened to a lot, but it is pretty good and they talk about some pretty cool technical details. They might have an alternative we could use in their back catalog: https://changelog.com/search?q=practical+ai+text
Whisper by OpenAI — Robust Speech Recognition via Large-Scale Weak Supervision
Hello y'all! I've been looking into #26 and wanted to test the search using transcriptions. I wanted to post my findings here.
I've previously worked with DeepSpeech from Mozilla, but I found out that the project has been deprecated. Currently, the go-to solution seems to be Coqui STT. I've managed to get a full transcription of the example episode (Linux Action News 264) using a Rust program; you can read the resulting transcript here: output.txt
My conclusions:
Please let me know if you are interested in investigating it further. Maybe some deep-learning nerd would love to join?
quickly: this sounds amazing @FlakM and yes - let's chase this further!
Over the weekend I've added VAD splitting and instructions on how to run the inference. I've also created a couple of issues in the repository if someone wants to help.
The biggest problem now is building a dedicated language model (scorer) better suited to the domain. If you know of some good resources with a large amount of easily downloadable text about Linux and technology, please leave a comment in the issue :+1:
BTW the instructions are relatively straightforward for someone who has dealt with Python. If someone wants to pick this issue up I would be happy to help! :100:
EDIT:
I've found an amazing library, a C/C++ port of Whisper: https://github.com/ggerganov/whisper.cpp. Somehow I had missed that OpenAI published the model itself, not just a web API. It handles the VAD, language model and punctuation in one go...
There are a couple of versions of models:
The idea, for now, is to run the base model on a larger set (all?) of the files and see how it works with search ;)
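As a reference point, the same idea through the Python `openai-whisper` package rather than whisper.cpp (the file name is just a placeholder):

```python
# Minimal sketch using the Python `openai-whisper` package instead of whisper.cpp.
# A single call gives segmentation, punctuation and timestamps in one go.
import whisper

model = whisper.load_model("base")  # or "tiny", "small", "medium", "medium.en", ...
result = model.transcribe("linux-action-news-264.mp3")  # placeholder file name

for seg in result["segments"]:
    print(f'[{seg["start"]:8.2f} -> {seg["end"]:8.2f}]{seg["text"]}')
```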
These transcriptions are impressively accurate! Nice work! We'll see what we can do to drum up a little more interest in helping here.... very promising!!
Just to clarify, sounds like if using Whisper we won't need to tackle the arduous task of "building a dedicated language model" - that sounds like great news!
Correct?? From what I can tell, even the base transcription is impressively accurate.
Just wow on the medium model. If we could add speaker diarization to it, it would be pretty great. This could even be done by a different model and merged in afterwards. Maybe I'll take a look at it in the near future :)
@gerbrent @pagdot precisely! I've got to admit that I was left speechless by the transcription quality (pun intended).
I have even more good news! I've managed to compile the code on my wife's M1 Mac and it is actually super quick! Just 12s for 60s of audio using the medium model. With some corners cut I'm pretty close to preparing a full-fledged solution.
Do you have any idea how would you like to consume/present this data?
that is a great question, worthy of much thought and intention.
I believe @noblepayne might be a great resource here.
let us get back to you..
@FlakM I'll have to give it a try. I could imagine using a second framework for speaker diarisation and merging the results, similar to how I did it in my first experiment. Finding time and motivation to work on it isn't that easy for me right now :(
@pagdot this sounds like a great plan. Also, no pressure on the timeline, it is a perfectly separate feature :+1: Working without motivation and free time sucks... Take good care of yourself! :four_leaf_clover:
@FlakM I'm fine :) It's just that a combination of work, a few hobbies and spending time having fun with friends takes most of my time and energy right now :) I'm just a bit drained in the remaining spare time. Thanks for caring about my wellbeing :heart: I'm doing well, it's just not easy right now to invest time and energy here too.
Working as a software developer, I tend to prioritize non-dev work in my spare time :)
That's good to hear, you never know on the Internet.
Returning to the topic: I just found a very interesting discussion for Whisper describing how to combine Whisper with pyannote.audio. I'll try to get it working in the evening. I'll probably do this in Python, using a script or a notebook on Colab.
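A rough sketch of how that combination could look (the model names, HF token and the overlap heuristic are assumptions on my part, not a tested pipeline):

```python
# Rough sketch: transcribe with whisper, diarize with pyannote.audio, then attach
# the speaker label with the largest time overlap to each whisper segment.
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("medium.en")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="HF_TOKEN")  # placeholder token

audio_file = "coder-radio-331.wav"  # placeholder file name
transcript = asr.transcribe(audio_file)
diarization = diarizer(audio_file)

def speaker_for(start, end):
    """Pick the diarization label overlapping [start, end] the most."""
    best, best_overlap = "UNKNOWN", 0.0
    for turn, _, label in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

for seg in transcript["segments"]:
    print(speaker_for(seg["start"], seg["end"]), seg["text"])
```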
@FlakM With your continued contributions and research you've really helped me to get motivated again to work on this subject ❤️
@pagdot you can find the output that `jupiter-search` is currently able to generate (tiny model on a laptop) in BlueIsTheNewRedCoderRadio331.log.
Please review it and let me know if you think I should add something. Otherwise, I'll be working over the weekend to run the whole back catalog on some beefy machine on Linode :+1:
"stop": 3300,
"text": " Hi everybody and welcome to Coder Radio.[_TT_1300]"
},
{
"start": 3300,
"stop": 4000,
"text": "[_BEG_] I'm Roberta Broadcasting's Weekly Talk Show taking a pragmatic look at the art and...
Roberta Broadcasting ; )
Does it make any sense @FlakM to run the base/medium quality on the back catalogue? And if so, can we help provide some Linode goodness for the task?
Just gave a sample Colab notebook containing Whisper (medium.en model) and pyannote.audio a try and adapted it: https://colab.research.google.com/drive/1WoStXlztP3Lv0jQy5w-zupK2VtlKvXuZ?usp=sharing
I've uploaded the result to https://pag.one/jb/coderRadio331.html for everyone to take a look :)
It seems to have issues with Jar Jar and with Chris's and Mike's speech overlapping or almost overlapping. Pretty sure there are a few ways to improve the result, but something like this would at least be a good starting point and good enough for now/better than nothing.
I'd need to take a look at whether pyannote supports a global speaker state, so you would only need to label new speakers per episode, but not Chris, Mike, Wes, Brent and all the other regular speakers.
Instead of the HTML I'd generate a JSON file per episode, which would be easy to digest by further tools, so the visualization could be adapted without having to rerun everything.
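Roughly what I have in mind for that file (field names are just a suggestion, shown here as the Python that would write it):

```python
# Rough idea of a per-episode JSON file (field names are only a suggestion).
import json

episode = {
    "show": "Coder Radio",
    "episode": 331,
    "segments": [
        {"start": 0.0, "end": 3.3, "speaker": "SPEAKER_00",
         "text": "Hi everybody and welcome to Coder Radio."},
        # ... one entry per whisper segment, with the diarization label attached
    ],
}

with open("coder-radio-331.json", "w") as f:
    json.dump(episode, f, indent=2, ensure_ascii=False)
```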
@pagdot this looks very cool. Did I understand correctly that the model is not able to classify specific speakers using some kind of transfer learning and downstream task tuning? So we have no way to map classes (`speaker_1`/`speaker_2`) to the hosts without manual intervention, and the classes can be different on each inference (probably assigned in chronological order)?
Followup question, how comfortable are you with pytorch and deep learning? :P
@gerbrent I have a first unpolished version of the code that is able to utilize whatever we're able to throw at it. But maybe it's worth considering whether we'd rather wait for a version that includes speaker recognition. If the cost of running it through the entire back catalog again isn't a problem, we can do it twice :shrug:
Rough estimates:
There are 1204 episodes (could have sworn it was 2k yesterday) in the all-shows RSS feed. Let's assume they are 1h each and that we are using the medium model.

- total audio length: 1204h
- speedup: x2 (medium model, according to the website)
- assume a single audio inference needs 6 threads and the medium model takes 5GB of RAM
- take a dedicated 48-core CPU with 96GB of RAM ($1.08 per hour)
- so we get a further x8 speedup (we can run up to 8 workers) with plenty of RAM to spare

That gives about 75h of processing.
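Spelled out, the back-of-the-envelope arithmetic looks roughly like this:

```python
# Back-of-the-envelope estimate from the numbers above.
episodes = 1204            # all-shows RSS feed
hours_per_episode = 1.0    # assumed average length
realtime_speedup = 2.0     # medium model, ~2x realtime per worker (6 threads)
workers = 48 // 6          # 48-core box, 6 threads per inference -> 8 parallel workers
price_per_hour = 1.08      # dedicated 48 core / 96 GB machine

total_audio_h = episodes * hours_per_episode                 # 1204 h of audio
wall_clock_h = total_audio_h / (realtime_speedup * workers)  # ~75 h
print(f"~{wall_clock_h:.0f} h wall clock, ~${wall_clock_h * price_per_hour:.0f} total")
```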
And this is all provided that the Fireside hosting works perfectly and I didn't make some dumb mistake; also, the speedup of the model might be bigger or smaller depending on the CPU itself (it was much lower on my old Dell). The exact time could be estimated better by running on the actual machine.
One last thing: once I polish the code there is still the online side of things (new episodes). Did you get a chance to discuss how we're planning to mix the transcriptions into the site/ecosystem? I can suggest two solutions:
PS: I've listened to office hours and felt like a star, thank you! I'm glad to give back a little <3
@FlakM I would expect there to be a way to detect speakers across multiple inferences, I just haven't invested the time yet to find out how it would work. I haven't worked with PyTorch on my own yet. I've tried working with machine learning/deep learning before: I'm able to copy-paste code pieces, but I'm not interested in (or any good at) doing the real work myself. I have a hard time grasping it and prefer to use it from a user perspective, or integrate it into other code as a developer, rather than digging into how it works internally.
@pagdot same here... Using the same data between different inferences would involve synchronized access to some shared state, so it would definitely limit the parallelism. But maybe it would only be a short critical section.
There is https://github.com/pyannote/pyannote-audio/issues/391 on the topic of a unique speaker id
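One possible alternative to sharing state between runs (untested, model name and threshold are assumptions): extract a speaker embedding per diarization turn and compare it against reference embeddings of the known hosts, something like:

```python
# Untested sketch: map anonymous diarization turns to known hosts via speaker embeddings
# instead of sharing state between inference runs. Model name, reference files and the
# 0.5 threshold are assumptions.
from pyannote.audio import Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine

embed = Inference("pyannote/embedding", window="whole")

# Reference embeddings, extracted once from clips where a host speaks alone.
hosts = {
    "chris": embed.crop("chris_reference.wav", Segment(0, 10)),
    "mike":  embed.crop("mike_reference.wav",  Segment(0, 10)),
}

def identify(episode_wav, start, end, threshold=0.5):
    """Return the closest host by cosine distance, or 'unknown' if nothing is close."""
    turn = embed.crop(episode_wav, Segment(start, end))
    name, dist = min(((n, cosine(turn, ref)) for n, ref in hosts.items()),
                     key=lambda pair: pair[1])
    return name if dist < threshold else "unknown"
```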
I want to drop https://github.com/m-bain/whisperX here. It uses a different STT model for accurate timestamps and Whisper for the actual STT.
I've got 2 weeks off after Christmas and plan to work a bit on this topic :)
@pagdot I'm building a new PC next week and will probably also come back to this topic to test the new hardware. Could you please tell me what the use case for more precise timings would be? Recently folks reported a significant speedup of CPU inference using some fancy post-training optimisations: https://github.com/openai/whisper/discussions/454
Timings would just be a nice-to-have so that the timestamps are better, no advantage besides that. I still have to take a look at the CPU inference optimisation. I haven't had the time yet to work on this; I had more tech issues to deal with than expected.
@pagdot I know the pain... I've refreshed my project and it seems there is ongoing work to improve timestamps in the Whisper model itself, which points to a draft that might solve the underlying issue. There is also work to incorporate GPU support using nvblas - it seems like a super strong library :+1:
The only thing missing is speaker recognition. It would definitely be quite cool to show a UI like you did previously, but maybe an SRT-style output (timestamps and text) would be enough?
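If SRT turns out to be enough, the Whisper segments map onto it almost directly; a quick sketch (file names are placeholders):

```python
# Quick sketch: write whisper segments out as an .srt file (timestamps + text only).
import whisper

def srt_time(t: float) -> str:
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int((t - int(t)) * 1000):03}"

result = whisper.load_model("medium.en").transcribe("self-hosted-87.mp3")  # placeholder
with open("self-hosted-87.srt", "w") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
                  f"{seg['text'].strip()}\n\n")
```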
> It would definitely be quite cool to show a UI like you did previously, but maybe an SRT-style output (timestamps and text) would be enough?
I didn't do the UI, I just copied it. I'm not that good with (web) UI design, so I'd start by creating an intermediate JSON file or something, which could then be used in a further step.
Ok so I finally got my PC built up :partying_face: and here is a package of the latest episodes' transcriptions using the medium.en model: example_transcriptions.tar.gz. A single file is also attached for convenient reading on mobile: Jellyfin January | Self-Hosted 87.txt. Let's just say that it has been a decent load test for my hardware:
I thought of mimicking the transcription page of Embedded.fm. They use the paid service rev.com ($1.50 per minute! :dagger:), but our presentation layer could be very similar (for now without speaker identification for the initial version).
We could either include it on the main episode page below the `Episode links` section, or at a separate URL, i.e. https://www.jupiterbroadcasting.com/show/self-hosted/88/transcripts. In general, I'd have it as a static resource generated by Hugo from the JSON in the repo, to enable open source contributions that fix errors in the Whisper transcriptions (we could even have an edit button that launches straight into the GitHub online editor).
Just like embedded.fm, if the transcript is not yet ready we can show a message that it will be available soon. As a bonus we can serve the same transcriptions in the dedicated transcript tag from the Podcasting 2.0 RSS spec, and as a cherry on top link the timestamps to the specific moment in the web player, if that's possible <3
Also notice that the transcription files contain additional metadata that I can scrape from the mp3 itself (episode images, duration, chapters, authors, etc.). I parse it to enable smarter search in the future, but maybe it can also be used somehow to enhance the website? Since I need to download the files for transcription anyway, I can do some extra processing now if you have ideas.
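For illustration, scraping that metadata could look something like this in Python with mutagen (just a sketch; the actual tooling lives in the Rust jupiter-search code, and the file name is a placeholder):

```python
# Illustration only: reading episode metadata straight from the mp3 with mutagen.
from mutagen.mp3 import MP3
from mutagen.id3 import ID3

path = "self-hosted-87.mp3"  # placeholder file name
info = MP3(path).info
tags = ID3(path)

print("duration (s):", round(info.length))
print("title:", tags.get("TIT2"))
print("author:", tags.get("TPE1"))
print("chapters:", [frame.element_id for frame in tags.getall("CHAP")])
```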
@gerbrent @pagdot I could use some guidance on how to proceed. For now, I plan to finish running transcription on the rest of the episodes using the `medium.en` model. Then I can take it further and build some POC - but honestly, I'd prefer to help someone else with it and proceed with the search based on the transcriptions. Maybe you have someone in mind eager to ship an awesome feature?
I've got stuck because I'm no UI/UX engineer and not that good with HTML/CSS in general. The sample I posted above was copied/adapted from an example. The other snag is that I'd prefer to extract a speaker ID which could be used by someone more familiar with AI to cluster the speaker IDs from multiple episodes, so you wouldn't have to map them on each episode.
pyannote doesn't expose that in its current form, afaik, though. I may take a look at it and add episode-specific speaker IDs at least. I'm pretty sure that if the community works together, it should be possible to map them in a reasonable amount of time.
Recently I dug into another project of mine, but maybe I'll find the motivation again to finish things up to some degree here :)
My next steps would be:
I think, given how little time I have to spend on this, we should leave the speaker ID as a separate issue. I feel like it's super valuable, but between my day job and two kids I have very little time to investigate it right now. What do you think about opening a new separate issue?
As for UI/UX, I might prepare a rough first-draft MR; maybe someone will pick it up.
Ok so I've created some initial work to include transcripts on the website in my branch: https://github.com/FlakM/jupiterbroadcasting.com/tree/transcripts - it should definitely be cleaned up by someone, but it's a starting point :)
Some convo around transcriptions (don't have time/competency to summarize :grin:):
Hi everyone, I got a chance to briefly speak with @gerbrent at the Mt. Vernon meet-up and I wanted to offer whatever I can to help with the transcriptions. I am very hard-of-hearing and wear hearing aids in both ears. I catch about 90% of what Chris says, 80% of Wes, and about 60% of Brent (sorry, Brent). I have degenerative hearing loss, so I will continue to lose my hearing until deafness at some point. I've been listening to LUP since about episode 50 and used to hear everything clear as day.
I am not that technical (former marketing, so I know my way around HTML/CSS and can write killer Facebook posts). I know I can help with feedback, proofreading, and UI/UX.
Here is what is important for me as a hearing disabled person:
I've investigated doing transcriptions for JB shows in the past without reaching a satisfying conclusion yet.
What I'd expect from a decent transcription would be:
A few services I took a look at:
pyannote.audio
I tested a combination of YouTube and IBM Watson (free tier) in the past: https://gist.github.com/pagdot/3b39187c6e0ca18dedd1f1108338855f
The result was... ok. Not great, but better than nothing.
In my Google Colab there is also a test with DeepSpeech by Mozilla.
If anyone is interested in also taking a look, Google Colab is a great way to test on a big GPU offered by Google, and there are often example Colab projects, either by the projects themselves or by the community.
Either way, a platform to run the transcription on in production would be required, and maybe even a way to contribute to the transcripts' quality. I could imagine pushing the results to this or another git repository, so that the community can make PRs with fixes.
Edit:
2022-08-18: Fixed YouTube entry in table (sadly it has no punctuation); added entry for AssemblyAI