Closed FearYourSelf closed 3 months ago
Hi! One of the Sesame team members posted on Twitter that it will not come with Maya and Miles.
I can't understand their decision. It's pretty obvious there will be tons of people disappointed by this. Maya and Miles are probably what everyone hoped for and wants to have in their own hands.
So we'll need to train our own. How should I go about training my own voices? I'm new to this, and I really appreciate any help you guys can give me.
We don’t know yet. They haven’t released anything so far. They will likely share a training script and some guidelines, but it might not be that easy to train a model like Maya again. They probably used a very good and curated dataset to achieve that level of “humanity” everyone is falling in love with. We’ll see.
We'll see...
https://x.com/justLV/status/1895559587509719309
They said they can't release the fine-tuned version trained on the talent's voice (presumably for copyright reasons? Idk). But I wouldn't be too concerned about not having the same level of "humanity" in the base model. From reading the blog post, the base model, which they are releasing, is what contains the "humanity". The fine-tuned version presumably just superimposes a different voice onto this.
This is my read of it. Maybe I am wrong. Let's see.
The truth is that we do not know until it is open-sourced. Most questions people ask here can be answered with that.
If you ask Maya/Miles, they might tell you something along the lines of their default voice being more robotic, analytical, and with less personality. That could be somewhat true or a hallucination. Personally, I would not worry about this whatsoever until we have code in the repo. Until then it's all guesswork.
Do not stress mate. The OS community has more combined compute than these companies lol. We will see plenty of realistic models rivaling Maya very quickly.
No, the open-source community just doesn’t have the resources to train these models from scratch at the same level as big corporations.
Fine-tuning is doable, yes, but the real challenge is the lack of high-quality data that these corporations are able to access. Maybe something like what the Kokoro-TTS dev is doing, where large proprietary models are distilled, could help here (to a degree).
If open-source could really compete, we’d have seen models as good as CSM by now, but we haven’t. Without access to base models, progress would’ve been stuck for years, just waiting for a company to release something, just as we are now. And let’s be real, every speech-to-speech attempt so far (Moshi, GPT-Omni, Ichigo) has been a flop, interesting research-wise but totally unusable, unlike CSM.
Collectively, I believe the OS community does, but coordinating all that hardware for training is the hard part. What the OS community does have is more engineers behind it, and it comes up with innovations like quantization that let better models run on local hardware. Though a large company's compute could go further once they also adopt those ideas.
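To make the quantization point concrete, here is a minimal sketch of what running a larger model on local hardware looks like today, assuming the transformers and bitsandbytes libraries; facebook/opt-6.7b is only a stand-in checkpoint for illustration:

```python
# Rough sketch (not anyone's production setup): loading a 7B-class model in 4-bit
# with bitsandbytes so it fits on a single consumer GPU. facebook/opt-6.7b is just
# a stand-in checkpoint for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # the usual QLoRA-style setting
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep matmuls in bf16 for quality
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=bnb_config,
    device_map="auto",                      # place layers across available GPU/CPU memory
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
```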
But how would we train a voice using this CSM? Eleven Labs? Or will it come with a fine-tuned and trained default voice? I'm sorry I'm kinda new to all of this.
Not entirely true, mate. Open-source doesn't need to train from scratch to compete—distillation, model surgery, and dataset curation are already closing the gap. Look at what happened with LLMs: people said the same thing, yet now we have models like Mixtral and Qwen-2 outperforming some closed-source counterparts.
Kokoro-TTS dev’s approach is promising, and data constraints are real, but the OS community is very good at working around them—whether it’s synthetic data, fine-tuning, or leveraging existing foundation models in creative ways. Also, CSM’s edge isn’t just its model; it’s the dataset and post-processing tricks. Once those get replicated (and they will), you’ll see open-source alternatives catching up fast.
From my understanding, they’re essentially providing a fork of Moshi, which means we’re not training a model from scratch. Anyone with relatively modest compute can generate a pickled tensor voice—it’s not about brute-force training, but about using the underlying architecture and "acoustic tokens" to evoke natural inflections within the voice model.
And about Moshi, GPT-Omni, Ichigo—those were first attempts. It’s like saying open-source LLMs would never work because GPT-2 wasn’t as good as GPT-4. These things take iterations. If there's one thing OS has proven, it’s that it doesn’t stay behind for long.
> If open-source could really compete, we’d have seen models as good as CSM by now, but we haven’t.
That's nonsense...
What kind of invented-out-of-thin-air criterion is that...
Cost of training has been in free fall for three years now, as new inventions and discoveries allow us to train for cheaper and cheaper month after month.
Deepseek R1 was pretty obvious evidence open source isn't that far behind the big corporations.
And it's not just R1, we get amazing open models all the time. In the domain of Stable Diffusion / image generation, and of video generation too by the way, a lot of the open models are nearly at the level of the corporate models, or at some points even above it...
Open LLMs are not at the level of corporate models, but they're SO much closer than they were a year ago, which was WAY closer than they were 2 years ago, and the gap keeps on closing as computing becomes less and less important, and techniques become more and more important...
I really wouldn't be surprised if by the end of this year, we have at least one open model getting released that's above the level of corporate SOTA, at least for a time, R1 was already fitting that bill for some benchmarks...
Voice models are probably where open source is furthest behind the corporate models, but I really wouldn't be surprised if we got open SOTA models in that field this year too; look at this repo for example...
Despite the comments above from people who clearly don’t understand the underlying architecture—or how any of this actually works—this is just a fork of Moshi. It uses Mimi as its neural audio codec, meaning you're not training from scratch, just fine-tuning a pre-trained base. With consumer-grade hardware, this is entirely feasible.
The key is using a low learning rate and fine-tuning checkpoint by checkpoint to avoid distorting the pre-trained knowledge. PyTorch Lightning or the Hugging Face Trainer API can handle this efficiently, especially with FP16/BF16 mixed precision to optimize VRAM usage. If you have an RTX 3090/4090 or an A5000, you can fine-tune it locally without issue.
For example, in my home lab, I have a 4090, two A6000s, and three A5000s, and I was able to fine-tune similar models on ~25 hours of DBZ clips for Goku’s voice. Using DeepSpeed, the process took only ~16 hours.
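As a rough illustration of that kind of conservative setup (not Sesame's actual recipe; the model and dataset will depend on what they ship), the Hugging Face Trainer side might be configured along these lines:

```python
# Hedged sketch of the conservative fine-tuning configuration described above.
# The model and train_dataset are placeholders until Sesame's release lands.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="csm-voice-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # simulate a larger batch on a single GPU
    learning_rate=1e-5,                 # low LR so the pre-trained base isn't distorted
    warmup_ratio=0.03,
    num_train_epochs=2,
    bf16=True,                          # or fp16=True on pre-Ampere cards
    gradient_checkpointing=True,        # trade compute for VRAM on a 3090/4090
    save_strategy="epoch",              # checkpoint by checkpoint, keep the best one
    logging_steps=10,
)
# Trainer(model=..., args=args, train_dataset=...) would then run the fine-tune.
```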
There are multiple approaches. If you wanted to leverage Eleven Labs, you could distill their voices via API, generating enough training data to fine-tune the base model on a specific voice talent. Alternatively, if you're after Maya, you could simply distill their web app demo and reconstruct it yourself.
Yes, you could unironically distill Maya—ethically responsible? You decide.
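If you do go the distillation route, the data-prep step is mostly pairing clips of the target voice with transcripts. A minimal sketch using openai-whisper; the folder and manifest names are placeholders:

```python
# Hedged sketch: turn a folder of clips of a target voice into (audio, text) pairs
# that a fine-tuning script could consume. Paths and the manifest format are placeholders.
import json
from pathlib import Path

import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # use "medium" or "large" for cleaner transcripts

pairs = []
for clip in sorted(Path("voice_clips").glob("*.wav")):
    result = model.transcribe(str(clip))
    pairs.append({"audio": str(clip), "text": result["text"].strip()})

Path("train_manifest.jsonl").write_text(
    "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
)
```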
i really hope so
Lex, if I needed support or if I had any questions about training it myself (I have a 3090), could you help me? Can I ask you for help? Because, like I said, I've never done anything like that. I'm getting into it, but so far, I only know the tip of the iceberg. Is there any other way for me to keep in touch with you? If you don't mind, of course.
I'd love nothing more than to help you out. To get started, you'll need a solid understanding of Python, as well as how PyTorch and Transformers function under the hood. It's not going to be something you learn overnight. If you're comfortable with that, feel free to reach out—my Discord is 229184384610074624.
That said, it might be worth waiting a bit. I'm holding off until they release it on Hugging Face, so we can properly analyse the changes they've made from Moshi in their fork.
In the meantime, CivitAI could be a great resource if you don't want to actually do anything yourself. While it primarily focuses on diffusion models, there's already a lot of discussion happening around this release. The community is eager to create checkpoints on day one, so it’s definitely worth keeping an eye on. That’s why I said we’ll likely have voice models within a month.
I'm sorry, but I can't add you using your ID if we don't share a server in common. What is your username there so I can add you on Discord? Mine is Harmony. I'd love to have tutoring from someone who understands, and I can't thank you enough for the help you're providing me, Lex. Thank you so much!
I've sent you a friend request on Discord, check it out!
You are reiterating my point: we, the OS community, do not have the resources to train such a base model from scratch. Of course, if Sesame releases their forked base model, we will be able to fine-tune it. But we need them to do it; otherwise we are stuck, as we have been so far. It is technically true that the OS community as a whole may have more combined resources than a single big corpo, but as you have said, disorganization renders that potential advantage meaningless. We’re at a disadvantage, and we’ll always be playing catch-up. We have already seen countless clones of the latest OpenAI feature, and they are mostly useless (if they work at all) and stop being maintained within a few weeks. The same goes for the many S2S attempts so far; none has been successful. So we still depend on big corpos to move us forward. Let's not be delusional about this.
A friend request was sent, thank you so much, Lex.
You claim the open-source community "doesn’t have the resources" to train models from scratch, yet the entire foundation of Sesame’s model is a fork of Moshi—the so-called ‘useless’ project from an ‘underfunded’ academic team in Paris.
So which is it? If open-source couldn’t compete, then how did an underfunded academic lab build the model Sesame is now riding on? You can’t dismiss OS while simultaneously praising a project that only exists because of open-source research. The fact that a tiny academic team—not some billion-pound corporation—built the architecture that Sesame refined proves beyond a doubt that this level of progress is absolutely within OS reach.
And let’s talk about "we don’t have enough compute."
It’s not 2019 anymore—training a model like this doesn’t cost what it did even two years ago. Efficiency gains, better scaling strategies, and improved hardware accessibility have slashed training costs to a fraction of what they used to be. You can literally spin up H100s for dollars an hour on cloud platforms, and we now have dedicated spaces like Unsloth and Hugging Face’s API trainer making fine-tuning more accessible than ever. So where exactly is this idea coming from that compute is some sacred, unattainable grail?
We’ve already seen open-source train massive models that were once considered impossible. Falcon, Mistral, Zephyr, and every single SDXL variant prove that compute is no longer the bottleneck. People are training LLMs on consumer-grade hardware now. Your argument might have been valid three years ago, but today? It’s just not the case.
And about that claim that "if OS could compete, we’d have seen models as good as CSM by now"—that’s exactly what happened in diffusion.
- Stability AI released SD1.4, but the OS community immediately took over
- ControlNet? OS.
- Refiner models? OS.
- ComfyUI’s workflow-based optimisations? OS.
- Every major inference optimisation that makes Stable Diffusion run faster and cheaper? OS.
Open-source didn’t just catch up—it outpaced corporate development within months. The same thing is going to happen here.
And since you seem to have missed this part, Sesame’s entire model is literally a collection of open-source forks.
They’re using:
- Moshi – The actual backbone of their model, built by an underfunded academic team in Paris
- WhisperX – Open-source forced alignment for transcriptions, providing word-level timestamps to ensure precise synchronisation in speech processing.
- Faster Whisper Plus – An optimised and low-latency Whisper fork, used for fast, efficient transcription before alignment.
- WavTools – An open-source library for audio manipulation, likely used for processing and modifying speech waveforms.
- SGLang – An open-source speech-to-speech language modelling toolkit, aiding in prosody control, phoneme mapping, and multi-lingual synthesis.
- Silero VAD – An open-source voice activity detection model, helping detect speech segments vs. silence/noise for cleaner output.
- GPT-Fast – A lightweight GPT-based processing library, likely used for text pre/post-processing or additional inference speed-ups.
So can you rationalise how the OS community "doesn’t have the resources" when every single tool Sesame is using is open-source? The only thing Sesame did was refine and integrate these tools—which is valuable work, let’s not misconstrue that—but it doesn’t change the fact that the open-source community built the core components.
So no, OS isn’t "waiting around for corporations to save us"—it’s laying the groundwork that corporations rely on.
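For what it's worth, those pieces really are one pip install or torch.hub call away. A minimal sketch of the Silero VAD step from the list above, roughly how you would cut raw recordings into clean speech segments before transcription (session.wav is a placeholder):

```python
# Hedged sketch: detect speech segments with Silero VAD, the open-source component
# listed above for separating speech from silence/noise. "session.wav" is a placeholder.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio("session.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

for i, ts in enumerate(speech_timestamps):
    # each ts is a dict like {"start": sample_index, "end": sample_index}
    save_audio(f"segment_{i:04d}.wav", collect_chunks([ts], wav), sampling_rate=16000)
```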
I agree with most of what you pointed out; however, the team behind Moshi isn't underfunded at all.
Apart from that, their statement was that they used 1 million hours of audio. Having that data is one thing, training on it is another; even if you have the capital for the compute, that's really the crux, as it's a two-stage training process.
Unsloth can't train audio tokens yet, and it's not planned in the short term either (I contract with Unsloth).
So while you certainly can spin up 8 or 16 H100s for little cash, getting access to 64-128 is still a major problem. They need to be in the same datacenter, as even 400 Gb/s links are painfully slow, and clusters at that size come with their own problems.
Getting a Whisper ASR backbone working with adapters is a few hours, as Ultravox has the code for it, but it's far from as trivial as you make it out to be.
The biggest question is still whether we even get the model, as this could be an HR ploy or a way to collect preference data too. We'll have to wait and see.
I've managed to fine-tune Moshi using a modified version of this repo https://github.com/yangdongchao/RSTnet, with LoRA on an A100 in Colab. Because the A100 only has 40 GB, I'm somewhat limited by dataset size, but noting that this model is supposedly 8B compared to Moshi's 7B, I think it would certainly be possible with a few more optimizations. My main issue with fine-tuning is that while I can change Moshi's voice, I don't have enough VRAM to give it proper intelligence, so it's still as dumb as Moshi was. If this base model already has strong intelligence, we should be able to add voices without too many issues, I reckon.
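For anyone wondering what the LoRA part of that looks like in code, here is a generic sketch with the peft library; gpt2 is only a stand-in for whatever speech backbone gets released, and target_modules would need to match that model's attention layer names:

```python
# Hedged sketch: attach LoRA adapters so only a small fraction of weights are trained,
# which is what makes fine-tuning fit into 40 GB (or a 3090). gpt2 is just a stand-in.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank: higher = more capacity, more VRAM
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # gpt2's fused attention projection; differs per model
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```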
Yeah, that's exactly what I said. I agree that the open-source community can fine-tune models, but we still rely on big corporations to release high-quality base models to build on. If Sesame hadn't released their base model, none of what you mentioned would even be possible. I hope we can agree on that.
That’s precisely why we need Alibaba and Mistral to release their models so that talented individuals can develop techniques to commoditize and democratize this technology. If Meta, Mistral, Alibaba and others hadn’t released their models, we wouldn’t have had anything to work with in the first place.
Training models at this scale requires a massive investment of money and time, and the few entities capable of and incentivized to do it aren’t likely to release their work openly or freely unless it aligns with their business or professional interests.
Even when companies or groups do release models, it’s rarely true open source. What we almost always get are pretrained weight dumps without datasets, training code, or the ability to reproduce results. Kokoro TTS is a clear example. The developer has invested heavily in training these models but hasn’t shared the training process because it holds commercial value (their words). There is no dataset, no code. Community-driven development is severely limited. Calling that open source is misleading.
We are not independent. We rely on him to move forward. You can hack together ways to fine-tune Kokoro’s models, but you will always be limited by the base model provided through his generosity. And while I am deeply grateful to him, because so few people would be that generous, it still highlights the fundamental issue I am insisting on. Until we can train and reproduce these kinds of SOTA models ourselves from scratch, we are at the mercy of those who choose to share, and they can stop at any time. This applies perfectly to Sesame and CSM as well. We need to accept reality as it is.
The open-source community, meaning developers, builders, and users (not Meta, not Moshi, not Mistral), can’t compete or provide solutions without these companies’ contributions. That’s why so many people are eagerly waiting for Sesame to release their model and why there’s already frustration over the delays or limited access.
Including Meta and other companies as part of the open-source ecosystem is fine as long as their corporate goals are not directly hindered by making these models freely available (which is the case for Meta and several venture/state-backed corporations). While I do not consider them truly OS, they are enabling the development of OS tools and techniques, as you have mentioned. But the moment they shift away from that, we’re left with nothing to continue working with. We can’t do this alone, at least not yet. Hopefully, one day, we’ll be empowered enough to move forward on our own.
That said, I don’t consider these companies truly part of the open-source community. Releasing massive binary blobs of model weights without transparency or reproducibility is not open-source at all. At any point, they can decide that open weights no longer align with their business strategy and shift to a fully closed model, leaving us with no way forward. Their participation is conditional, driven by their own interests, not by a true commitment to open-source principles.
To be clear, we use the Mimi RVQ codec, but not the Moshi model. And yes, we do benefit from the open source community greatly, which is why we hope to contribute to it. As a product company, it's a balance we have to strike.
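For anyone wondering what an RVQ codec like Mimi actually does, here is a purely conceptual sketch of residual vector quantization: each latent audio frame is quantized against a stack of codebooks, and what gets modelled downstream is the resulting handful of discrete token indices per frame. This illustrates the idea only; it is not Mimi's implementation.

```python
# Conceptual sketch of residual vector quantization (RVQ), the idea behind
# neural audio codecs like Mimi. NOT Mimi's code -- just an illustration of
# how one latent frame becomes a small stack of discrete tokens.
import torch

def rvq_encode(x, codebooks):
    """x: (dim,) latent frame; codebooks: list of (num_codes, dim) tensors."""
    residual = x.clone()
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb)   # distance to every code
        idx = dists.argmin(dim=-1)                       # pick the closest code
        indices.append(idx.item())
        residual = residual - cb[idx.squeeze()]          # quantize what's left over
    return indices

def rvq_decode(indices, codebooks):
    # reconstruction is just the sum of the selected codes
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# toy example: 8 codebooks of 1024 codes over a 256-dim latent
torch.manual_seed(0)
codebooks = [torch.randn(1024, 256) for _ in range(8)]
frame = torch.randn(256)
tokens = rvq_encode(frame, codebooks)        # 8 small integers per frame
recon = rvq_decode(tokens, codebooks)
print(tokens, torch.norm(frame - recon).item())
```

Real codecs learn the codebooks jointly with an encoder/decoder, of course; the point is just that "audio tokens" are these per-frame indices.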
Despite the comments above from people who clearly don’t understand the underlying architecture—or how any of this actually works—this is just a fork of Moshi. It uses Mimi as its neural audio codec, meaning you're not training from scratch, just fine-tuning a pre-trained base. With consumer-grade hardware, this is entirely feasible.
You are reiterating my point: we, the OS community, do not have the resources to train such a base model from scratch. Of course, if Sesame releases their forked base model, we will be able to fine-tune it. But we need them to do it, or otherwise we are stuck as we have been so far. It is technically true that the OS community as a whole may have more combined resources than a single big corpo, but as you have said, disorganization renders that potential advantage meaningless. We’re at a disadvantage, and we’ll always be playing catch-up. We have already seen countless clones of the latest OpenAI feature, and they are mostly useless (if they work at all) and stop being maintained in a few weeks. The same goes for the many S2S attempts so far; none has been successful. So we still depend on big corpos to move us forward. Let's not be delusional about this.
You claim the open-source community "doesn’t have the resources" to train models from scratch, yet the entire foundation of Sesame’s model is a fork of Moshi—the so-called ‘useless’ project from an ‘underfunded’ academic team in Paris.
So which is it? If open-source couldn’t compete, then how did an underfunded academic lab build the model Sesame is now riding on? You can’t dismiss OS while simultaneously praising a project that only exists because of open-source research. The fact that a tiny academic team—not some billion-pound corporation—built the architecture that Sesame refined proves beyond a doubt that this level of progress is absolutely within OS reach.
And let’s talk about "we don’t have enough compute."
It’s not 2019 anymore—training a model like this doesn’t cost what it did even two years ago. Efficiency gains, better scaling strategies, and improved hardware accessibility have slashed training costs to a fraction of what they used to be. You can literally spin up H100s for dollars an hour on cloud platforms, and we now have dedicated spaces like Unsloth and Hugging Face’s API trainer making fine-tuning more accessible than ever. So where exactly is this idea coming from that compute is some sacred, unattainable grail?
We’ve already seen open-source train massive models that were once considered impossible. Falcon, Mistral, Zephyr, and every single SDXL variant prove that compute is no longer the bottleneck. People are training LLMs on consumer-grade hardware now. Your argument might have been valid three years ago, but today? It’s just not the case.
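For concreteness, the consumer-hardware fine-tuning being referred to typically looks something like a 4-bit quantized base model with LoRA adapters via transformers/peft/bitsandbytes. The model name and hyperparameters below are illustrative choices, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"   # illustrative choice of base model

# load the base model in 4-bit so it fits on a single consumer GPU
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# attach small LoRA adapters; only these get trained
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a fraction of a percent of the full parameter count
```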
And about that claim that "if OS could compete, we’d have seen models as good as CSM by now"—that’s exactly what happened in diffusion.
- Stability AI released SD1.4, but the OS community immediately took over
- ControlNet? OS.
- Refiner models? OS.
- ComfyUI’s workflow-based optimisations? OS.
- Every major inference optimisation that makes Stable Diffusion run faster and cheaper? OS.

Open-source didn’t just catch up—it outpaced corporate development within months. The same thing is going to happen here.
And since you seem to have missed this part, Sesame’s entire model is literally a collection of open-source forks.
They’re using:
- Moshi – The actual backbone of their model, built by an underfunded academic team in Paris
- WhisperX – Open-source forced alignment for transcriptions, providing word-level timestamps to ensure precise synchronisation in speech processing.
- Faster Whisper Plus – An optimised and low-latency Whisper fork, used for fast, efficient transcription before alignment.
- WavTools – An open-source library for audio manipulation, likely used for processing and modifying speech waveforms.
- SGLang – An open-source speech-to-speech language modelling toolkit, aiding in prosody control, phoneme mapping, and multi-lingual synthesis.
- Silero VAD – An open-source voice activity detection model, helping detect speech segments vs. silence/noise for cleaner output.
- GPT-Fast – A lightweight GPT-based processing library, likely used for text pre/post-processing or additional inference speed-ups.

(A rough sketch of how a couple of these pieces chain together in a data-prep pass follows below.)
So can you rationalise how the OS community "doesn’t have the resources" when every single tool Sesame is using is open-source? The only thing Sesame did was refine and integrate these tools—which is valuable work, let’s not misconstrue that—but it doesn’t change the fact that the open-source community built the core components.
So no, OS isn’t "waiting around for corporations to save us"—it’s laying the groundwork that corporations rely on.
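As promised above, a rough sketch of how a couple of those open-source pieces chain together in a data-prep pass: Silero VAD to find speech segments, then faster-whisper for transcription with word-level timestamps. Paths and parameters are placeholders; this is not Sesame's actual pipeline.

```python
import torch
from faster_whisper import WhisperModel

# Silero VAD via torch.hub: find where speech actually is in a raw clip
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("raw_clip.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, vad_model, sampling_rate=16000)
print(f"{len(speech)} speech segments found")

# faster-whisper: transcribe with word-level timestamps for later alignment/chunking
asr = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = asr.transcribe("raw_clip.wav", word_timestamps=True)
for seg in segments:
    for w in seg.words:
        print(f"{w.start:.2f}-{w.end:.2f}\t{w.word}")
```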
You're confusing fine-tuning with training from scratch. I never claimed we couldn’t fine-tune models—quite the opposite. Please read my comment again. If Stability AI hadn’t released Stable Diffusion, platforms like A1111 and CivitAI wouldn’t exist. The community would have been waiting on Flux or some other company to release it. WhisperX wouldn’t exist if OpenAI hadn’t released Whisper, which set a new SOTA benchmark and outperformed anything else that had been openly researched.
You’re also misrepresenting the role of companies like Stability AI, treating them as if they’re just another part of the open-source community. They are not. Stability AI is a venture-backed company with massive funding, and its decision to release Stable Diffusion was based on business strategy, not some grassroots OS initiative. The same applies to almost every example you mentioned. These companies released models because it aligned with their business interests, not because they are fundamentally open-source entities. Calling them part of the OS movement is misleading. They are incidental contributors, not the backbone of open-source AI.
The reality is that the community doesn’t train base models from scratch—we wait for big players to release them. There’s no fine-tuning of CSM without a base model, and if Sesame chooses not to release theirs, nobody in the OS community can replicate it today. That’s a fact, and it’s time to acknowledge it. Every tool you listed depends on foundational models that were built and released by large corporations that invested a significant amount of money (on the order of millions for Whisper, SD, and LLaMA).
And no, Sesame didn’t just integrate a few OS tools to come up with csm. They trained the Moshi encoder and decoder from scratch using a curated dataset of one million hours of audio, with a lot of trial and error to get where they are now. If you were to do that yourself, you’d spend months just gathering and cleaning the data, researching, developing and debugging efficient training techniques, then months more optimizing for speech fluency, inference latency and cost (compute time and memory). This isn’t a trivial effort, and it’s absurd to downplay their role. I don’t know where this arrogance comes from, but pretending the OS community is self-sufficient when we still depend on corporations to provide base models is just denial.
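To put the data side of that in perspective, a quick back-of-envelope (assuming 24 kHz, 16-bit mono PCM; these are my assumptions, not Sesame's published figures):

```python
# rough storage estimate for a million hours of raw audio
hours = 1_000_000
sample_rate = 24_000          # assumed
bytes_per_sample = 2          # 16-bit PCM, mono (assumed)
total_bytes = hours * 3600 * sample_rate * bytes_per_sample
print(f"~{total_bytes / 1e12:.0f} TB of raw audio before any cleaning or curation")  # ~173 TB
```

And that is before transcription, alignment, filtering, and the inevitable re-runs.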
There has yet to be a single foundational model created entirely by the open-source community that can remotely compete with the SOTA models (TTS, LLM, T2I, T2V). Everything we have is built on fine-tuning models that were developed with significant investment from very well-funded companies.
I am not dismissing the early S2S attempts that were developed more openly from the start. I acknowledge their value and appreciate the effort behind them. However, the reality is that they are not yet as usable for end users as CSM is. That’s all I was saying. It seems you are trying to distort my words.
Commercial incentives are what drive this technology forward, and that applies to Sesame, Meta, Mistral, Alibaba, OpenAI, and many others. We still depend on companies with the resources to build these foundational models, and, without them, we wouldn’t have anything to build upon. Perhaps a bit less arrogance and a more realistic perspective would be helpful here.
I agree with most of the stuff you pointed out - however, the team behind Moshi isn't underfunded at all.
Apart from that, their statement was that they used 1 million hours of audio - having that data is one thing, training on it is another. Even if you have the capital for the compute, that's really the crux, as it's a two-stage training process.
Unsloth can't train audio tokens yet, and it's not planned in the short term either (I contract with Unsloth).
So while you certainly can spin up 8/16 H100s for little cash, getting access to 64-128 is still a major problem - they need to be in the same datacenter, since even 400 Gb/s links are painfully slow, and clusters at that size come with their own problems.
Bolting a Whisper ASR backbone on with adapters is a few hours of work (Ultravox has the code for it), but it's far from as trivial as you make it out to be.
The biggest question is still whether we even get the model at all, as that could be an HR ploy or a way to collect preference data too. We'll have to wait and see.
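To make the interconnect point concrete, a rough, assumption-heavy calculation: the time to naively exchange gradients for an 8B-parameter model in bf16 over a single 400 Gb/s link, ignoring overlap and ring-reduce efficiency (all numbers are illustrative, not benchmarks):

```python
params = 8e9                        # assumed model size
grad_bytes = params * 2             # bf16 gradients: 2 bytes per parameter (~16 GB)
link_bytes_per_s = 400e9 / 8        # 400 Gb/s link -> 50 GB/s
print(f"~{grad_bytes / link_bytes_per_s:.2f} s per naive gradient exchange")  # ~0.32 s
```

Do that every step and the interconnect, not the GPUs, becomes the thing you are waiting on.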
I agree with most of what you said, with minor clarifications. "Underfunded" was meant relatively—compared to major corporate backing like a16z, which funds Sesame. Kyutai’s conference framed Moshi as an academic effort, but after your reply, I looked further (TechCrunch, for instance) and found they reportedly secured over $300 million in late 2023. In fairness, their 42-minute YouTube showcase made it seem like a purely academic venture.
From Sesame’s technical paper, and my understanding of it, the million-hour dataset primarily relates to acoustic tokens, which govern articulation—tone, inflections, and suprasegmental features (pitch, volume, rhythm)—rather than speaker identity. While acoustic tokens modulate prosody in synthesis, fine-tuning the base voice model (as with Moshi) remains viable, and pre-trained knowledge should still allow for external checkpointing and adaptation.
Also, love that this thread has the most direct discussion in their entire Git repo—the OS community is clearly champing at the bit for release. CivitAI is on fire, too. I've already started compiling audio for a specific voice actress—about 45 hours so far from interviews, films, and TV shows—and I'm planning to fine-tune on day one on my home lab.
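For what it's worth, a day-one fine-tune along the lines discussed above (low learning rate, checkpoint by checkpoint) would presumably look roughly like this. `load_base_model` and `VoiceDataset` are placeholders, not Sesame's actual API.

```python
import torch
from torch.utils.data import DataLoader

model = load_base_model("csm-base")           # placeholder loader for whatever gets released
dataset = VoiceDataset("my_45h_corpus/")      # placeholder dataset of (audio, text) pairs
loader = DataLoader(dataset, batch_size=4, shuffle=True)

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)   # low LR to avoid wrecking the base
for step, batch in enumerate(loader):
    loss = model(**batch).loss                # assumes a HF-style forward that returns .loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad()
    if step % 500 == 0:
        torch.save(model.state_dict(), f"ckpt_{step:06d}.pt")   # checkpoint by checkpoint
```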
I think you’re arguing against a position I never took. I never confused fine-tuning with training from scratch—in fact, I made that distinction explicitly multiple times. My point was that the open-source community as a whole has more compute than any single corporation—that’s not a mischaracterisation; it’s a fact. The challenge isn’t raw capacity but coordination and efficiency.
Yes, foundational models like Whisper, SD, and LLaMA required massive investment, but they didn’t come out of nowhere—they’re built on years of open research. Corporations capitalise on OS breakthroughs, scale them, and package them into business-friendly releases. That doesn’t erase the fact that OS research was the backbone of many of these advancements.
OS might not train billion-scale models from scratch (yet), but it’s been at the forefront of optimisations, fine-tuning, and inference breakthroughs. Falcon, Mistral, and SDXL weren’t just corporate efforts—they were shaped by an active OS community. Saying OS “waits” ignores the fact that companies like Mistral and Together open-source models specifically so the community can iterate on them.
Sesame put in real effort training CSM, especially in transforming acoustic tokens. But let’s not pretend it’s some walled-off, proprietary monolith. It still relies on OS tools—WhisperX, FasterWhisper Plus, Silero VAD, SGLang, and even its backbone model, Moshi—all fundamental to its pipeline. Could OS train a Moshi-scale foundational model today? Probably not. But can it replicate key parts of the workflow and drive iteration? It already is.
This isn’t speculation—OS has done it before. ControlNet wasn’t from Stability AI, but it redefined image conditioning. Open-source forks like SDXL Turbo, PyTorch 2.0-optimised Stable Diffusion, and DeepFloyd IF pushed multimodal generation further. Whisper started as an OpenAI release, but FasterWhisper, WhisperX, and Deepgram’s models have evolved it far beyond its original scope. Even LLaMA, once a closed Meta release, led to Mistral, Nous Hermes, and Qwen, all outpacing corporate offerings in key areas.
Nobody said OS could train a new model from scratch overnight. The real discussion is whether OS can optimise, refine, and adapt these models independently—and it already does. "The key is using a low learning rate and fine-tuning checkpoint by checkpoint to avoid distorting the pre-trained knowledge." That’s exactly what’s happening.
Corporations may lead foundational model training now, but that doesn’t mean it’ll always be that way. Compute costs are dropping, decentralised training is scaling, and open-source is already shaping SOTA models through iterative improvements. The argument that OS "waits" ignores that it actively builds upon and surpasses corporate efforts—often within months of release.
I'm just throwing the output to RVC for consistent voices. Easy.
Releasing MAYA weights and software would be a blast. Failing to do so will just keep you in the shadow of the "big boys".
LOL as if that ever worked - this gaslighting / blackmail shit won't get you very far - they got what they wanted from the demo ^^ they did release the base weights .. just fine-tune it and happy days
Blackmailing? I did not blackmail anyone. I just see that nobody cares about their demo anymore (even if it's very good).
Do you guys know whether, once it's released (just the weights, or eventually the whole source code), it will come with Maya and Miles, or whether we'd need to train a new voice?