adrianlyjak / obsidian-aloud-tts

MIT License

audio is regenerated frequently #28

Closed HadesNinth closed 1 week ago

HadesNinth commented 4 weeks ago

When I exit a note, re-enter it, and hit play, the audio sometimes gets regenerated from scratch. It also happens when I move my typing cursor within the note and hit play. I'm assuming this has something to do with my cursor position: the plugin seems inclined to regenerate audio for text it has already processed, as if it can't consistently remember which generated audio file pairs with which sentence, so it falls back to generating new audio.

It also seems that the plugin requests new audio from OpenAI if I increase or decrease the speed. Isn't there a way to play the existing audio faster without generating new audio, the way you can download a video from YouTube and play it at faster speeds in VLC media player? This would probably also allow the plugin to play audio at 3x, 4x, or 5x, which would be really nice for digesting information from a note quickly without having to slowly read.

Honestly, all this comes down to my wallet crying in agony :). It would also be nice if, in the future, we could use local models for notes. Or, at least, have the audio generated once and then embedded into the note once and for all.

adrianlyjak commented 4 weeks ago

Sounds like you more or less want a "permanent" cache? Audio is requested on demand, and only sentence by sentence, to keep costs down. If you jump the cursor into the middle of a sentence, the plugin requests audio starting from the cursor, so that counts as a new audio file. If a sentence is edited in any way, new audio is generated for it as well.
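The behavior described above is what you would expect from a cache keyed on the exact text sent for synthesis. Here is a minimal sketch of that idea. The `AudioRequest` shape, the `cacheKey` and `textFromCursor` helpers, and the inclusion of voice and speed in the key are all assumptions for illustration, not taken from the plugin's source.

```typescript
// Hypothetical sketch: not the plugin's actual implementation.
// Assumes cached audio is looked up by the exact request that produced it.

interface AudioRequest {
  text: string;  // the exact text sent to the TTS API
  voice: string; // voice name, e.g. "alloy"
  speed: number; // speed requested from the API
}

// Build a lookup key from every field that changes the resulting audio.
// A real implementation would likely hash this string to get a short
// file name; it is left unhashed here to keep the example simple.
function cacheKey(req: AudioRequest): string {
  return JSON.stringify([req.text, req.voice, req.speed]);
}

// Text from the cursor to the end of the sentence the cursor sits in.
function textFromCursor(sentence: string, cursorOffset: number): string {
  return sentence.slice(cursorOffset);
}

const sentence = "The audio cache is currently 8 hours.";
const base = { voice: "alloy", speed: 1 };

const full = cacheKey({ ...base, text: sentence });
const partial = cacheKey({ ...base, text: textFromCursor(sentence, 10) });
const edited = cacheKey({ ...base, text: sentence + "!" });
const faster = cacheKey({ ...base, text: sentence, speed: 1.5 });

// `partial`, `edited`, and `faster` all differ from `full`,
// so each would miss the cache and trigger a new request.
```

Under these assumptions, replaying an untouched sentence from its start produces the same key and reuses the cached file, while starting mid-sentence, editing the text, or changing the speed each produces a different key.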

The audio cache currently lasts 8 hours. I opted for a short, non-configurable duration, since an excessive cache size can really bloat your vault. (The files are mp3s in the .tts directory.) I was running into issues with long-lived caches where iCloud-based sync was effectively broken, preventing me from opening the vault on my phone. I'm looking into moving the cache onto the device to avoid sync issues (see https://github.com/adrianlyjak/obsidian-aloud-tts/issues/22). Once that's done, it might make sense to make the cache duration configurable in the settings, or to add some way of marking certain files as having a permanent cache.
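A time-based cache like the one described can be sketched as follows. The `CacheEntry` shape and the `isExpired` and `expiredEntries` names are hypothetical; only the 8-hour duration and the `.tts` directory come from the comment above.

```typescript
// Hypothetical sketch: entry shape and function names are assumptions.
const CACHE_TTL_MS = 8 * 60 * 60 * 1000; // 8 hours, per the comment above

interface CacheEntry {
  path: string;        // e.g. ".tts/<key>.mp3"
  createdAtMs: number; // creation time in milliseconds since the epoch
}

// An entry is expired once it is strictly older than the TTL.
function isExpired(
  entry: CacheEntry,
  nowMs: number,
  ttlMs: number = CACHE_TTL_MS
): boolean {
  return nowMs - entry.createdAtMs > ttlMs;
}

// Entries that a cleanup pass would delete from the cache directory.
function expiredEntries(
  entries: CacheEntry[],
  nowMs: number,
  ttlMs: number = CACHE_TTL_MS
): CacheEntry[] {
  return entries.filter((entry) => isExpired(entry, nowMs, ttlMs));
}
```

Taking `ttlMs` as a parameter is one way the duration could later be driven by a setting, and a per-file "permanent" flag could be modeled as an entry that `isExpired` always reports as fresh.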

> This would probably also allow the plugin to play audio at 3x, 4x, or 5x, which would be really nice for digesting information from a note quickly without having to slowly read.

Added https://github.com/adrianlyjak/obsidian-aloud-tts/issues/29 just now to track this. See that issue for background.

> Honestly, all this comes down to my wallet crying in agony :). It would also be nice if, in the future, we could use local models for notes. Or, at least, have the audio generated once and then embedded into the note once and for all.

Are you seeing high costs? Are you a heavy user of the plugin? I wouldn't expect much more than a dollar or two per month for moderate but frequent use.

adrianlyjak commented 4 weeks ago

> This would probably also allow the plugin to play audio at 3x, 4x, or 5x, which would be really nice for digesting information from a note quickly without having to slowly read.

Also note that OpenAI supports speeds up to 4x. I opted for a max of 2x to keep the interaction simple. Anything faster than that, at least with the OpenAI API, sounds kind of crazy to me. The audio is completely unintelligible.

Here's an example of the above paragraph at 4x ☝️ fast.mp3.zip

adrianlyjak commented 3 weeks ago

Also, with regard to this:

> It would also be nice if, in the future, we could use local models for notes

Totally agreed. There are no easy-to-use or small models for JS yet. I'm somewhat considering trying to get some of the Python models running within Node (so desktop only). They look somewhat heavy (guesstimating about 1 GB of RAM, preferably VRAM). See https://github.com/adrianlyjak/obsidian-aloud-tts/issues/25

HadesNinth commented 3 weeks ago

> Sounds like you more or less want a "permanent" cache? Audio is requested on demand, and only sentence by sentence, to keep costs down. If you jump the cursor into the middle of a sentence, the plugin requests audio starting from the cursor, so that counts as a new audio file. If a sentence is edited in any way, new audio is generated for it as well.

Thank you for the information! I was confused at first about how and when exactly the plugin decides to generate new audio. I work with Evergreen notes a lot in Obsidian, so a permanent cache is unnecessary since my notes change all the time. I saw the new update on cache duration and it's more or less what I wanted!

> Are you seeing high costs? Are you a heavy user of the plugin? I wouldn't expect much more than a dollar or two per month for moderate but frequent use.

I just went ahead and generated a bunch of new audio for a ton of notes I wanted to consume quickly, and the cursor location behavior confused me and caused me to regenerate one note 10 times. That's where the initial high costs came from. I went about things the wrong way. Generating audio on an as-needed basis is more cost-effective, and it's how your plugin was meant to be used. I've been using it for the past week, and costs are not an issue. It's very well optimized.

> Also note that OpenAI supports speeds up to 4x. I opted for a max of 2x to keep the interaction simple. Anything faster than that, at least with the OpenAI API, sounds kind of crazy to me. The audio is completely unintelligible.

I wholeheartedly agree. I've heard of some people who can listen at up to 5x, but I can only do up to 3x myself. To keep the interface simple, we probably don't need to go all the way up to 4x. My main concern was speeding up the audio locally rather than through OpenAI. It would be the most cost-effective way. I change speeds in the plugin a lot, back and forth, depending on whether I want a fast, general understanding (2x) or a slow, deep understanding of a note (0.75-1x). The plugin regenerated audio a lot when I did this, so I've been refraining from doing it too much.
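For reference, browsers expose local speed control through the standard `playbackRate` and `preservesPitch` properties of `HTMLMediaElement`, so already-downloaded audio can be sped up without a new request. The sketch below is only an illustration of that idea, not how the plugin works: the `SpeedControllable` interface, the `setLocalSpeed` helper, and the rate limits are hypothetical.

```typescript
// Hypothetical sketch of local speed control on already-cached audio.
// `playbackRate` and `preservesPitch` are standard HTMLMediaElement
// properties; everything else here is an assumption for illustration.

// Minimal shape of the properties used, so the sketch also type-checks
// outside a browser. An HTMLAudioElement satisfies this interface.
interface SpeedControllable {
  playbackRate: number;
  preservesPitch?: boolean;
}

const MIN_RATE = 0.5; // assumed lower bound
const MAX_RATE = 4;   // assumed upper bound

// Clamp the requested rate, apply it to the player, and return the
// rate that was actually set. No network request is involved.
function setLocalSpeed(player: SpeedControllable, requestedRate: number): number {
  const rate = Math.min(MAX_RATE, Math.max(MIN_RATE, requestedRate));
  player.preservesPitch = true; // keep the voice from sounding higher or lower
  player.playbackRate = rate;
  return rate;
}

// In a browser: setLocalSpeed(new Audio("sentence.mp3"), 2);
```

With this approach, the speed would not need to be part of the cache lookup, so switching between 1x and 2x would reuse the same file. The trade-off is that time-stretched audio may sound different from audio the API generates at that speed.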

> Totally agreed. There are no easy-to-use or small models for JS yet. I'm somewhat considering trying to get some of the Python models running within Node (so desktop only). They look somewhat heavy (guesstimating about 1 GB of RAM, preferably VRAM). See #25

The main reason I advocate for this is that we could use custom voices. I've found that, depending on who reads the notes to me, I can come away with a very different understanding of even what I wrote myself. Something about their tone of voice, and how they emphasize different words, I guess. Creativity and innovation are weird. Eleven Labs integration could be done, but it's orders of magnitude more expensive than OpenAI. It's unfortunate that OpenAI delayed the release of their cloning model.

adrianlyjak commented 1 week ago

Thanks for the feedback! I think all of your issues are addressed, or tracked in other issues.

> The main reason I advocate for this is that we could use custom voices. I've found that, depending on who reads the notes to me, I can come away with a very different understanding of even what I wrote myself.

I'm definitely keeping my eye on integrating custom voice models (some of them even allow "prompting" a voice personality rather than using a reference mp3). Ideally the models could be embedded in the plugin runtime, but that's likely a ways out and desktop only.

If I missed anything, let me know, and I'll reopen the issue.