
Auralis 🌌 (/auˈralis/)

Transform text into natural speech (with voice cloning) at warp speed. Process an entire novel in minutes, not hours.

What is Auralis? 🚀

Auralis is a text-to-speech engine that makes voice generation practical for real-world use:

- Fast enough to process an entire novel in minutes, not hours
- Voice cloning from short reference audio
- Synchronous, asynchronous, and streaming generation APIs
- An OpenAI-compatible server for easy deployment

Quick Start ⭐

  1. Create a new Conda environment:

    conda create -n auralis_env python=3.10 -y
  2. Activate the environment:

    conda activate auralis_env
  3. Install Auralis:

    pip install auralis

You can then try it out via Python:

from auralis import TTS, TTSRequest

# Initialize
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

# Generate speech
request = TTSRequest(
    text="Hello Earth! This is Auralis speaking.",
    speaker_files=['reference.wav']
)

output = tts.generate_speech(request)
output.save('hello.wav')

Or via the CLI, using the OpenAI-compatible server:

auralis.openai --host 127.0.0.1 --port 8000 --model AstraMindAI/xttsv2 --gpt_model AstraMindAI/xtts2-gpt --max_concurrency 8 --vllm_logging_level warn  

See the documentation for a more in-depth explanation, or try it out with the provided example.
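Once the server is up, you can call it like any OpenAI-style speech endpoint. Below is a minimal client sketch, assuming the server exposes the standard OpenAI `/v1/audio/speech` route with OpenAI-style field names; check the server's docs for the exact request schema:

```python
# Minimal client sketch for the OpenAI-compatible server started above.
# Assumptions: the standard OpenAI /v1/audio/speech route, OpenAI-style
# field names, and raw audio bytes in the response body.
import requests

response = requests.post(
    "http://127.0.0.1:8000/v1/audio/speech",
    json={
        "model": "AstraMindAI/xttsv2",  # model name passed to auralis.openai
        "input": "Hello Earth! This is Auralis speaking.",
        "voice": "reference.wav",       # hypothetical: voice selection may differ
    },
)
response.raise_for_status()

with open("hello_server.wav", "wb") as f:
    f.write(response.content)
```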

Key Features 🛸

Speed & Efficiency

Easy Integration

Audio Quality

XTTSv2 Finetunes

You can use your own XTTSv2 fine-tunes by converting them from the standard Coqui checkpoint format to our safetensors format. Use this script:

python checkpoint_converter.py path/to/checkpoint.pth --output_dir path/to/output

It will create two folders: one with the core XTTSv2 checkpoint and one with the GPT-2 component. Then create a TTS instance with

tts = TTS().from_pretrained("some/core-xttsv2_model", gpt_model='some/xttsv2-gpt_model')

Examples & Usage 🚀

Basic Examples ⭐

Simple Text Generation

```python
from auralis import TTS, TTSRequest

# Initialize
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

# Basic generation
request = TTSRequest(
    text="Hello Earth! This is Auralis speaking.",
    speaker_files=["speaker.wav"]
)
output = tts.generate_speech(request)
output.save("hello.wav")
```
Working with TTSRequest 🎤

```python
# Basic request
request = TTSRequest(
    text="Hello world!",
    speaker_files=["speaker.wav"]
)

# Enhanced audio processing
request = TTSRequest(
    text="Pristine audio quality",
    speaker_files=["speaker.wav"],
    audio_config=AudioPreprocessingConfig(
        normalize=True,
        trim_silence=True,
        enhance_speech=True,
        enhance_amount=1.5
    )
)

# Language-specific request
request = TTSRequest(
    text="Bonjour le monde!",
    speaker_files=["speaker.wav"],
    language="fr"
)

# Streaming configuration
request = TTSRequest(
    text="Very long text...",
    speaker_files=["speaker.wav"],
    stream=True,
)

# Generation parameters
request = TTSRequest(
    text="Creative variations",
    speaker_files=["speaker.wav"],
    temperature=0.8,
    top_p=0.9,
    top_k=50
)
```
Working with TTSOutput 🎧

```python
# Load audio file
output = TTSOutput.from_file("input.wav")

# Format conversion
output.bit_depth = 32
output.channel = 2
tensor_audio = output.to_tensor()
audio_bytes = output.to_bytes()

# Audio processing
resampled = output.resample(target_sr=44100)
faster = output.change_speed(1.5)
num_samples, sample_rate, duration = output.get_info()

# Combine multiple outputs
combined = TTSOutput.combine_outputs([output1, output2, output3])

# Playback and saving
output.play()     # Play audio
output.preview()  # Smart playback (Jupyter/system)
output.save("processed.wav", sample_rate=44100)
```

Synchronous Advanced Examples 🌟

Batch Text Processing

```python
# Process multiple texts with the same voice
texts = ["First paragraph.", "Second paragraph.", "Third paragraph."]
requests = [
    TTSRequest(
        text=text,
        speaker_files=["speaker.wav"]
    ) for text in texts
]

# Sequential processing with progress
outputs = []
for i, req in enumerate(requests, 1):
    print(f"Processing text {i}/{len(requests)}")
    outputs.append(tts.generate_speech(req))

# Combine all outputs
combined = TTSOutput.combine_outputs(outputs)
combined.save("combined_output.wav")
```
Book Chapter Processing

```python
def process_book(chapter_file: str, speaker_file: str):
    # Read chapter
    with open(chapter_file, 'r') as f:
        chapter = f.read()

    # You can pass the whole book; auralis will take care of splitting
    request = TTSRequest(
        text=chapter,
        speaker_files=[speaker_file],
        audio_config=AudioPreprocessingConfig(
            enhance_speech=True,
            normalize=True
        )
    )

    output = tts.generate_speech(request)
    output.play()
    output.save("chapter_output.wav")
```

Asynchronous Examples 🛸

Basic Async Generation

```python
import asyncio
from auralis import TTS, TTSRequest

async def generate_speech():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

    request = TTSRequest(
        text="Async generation example",
        speaker_files=["speaker.wav"]
    )

    output = await tts.generate_speech_async(request)
    output.save("async_output.wav")

asyncio.run(generate_speech())
```
Parallel Processing

```python
async def generate_parallel():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

    # Create multiple requests
    requests = [
        TTSRequest(
            text=f"This is voice {i}",
            speaker_files=[f"speaker_{i}.wav"]
        ) for i in range(3)
    ]

    # Process in parallel
    coroutines = [tts.generate_speech_async(req) for req in requests]
    outputs = await asyncio.gather(*coroutines, return_exceptions=True)

    # Handle results
    valid_outputs = [
        out for out in outputs
        if not isinstance(out, Exception)
    ]
    combined = TTSOutput.combine_outputs(valid_outputs)
    combined.save("parallel_output.wav")

asyncio.run(generate_parallel())
```
Async Streaming with Multiple Requests

```python
async def stream_multiple_texts():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

    # Prepare streaming requests
    texts = [
        "First long text...",
        "Second long text...",
        "Third long text..."
    ]
    requests = [
        TTSRequest(
            text=text,
            speaker_files=["speaker.wav"],
            stream=True,
        ) for text in texts
    ]

    # Process streams in parallel
    coroutines = [tts.generate_speech_async(req) for req in requests]
    streams = await asyncio.gather(*coroutines)

    # Collect outputs
    output_container = {i: [] for i in range(len(requests))}

    async def process_stream(idx, stream):
        async for chunk in stream:
            output_container[idx].append(chunk)
            print(f"Processed chunk for text {idx+1}")

    # Process all streams
    await asyncio.gather(
        *(process_stream(i, stream) for i, stream in enumerate(streams))
    )

    # Save results
    for idx, chunks in output_container.items():
        TTSOutput.combine_outputs(chunks).save(
            f"text_{idx}_output.wav"
        )

asyncio.run(stream_multiple_texts())
```

Core Classes 🌟

TTSRequest - Unified request container with audio enhancement 🎤

```python
@dataclass
class TTSRequest:
    """Container for TTS inference request data"""
    # Request metadata
    text: Union[AsyncGenerator[str, None], str, List[str]]

    speaker_files: Union[List[str], bytes]  # Path to the speaker audio file
    enhance_speech: bool = True
    audio_config: AudioPreprocessingConfig = field(default_factory=AudioPreprocessingConfig)
    language: SupportedLanguages = "auto"
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    load_sample_rate: int = 22050
    sound_norm_refs: bool = False

    # Voice conditioning parameters
    max_ref_length: int = 60
    gpt_cond_len: int = 30
    gpt_cond_chunk_len: int = 4

    # Generation parameters
    stream: bool = False
    temperature: float = 0.75
    top_p: float = 0.85
    top_k: int = 50
    repetition_penalty: float = 5.0
    length_penalty: float = 1.0
    do_sample: bool = True
```

### Examples

```python
# Basic usage
request = TTSRequest(
    text="Hello world!",
    speaker_files=["reference.wav"]
)

# With custom audio enhancement
request = TTSRequest(
    text="Hello world!",
    speaker_files=["reference.wav"],
    audio_config=AudioPreprocessingConfig(
        normalize=True,
        trim_silence=True,
        enhance_speech=True,
        enhance_amount=1.5
    )
)

# Streaming long text
request = TTSRequest(
    text="Very long text...",
    speaker_files=["reference.wav"],
    stream=True,
)
```

### Features

- Automatic language detection
- Audio preprocessing & enhancement
- Flexible input handling (strings, lists, generators)
- Configurable generation parameters
- Caching for efficient processing
TTSOutput - Unified output container for audio processing 🎧

```python
@dataclass
class TTSOutput:
    array: np.ndarray
    sample_rate: int
```

### Methods

#### Format Conversion

```python
output.to_tensor()    # → torch.Tensor
output.to_bytes()     # → bytes (wav/raw)
output.from_tensor()  # → TTSOutput
output.from_file()    # → TTSOutput
```

#### Audio Processing

```python
output.combine_outputs()  # Combine multiple outputs
output.resample()         # Change sample rate
output.get_info()         # Get audio properties
output.change_speed()     # Modify playback speed
```

#### File & Playback

```python
output.save()     # Save to file
output.play()     # Play audio
output.display()  # Show in Jupyter
output.preview()  # Smart playback
```

### Examples

```python
# Load and process
output = TTSOutput.from_file("input.wav")
output = output.resample(target_sr=44100)
output.save("output.wav")

# Combine multiple outputs
combined = TTSOutput.combine_outputs([output1, output2, output3])

# Change playback speed
faster = output.change_speed(1.5)
```

Languages 🌍

XTTSv2 Supports: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Simplified), Hungarian, Korean, Japanese, Hindi
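Language defaults to automatic detection; you can also pin it explicitly via the language field of TTSRequest. A small sketch below, assuming "es" as the ISO 639-1 code for Spanish, in the same style as the "fr" example above:

```python
# language="auto" (the default) detects the language from the text;
# pass an explicit code to pin it. "es" (Spanish) is assumed here,
# following the ISO 639-1 style of the "fr" example above.
request = TTSRequest(
    text="Hola mundo, esto es Auralis.",
    speaker_files=["speaker.wav"],
    language="es",
)
output = tts.generate_speech(request)
output.save("hola.wav")
```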

Performance Details 📊

Processing speeds on NVIDIA 3090:

Memory usage:
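To benchmark throughput on your own hardware, here is a minimal sketch using only the generate_speech and get_info APIs shown above ("reference.wav" is a placeholder speaker file):

```python
# Minimal throughput measurement sketch, using only the API shown above.
import time

from auralis import TTS, TTSRequest

tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

# "reference.wav" is a placeholder; point it at your own speaker clip.
request = TTSRequest(
    text="A paragraph of test text to synthesize. " * 10,
    speaker_files=["reference.wav"],
)

start = time.perf_counter()
output = tts.generate_speech(request)
elapsed = time.perf_counter() - start

# get_info() is documented above: (num_samples, sample_rate, duration)
num_samples, sample_rate, duration = output.get_info()
print(f"Generated {duration:.1f}s of audio in {elapsed:.1f}s "
      f"({duration / elapsed:.1f}x real-time)")
```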

Learn More 🔭

License

The codebase is released under the Apache 2.0 license; feel free to use it in your projects.

The XTTSv2 model (and the files under auralis/models/xttsv2/components/tts) are licensed under the Coqui AI License.