digitalfabrik / integreat-chat

Interface to self-hosted large language models and vector databases to provide improved Integreat Chat functionality
https://integreat-app.de
MIT License

Use NLLB 200 for translations #80

Closed by svenseeberg 6 days ago

svenseeberg commented 1 week ago

Replace the LLM translations with the NLLB-200 3.3B model.

Fix #50

svenseeberg commented 1 week ago

We can use chunking to work around the token limit:

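def split_text(text, max_length=500):
    # Pack whole sentences into chunks of roughly max_length characters.
    # Note: max_length counts characters, not tokens.
    sentences = text.split('.')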

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if not sentence.strip():
            continue
        sentence = sentence.strip() + "."
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += sentence + " "
        else:
            # Avoid appending an empty chunk when the first sentence is already too long
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
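
A minimal sketch of how the chunks could then be translated one at a time and joined again. `translate_long_text` and the `translate_chunk` callable are hypothetical names, not part of this repository; `translate_chunk` stands in for whatever function ends up calling the NLLB-200 model. The `split_text` copy is included only so the snippet runs on its own.

```python
from typing import Callable, List


def split_text(text: str, max_length: int = 500) -> List[str]:
    """Pack whole sentences into chunks of roughly max_length characters."""
    chunks: List[str] = []
    current_chunk = ""
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        sentence += "."
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += sentence + " "
        else:
            # Only flush a chunk that actually contains text
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks


def translate_long_text(
    text: str,
    translate_chunk: Callable[[str], str],  # hypothetical: wraps the model call
    max_length: int = 500,
) -> str:
    """Translate each chunk separately and join the results with spaces."""
    translated = [translate_chunk(chunk) for chunk in split_text(text, max_length)]
    return " ".join(translated)


if __name__ == "__main__":
    # Stand-in "translation" so the sketch runs without loading a model
    print(translate_long_text("first sentence. second sentence.", str.upper, max_length=20))
```

Because each chunk is translated in isolation, context across chunk boundaries is lost, so the character limit would need tuning against the model's actual token limit.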