PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device and 100% private.
Apache License 2.0

How to make LocalGPT translate everything into English before storing and processing inputs #725

Open PayteR opened 5 months ago

PayteR commented 5 months ago

Hi, first of all, I really like this project; it's better than PrivateGPT, thank you!

Secondly, I want to use LocalGPT for Slovak documents, but that's impossible because no LLM model can work with the Slovak language. So I made a small test Python script that translates the Slovak text to English, summarises it, and translates the summary back to Slovak.

And the result was actually very good. So I suggest that LocalGPT could have a configuration option for translation models: inputs would be translated to English before being processed, so ChromaDB would store only English text, the LLM would process the text and produce its output in English, and that output would be translated back before being returned to the user. What do you think about this idea? Is it even possible? I'm a noob in Python so I can't tell. Here is my script that does this for summarisation:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Slovak -> English translation model (MarianMT)
model_name = 'Helsinki-NLP/opus-mt-sk-en'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
translator = pipeline('translation', model=model, tokenizer=tokenizer)

# Read the Slovak source text from a file
with open('path/to/file.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Naive chunking: split by character count (may cut words and sentences mid-way)
def chunk_text(text, max_length):
    return [text[i:i+max_length] for i in range(0, len(text), max_length)]

# Split the text into chunks
max_length = 500
chunks = chunk_text(text, max_length)

# Translate each chunk
translated_chunks = []
for chunk in chunks:
    output = translator(chunk, max_length=max_length)
    translated_chunks.append(output[0]['translation_text'])

# Combine the translated chunks
translated_text = ' '.join(translated_chunks)

# Summarize the translated text
summarizer = pipeline("summarization", model="Falconsai/medical_summarization")
summary_text = summarizer(translated_text, max_length=1342, min_length=150, do_sample=False)[0]['summary_text']

# Translate the English summary back to Slovak
model_name = 'Helsinki-NLP/opus-mt-en-sk'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
translator = pipeline('translation', model=model, tokenizer=tokenizer)

chunks = chunk_text(summary_text, max_length)
translated_chunks = []
for chunk in chunks:
    output = translator(chunk, max_length=max_length)
    translated_chunks.append(output[0]['translation_text'])

# Combine the translated chunks
summary_sk_text = ' '.join(translated_chunks)

print(summary_sk_text)
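
To wire something like this into LocalGPT itself, I imagine the query side could be wrapped roughly like the sketch below. This is just an idea, not working code from the project: answer_query() is a placeholder standing in for LocalGPT's existing retrieval + LLM step, and only the two Helsinki-NLP model names are real.

from transformers import pipeline

# Real MarianMT models from the Hugging Face hub; the rest is a hypothetical
# sketch of how LocalGPT could wrap its question/answer flow.
sk_to_en = pipeline('translation', model='Helsinki-NLP/opus-mt-sk-en')
en_to_sk = pipeline('translation', model='Helsinki-NLP/opus-mt-en-sk')

def answer_query(question_en):
    # Placeholder for LocalGPT's existing QA chain, which would receive and
    # return English text (documents already stored in English in ChromaDB).
    raise NotImplementedError

def ask_in_slovak(question_sk):
    question_en = sk_to_en(question_sk)[0]['translation_text']
    answer_en = answer_query(question_en)
    return en_to_sk(answer_en)[0]['translation_text']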
remireci commented 5 months ago

Hi. First of all, I would also like to congratulate PromptEngineer and all the other people working on localGPT for this amazing project, and thanks to PayteR for this comment. I'm working on an extension to localGPT to use French texts and translate the prompts and answers. I'm simply doing it alongside the localGPT application and will use APIs to add this functionality. (Right now I'm looking for a solution to correct diacritics. Didn't you have problems with diacritics in your translation?) However, the Llama model, since it is based on a transformer architecture, may respond just as accurately without translating the ingested data. It would be nice to hear whether others are experimenting with this and what the results are.
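
One thing I still need to check for the diacritics problem is whether the text comes back in decomposed Unicode (base letters plus separate combining accents), which can look like broken accents. A minimal sketch using only the standard library (fix_diacritics is just an illustrative name):

import unicodedata

def fix_diacritics(text):
    # NFC normalization recombines base letters with combining accents
    # (e.g. 'e' + U+0301 becomes 'é'), one common cause of mangled diacritics.
    return unicodedata.normalize('NFC', text)

print(fix_diacritics('de\u0301ja\u0300 vu'))  # prints 'déjà vu'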

NitkarshChourasia commented 3 months ago

@remireci You are thanking everyone, which is wonderful. But are you satisfied with the performance of the LLM output? It takes so much time to produce an answer.

Don't you think that, instead of running the models locally (security may be a concern here), it would be better to use the APIs provided by LLM vendors?

Yes, if this software is deployed on very powerful hardware, then it is worthwhile. But using it for something personal would be very inefficient, I guess. What do you say?