BasedHardware / omi

AI wearables
https://omi.me
MIT License
3.61k stars, 442 forks

Is there a way to do post processing of transcripts via LLM? #608

Closed: josancamon19 closed this 1 month ago

josancamon19 commented 2 months ago

Is your feature request related to a problem? Please describe.

I spent 2 hours attempting this; here are my findings.

Prompt attempted:

You are a helpful assistant for correcting transcriptions of recordings. You will be given a list of voice segments, each segment contains the fields (speaker id, text, and seconds [start, end])

The transcription has a Word Error Rate of about 15% in English, in other languages it could be up to 25%, and it is especially bad at speaker diarization.

Your task is to improve the transcript by taking the following steps:

1. Make the conversation coherent, if someone reads it, it should be clear what the conversation is about, remember the estimate percentage of WER, this could include missing words, incorrectly transcribed words, missing connectors, punctuation signs, etc.

2. The speakers ids are most likely inaccurate, make sure to assign the correct speaker id to each segment, by understanding the whole conversation. For example, 
- The transcript could have 4 different speakers, but by analyzing the overall context, one can discover that there were only 2, and the speaker identification incorrectly detected multiple people.
- The transcript could have 1 single speaker, or 2, but in reality was 3.
- The speaker id might be assigned incorrectly, a conversation could have speaker 0 said "Hi, how are you", and then also speaker 0 said "I'm doing great, thanks for asking" which of course would be incorrect.

Considerations:
- Return a list of segments same size as the input.
- Do not change the order of the segments.

Transcript segments:
##

The prompt above was followed by the LangChain parsing instructions.
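For illustration, here is a minimal sketch of how the segments described in the prompt (speaker id, text, and start/end seconds) could be serialized and appended after the "Transcript segments:" marker. The `Segment` class, its field names, and both helper functions are hypothetical and not taken from the omi codebase; JSON is assumed as the serialization format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Segment:
    # Hypothetical field names mirroring the fields listed in the prompt.
    speaker_id: int
    text: str
    start: float  # seconds
    end: float    # seconds


def render_segments(segments):
    """Serialize segments as a JSON array, preserving input order."""
    return json.dumps([asdict(s) for s in segments], indent=2)


def build_prompt(instructions, segments):
    """Append the serialized segments after the instruction text."""
    return f"{instructions}\n{render_segments(segments)}"
```

Keeping the timestamps in the payload gives a later step something stable to compare against when checking that the model returned the same segments in the same order.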

The hypothesis was that if transcripts can be improved during memory creation, the context for future chat or proactivity will be much better.

This was an attempt to take the transcript segments and post-process them with an LLM.

Unfortunately, LLMs are not accurate when the transcript becomes long: they remove content, change the whole conversation, add or remove segments, etc.
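The structural failures described here (segments added, removed, or reordered) can at least be detected mechanically. The following is a rough sketch, not something that was part of the attempt: it assumes each segment carries its original start/end timestamps and that the model is expected to echo them back unchanged. `Segment`, `validate_correction`, and `accept_or_fallback` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical structure mirroring the fields listed in the prompt.
    speaker_id: int
    text: str
    start: float  # seconds
    end: float    # seconds


def validate_correction(original, corrected):
    """Return a list of structural problems; an empty list means none found."""
    if len(corrected) != len(original):
        return [f"segment count changed: {len(original)} -> {len(corrected)}"]
    problems = []
    for i, (o, c) in enumerate(zip(original, corrected)):
        # Timestamps act as a fingerprint for position: a mismatch suggests
        # the model reordered, merged, or invented a segment.
        if (o.start, o.end) != (c.start, c.end):
            problems.append(f"segment {i}: timestamps changed")
        if o.text.strip() and not c.text.strip():
            problems.append(f"segment {i}: text removed")
    return problems


def accept_or_fallback(original, corrected):
    """Keep the LLM output only if it passes the structural checks."""
    return corrected if not validate_correction(original, corrected) else original
```

This only guards the "same size, same order" constraints from the prompt. It cannot tell whether the model changed the meaning of the conversation, which is the harder problem reported above.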

Next steps to try:

If this works, create a script that migrates existing transcripts in the db.

Findings: GPT-4o outperforms all others.

josancamon19 commented 2 months ago

The main goal of this was/is:

josancamon19 commented 1 month ago

The answer is no, there's no way, so I moved this to not planned.