Open bdb-dd opened 6 months ago
Articles that we do not have an English translation for:
docs-altinn-studio-analysis.csv
Field explanation:
is_missing - we are missing an English version of the article.
is_mislabeled - we have an article in the English sitemap that is not written in English
token_count - the number of tokens as counted by the OpenAI tokenizer.
actual_language - as evaluated by the gemma-7b-it model hosted on Groq Cloud.
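As a rough illustration of how these fields could be used, here is a minimal Python sketch that picks out the articles needing translation work and sorts the documents by token count. It makes some assumptions: the CSV has a header row with the field names above, the boolean fields are stored as strings such as `True` or `1`, and a `url` column exists (that column name is hypothetical and appears only in the sample data). The helper names are illustrative, not part of any existing tooling.

```python
import csv
import io

# Assumption: boolean columns are serialized as one of these strings.
TRUTHY = {"true", "1", "yes"}


def load_rows(csv_text):
    """Parse CSV text (with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def is_flagged(row, field):
    """Return True if the given boolean-like field is set on this row."""
    return row.get(field, "").strip().lower() in TRUTHY


def needs_translation_work(rows):
    """Rows that are missing an English version or are mislabeled."""
    return [
        r for r in rows
        if is_flagged(r, "is_missing") or is_flagged(r, "is_mislabeled")
    ]


def sort_by_token_count(rows):
    """All rows, sorted by token_count descending."""
    return sorted(rows, key=lambda r: int(r["token_count"]), reverse=True)
```

To run it against the attached file, read the file contents and pass them in, for example `load_rows(open("docs-altinn-studio-analysis.csv", encoding="utf-8").read())`.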
List of all documents from docs.altinn.studio, sorted by token count descending:
Description
We have identified some quality issues that must be addressed in order to be able to launch Altinn Assistant with sufficient answer quality.
Language
Because all current LLMs are significantly better at coding and reasoning in English, we have implemented our core pipeline to be English only and translate the final result to the desired language. Most of the leading open source models focus primarily on English, so we will likely continue to see the best results with this approach.
Currently, several documentation articles are written in Norwegian but labeled as English, and vice versa. This prevents us from reliably providing the correct background information to the LLM.
Article length
Despite the rapid progress in increasing context window sizes, current models perform worse when given too much context or irrelevant context. Work on an improved chunking implementation is ongoing, but it will not fully compensate for a lack of logical separation of topics into multiple articles.
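To make the limitation concrete, here is a minimal sketch of heading-based chunking in Python. It is not the chunking implementation referred to above, only an illustration under the assumption that articles are Markdown with `#`-style headings; the function name is hypothetical. A splitter like this can only cut where the author has already placed a heading, so an article that mixes several topics under one heading still ends up as one oversized chunk.

```python
import re

# Assumption: articles use ATX-style Markdown headings ("# Title", "## Sub").
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_headings(markdown_text):
    """Split Markdown text into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [c for c in chunks if c]
```

Note that this naive version does not skip fenced code blocks, so a `# comment` line inside a code sample would be treated as a heading.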