Altinn / altinn-studio-docs

Documentation for Altinn 3
https://docs.altinn.studio
Apache License 2.0

Normalize documentation language and max length #1593

Open bdb-dd opened 6 months ago

bdb-dd commented 6 months ago

Description

We have identified documentation quality issues that must be addressed before Altinn Assistant can launch with sufficient answer quality.

Language

Because current LLMs are significantly better at coding and reasoning in English, we have implemented our core pipeline in English only and translate the final result into the desired language. Most leading open-source models focus primarily on English, so this approach will likely continue to give the best results.
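For readers unfamiliar with this setup, here is a minimal sketch of the "English core, translate last" flow described above. The function names `answer_in_english` and `translate` are hypothetical placeholders for the real pipeline stages, which are not shown in this issue. The sketch does not cover whether the incoming question is translated to English first.

```python
from typing import Callable


def answer(
    question: str,
    target_language: str,
    answer_in_english: Callable[[str], str],  # hypothetical: the English-only core pipeline
    translate: Callable[[str, str], str],     # hypothetical: (text, language code) -> translated text
) -> str:
    """Run the English-only core, then translate only the final result."""
    english_answer = answer_in_english(question)
    if target_language == "en":
        # No translation step is needed when the user already wants English.
        return english_answer
    return translate(english_answer, target_language)
```

Keeping translation as the last step means retrieval and reasoning only ever see English text, which is why mislabeled articles are a problem for this design.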

Currently, several documentation articles are written in Norwegian but labeled as English, and vice versa. This prevents us from reliably providing the correct background information to the LLM.

Article length

Despite the rapid progress in increasing context window sizes, current models perform worse when given too much context or irrelevant context. Work on an improved chunking implementation is under way, but it will not fully compensate for a lack of logical separation of topics into multiple articles.
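One way to spot articles that would benefit from being split is to look at how their length is distributed across sections. The sketch below assumes the articles are Markdown with ATX headings (`#`, `##`, ...) and takes the token counter as a parameter, so any tokenizer can be plugged in. `section_sizes` is a hypothetical helper, not part of the existing pipeline, and it does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import Callable, List, Tuple

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def section_sizes(
    markdown: str, count_tokens: Callable[[str], int]
) -> List[Tuple[str, int]]:
    """Return (heading, token_count) for each section, in document order."""
    sections = []
    title = "(preamble)"  # text that appears before the first heading
    body: List[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section and start a new one.
            sections.append((title, count_tokens("\n".join(body))))
            title = match.group(2)
            body = []
        else:
            body.append(line)
    sections.append((title, count_tokens("\n".join(body))))
    # Drop the preamble entry when nothing precedes the first heading.
    return [(t, n) for t, n in sections if not (t == "(preamble)" and n == 0)]
```

An article with several large sections on unrelated topics is a candidate for splitting into separate articles.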

### Tasks
- [x] Compile a list of all articles, sorted by token length. Include the language tag and compute the actual language using an LLM.
- [ ] Ensure that all documents have an up-to-date English version, correctly tagged.
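
For reference, the first task can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that produced the attached CSV files. The `Article` type and the `path` and `language_tag` names are hypothetical. The token counter and the language detector are passed in as functions, and the two stand-ins at the bottom are deliberately naive placeholders for a real tokenizer and an LLM-based language check.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

FIELDS = ["path", "language_tag", "actual_language", "is_mislabeled", "token_count"]


@dataclass
class Article:
    path: str          # hypothetical identifier, e.g. a URL or file path
    language_tag: str  # the language the article is labeled with, e.g. "en" or "nb"
    text: str


def build_inventory(
    articles: Iterable[Article],
    count_tokens: Callable[[str], int],
    detect_language: Callable[[str], str],
) -> List[Dict]:
    """Return one row per article, sorted by token count, largest first."""
    rows = []
    for article in articles:
        actual = detect_language(article.text)
        rows.append({
            "path": article.path,
            "language_tag": article.language_tag,
            "actual_language": actual,
            "is_mislabeled": actual != article.language_tag,
            "token_count": count_tokens(article.text),
        })
    rows.sort(key=lambda row: row["token_count"], reverse=True)
    return rows


def to_csv(rows: List[Dict]) -> str:
    """Serialize the inventory rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Naive stand-ins (assumptions for illustration only).
def whitespace_token_count(text: str) -> int:
    return len(text.split())


NORWEGIAN_HINTS = {"og", "ikke", "det", "som", "er", "på", "av", "en"}


def naive_detect_language(text: str) -> str:
    words = text.lower().split()
    hits = sum(1 for word in words if word in NORWEGIAN_HINTS)
    return "nb" if words and hits / len(words) > 0.2 else "en"
```

Injecting `count_tokens` and `detect_language` keeps the sorting and mislabel logic independent of whichever tokenizer or model is used for the real run.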
bdb-dd commented 6 months ago

Articles that we do not have an English translation for:

docs-altinn-studio-analysis.csv

Field explanation:

- is_missing: we are missing an English version of the article.
- is_mislabeled: we have an article in the English sitemap that is not written in English.
- token_count: the number of tokens as counted by the OpenAI tokenizer.
- actual_language: the language as evaluated by the gemma-7b-it model hosted on Groq Cloud.
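
A small sketch of how the file could be filtered to find the articles that need work. The four field names come from the explanation above. The `url` column, the `True`/`False` encoding of the boolean fields, and the sample rows are assumptions for illustration and may not match the attached file.

```python
import csv
import io
from typing import Dict, List

# Made-up sample rows; the real file may use different columns and values.
SAMPLE_CSV = """url,is_missing,is_mislabeled,token_count,actual_language
https://example.invalid/a,True,False,1200,nb
https://example.invalid/b,False,True,800,nb
https://example.invalid/c,False,False,300,en
"""


def parse_bool(value: str) -> bool:
    """Accept a few common spellings of a true value."""
    return value.strip().lower() in {"true", "1", "yes"}


def load_analysis(csv_text: str) -> List[Dict]:
    """Parse the analysis CSV and convert the typed fields."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = dict(raw)
        row["is_missing"] = parse_bool(raw["is_missing"])
        row["is_mislabeled"] = parse_bool(raw["is_mislabeled"])
        row["token_count"] = int(raw["token_count"])
        rows.append(row)
    return rows


def needs_attention(rows: List[Dict]) -> List[Dict]:
    """Articles that lack an English version or carry the wrong language label."""
    return [row for row in rows if row["is_missing"] or row["is_mislabeled"]]
```

For example, `needs_attention(load_analysis(SAMPLE_CSV))` keeps the first two sample rows and drops the third.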

bdb-dd commented 6 months ago

List of all documents from docs.altinn.studio, sorted by token count descending:

all-docs.csv