Dijital-Twin / model


perf: Optimize QA Model #3

Closed emirsoyturk closed 7 months ago

emirsoyturk commented 7 months ago

Goal

Response times of QA models can increase as the context length grows. The purpose of this task is to optimize and shorten this time.

Steps

MGurcan commented 7 months ago

Commit Url

Optimizing the response times of QA (Question Answering) models without compromising accuracy is a significant challenge, especially as the length of the context increases. Below you can find insights into how QA models work, strategies to reduce response times, and finally the Haystack tool for optimising responses.

How Do QA Models Work?

QA models, particularly those based on transformers like BERT, RoBERTa, or DistilBERT, operate by understanding the context provided to them and then predicting the start and end tokens of the answer within that context. The process involves encoding the question and context as input embeddings and then processing these embeddings through the model's layers to capture the relationships between the question and the context. The model outputs, for each token in the context, the probability of it being the start or end of the answer. We will use the highest-scoring start and end indexes as the output here. [1]
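To make the start/end prediction concrete, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name (distilbert-base-cased-distilled-squad) and the question/context strings are illustrative assumptions, not values from this project.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; any extractive QA model could be swapped in.
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Who created Haystack?"
context = "Haystack is an open source framework created by deepset for building LLM apps."

# Encode question + context together; the model scores every context token
# as a potential start or end of the answer span.
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring start and end indexes and decode the span between them.
start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx + 1])
print(answer)
```

The argmax over the start and end logits is the "highest-scoring start and end indexes" step described above.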

Strategies for Shortening Response Times

- Context Segmentation: Large contexts can be segmented into smaller chunks. By processing these smaller chunks independently, the model focuses on less text at a time, potentially reducing processing time (see the sketch after this list).
- Efficient Batching: Grouping multiple questions or context chunks into a single batch can reduce total computation time by leveraging the parallel processing capabilities of modern GPUs. (We will probably not be able to batch questions, because the dialogues are single-response.)
- Model Distillation: Using distilled versions of large models (e.g., DistilBERT instead of BERT) reduces complexity while retaining much of the original model's understanding capability. (Both versions will be tried; if the response time decreases while accuracy holds, the distilled model may be used.)
- Adaptive Context Selection: Implement pre-selection algorithms to choose the most relevant sections of the context for the question before passing them to the QA model. Techniques like keyword matching, semantic search, or smaller, faster relevance models can be effective.
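As an illustration of the Context Segmentation idea, here is a rough sketch that splits a long context into overlapping chunks and keeps the highest-scoring answer; the transformers question-answering pipeline, the chunk sizes, and the example variable names are assumptions for illustration, not project code.

```python
from transformers import pipeline

# Illustrative checkpoint; the distilled model keeps per-chunk latency low.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def answer_with_chunks(question, context, chunk_size=1000, overlap=200):
    # Split the long context into overlapping character windows so that an
    # answer spanning a chunk boundary is not lost.
    chunks = [context[i:i + chunk_size]
              for i in range(0, len(context), chunk_size - overlap)]
    # Run each (smaller) chunk through the model and keep the best-scoring answer.
    results = [qa(question=question, context=chunk) for chunk in chunks]
    return max(results, key=lambda r: r["score"])

# Usage (hypothetical variable): answer_with_chunks("Where was he born?", long_dialogue_text)
```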

Haystack for Optimisation

Haystack is the open source Python framework by deepset for building custom apps with large language models (LLMs). It lets you quickly try out the latest models in natural language processing (NLP) while being flexible and easy to use. [2]

Some of the Haystack Components

[3] [4]

- Document Store: Haystack allows the use of efficient document stores like Elasticsearch, FAISS, or Milvus, which can quickly retrieve relevant documents or context segments based on the question. This reduces the amount of text the QA model needs to process.
- Retriever-Reader Pipeline: Implementing a Retriever-Reader pipeline in Haystack, where the Retriever quickly identifies relevant documents or context segments and the Reader then focuses on these to find the answer, can significantly reduce processing time. The Retriever step acts as a filter, limiting the amount of data the more computationally intensive Reader model needs to process. (The top_k parameter affects the behaviour of both the Retriever and the Reader; the performance changes will be analysed in terms of accuracy and response time, as in the snippet and sketch below.)

query=item['Question'], params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 5}}
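For context, the snippet above fits into a Retriever-Reader pipeline roughly like the following sketch, written against the Haystack v1.x API; the in-memory document store, the BM25 retriever, the reader checkpoint, and the top_k values are placeholder assumptions to be tuned during the accuracy/response-time analysis.

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Placeholder store and content; a persistent store (Elasticsearch, FAISS, ...)
# could be swapped in without changing the pipeline structure.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([{"content": "Haystack was created by deepset."}])

retriever = BM25Retriever(document_store=document_store)  # cheap pre-filter
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")  # heavier QA model

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

item = {"Question": "Who created Haystack?"}
result = pipeline.run(
    query=item["Question"],
    params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 5}},
)
print(result["answers"][0].answer)
```

Lowering the Retriever top_k shrinks the text the Reader must scan, which is where most of the response-time savings would come from.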

These optimizations can be achieved without significantly compromising the accuracy of the answers provided by the model, ensuring a balance between performance and quality.