
Tabular Data with LLMs

Problem Statement

Enhancing Interactions with Tabular Data Containing Numerical Values using Language Models

Background:

Large language models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation tasks. However, their effectiveness in interacting with tabular data, particularly data containing numerical values, is limited. There is therefore a need to enhance the interaction capabilities of LLMs with tabular data to unlock their full potential for applications such as data analysis, decision-making, and automation.

Problem:

The existing challenge lies in developing an effective approach that enables LLMs to understand, interpret, and perform complex operations on tabular data with numerical values seamlessly. The problem can be broken down into the following sub-problems:

  1. Tabular Data Understanding: LLMs need to comprehend the structure and semantics of tabular data, including recognizing column headers, understanding data types, and identifying relationships between columns.

  2. Numerical Value Processing: LLMs should be capable of comprehending and manipulating numerical values in tabular data. This includes performing calculations, aggregations, comparisons, and transformations on numeric data, taking into account their context within the table.

  3. Contextual Reasoning: LLMs should be able to reason contextually based on the tabular data and associated textual information. They should be capable of inferring meaningful insights, making predictions, and answering questions related to the tabular data.

  4. Data Integration: To enhance their utility, LLMs need to integrate tabular data with other types of information, such as textual descriptions, queries, or user instructions. This integration should be seamless and allow for flexible querying, filtering, and summarization of tabular data.

  5. Scalability and Efficiency: The proposed solution should be scalable to handle large tabular datasets efficiently. LLMs should be able to process and interact with tabular data in a time-efficient manner, allowing for real-time or near-real-time applications.

Objective:

The objective of this research is to develop novel techniques and methodologies to enhance the interaction capabilities of LLMs with tabular data containing numerical values. The aim is to enable LLMs to effectively understand, process, reason, and integrate tabular data, thereby expanding their applications and improving their usability in various domains that heavily rely on numerical tabular information.

Impact:

The proposed solution has the potential to revolutionize data analysis, decision-making, and automation by enabling LLMs to interact seamlessly with tabular data containing numerical values. This can benefit fields such as finance, healthcare, scientific research, and business intelligence, where numerical data is prevalent. By bridging the gap between language models and tabular data, this research can facilitate more efficient and intelligent data-driven decision-making processes.

Tried Approaches

We have explored several LangChain agents and chains and ultimately settled on SQLDatabaseChain. Our initial choice was the Pandas Dataframe Agent, but it failed because the extensive dataframe context it required exceeded the token limit. We then tested the SQL Database Agent, which handled some queries well but hit the request limit because of its repeated calls to the model. Finally, we opted for SQLDatabaseChain, which worked in most cases but also ran into token limit issues on queries spanning a large data range.
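
For reference, here is a minimal sketch of how two of these approaches can be wired up. The import paths depend on the LangChain version (newer releases move some of these classes into `langchain_experimental`), and the file path, database URI, and question are placeholders; the SQL Database Agent follows a similar pattern on top of the same `SQLDatabase` object.

```python
import pandas as pd
from langchain.llms import OpenAI
from langchain.agents import create_pandas_dataframe_agent
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

llm = OpenAI(temperature=0)

# Pandas Dataframe Agent: passes a lot of dataframe context to the model,
# which is what pushed us past the token limit on larger tables.
df = pd.read_csv("data.csv")  # placeholder path
pandas_agent = create_pandas_dataframe_agent(llm, df, verbose=True)

# SQLDatabaseChain: asks the LLM to write a SQL query, runs it against
# the database, and summarises the result.
db = SQLDatabase.from_uri("sqlite:///data.db")  # placeholder URI
sql_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

print(sql_chain.run("What is the average value per category?"))  # example question
```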

In Progress

We are now exploring the use of a vector database with similarity search to fetch only the relevant information, which could address the token limit issue. By retrieving only the data most closely related to a query, the system reduces the number of tokens that need to be processed.
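
As a rough sketch of that idea, the table rows could be serialised to text, embedded, and indexed in a vector store so that only the rows most similar to the question are retrieved and placed in the prompt. The file path, question, and choice of FAISS with OpenAI embeddings below are illustrative assumptions, not a fixed design.

```python
import pandas as pd
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

df = pd.read_csv("data.csv")  # placeholder path

# One text chunk per row, keeping column names so the schema is preserved.
row_texts = [
    "; ".join(f"{col}: {val}" for col, val in row.items())
    for _, row in df.iterrows()
]

# Embed the rows and index them in a vector store.
vectorstore = FAISS.from_texts(row_texts, OpenAIEmbeddings())

# Retrieve only the k most similar rows instead of the whole table,
# keeping the prompt within the token limit.
question = "Which category had the highest total in 2022?"  # example question
relevant_rows = vectorstore.similarity_search(question, k=5)
context = "\n".join(doc.page_content for doc in relevant_rows)
# `context` can now be inserted into the LLM prompt alongside the question.
```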

If you have any suggestions to help us get around these limitations, please let us know!


Made with ❤ by KillerStrike & iamrk04