Description

A wide relational database table has a large number of fields, many of them redundant. When the summary includes every field, it may exceed the maximum sequence length accepted by the embedding model, so the generated embedding no longer accurately reflects the semantic information of the summary. For wide tables, I therefore separate the fields from the table's basic information (table name and table description). If the table has too many fields, they are split into multiple chunks during summarization, and each chunk stays within the embedding model's maximum sequence length. If the table is not wide, the summary is unchanged: the table name, table description, and fields all go into the same chunk.

During retrieval, the table name is retrieved first. The table name (id) is then used as a filter while the query is used for vector retrieval, and the table name and table description are assembled with the retrieved fields to form the final result.
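The splitting and reassembly described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `split_fields_into_chunks`, `summarize_table`, `assemble_result`, and the dict layout are hypothetical names, length is measured in characters rather than model tokens, and the vector search step is left out (only the table-name filter and the final assembly are shown).

```python
from typing import Dict, List


def split_fields_into_chunks(fields: List[str], max_len: int, sep: str = "\n") -> List[str]:
    """Greedily pack field descriptions into chunks of at most max_len characters.

    A single field longer than max_len still gets its own (oversized) chunk;
    handling that case is out of scope for this sketch.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for field in fields:
        # Cost of appending this field, including the separator if needed.
        extra = len(field) + (len(sep) if current else 0)
        if current and current_len + extra > max_len:
            chunks.append(sep.join(current))
            current, current_len = [field], len(field)
        else:
            current.append(field)
            current_len += extra
    if current:
        chunks.append(sep.join(current))
    return chunks


def summarize_table(
    table_name: str, description: str, fields: List[str], max_len: int
) -> List[Dict[str, str]]:
    """Return one chunk for a narrow table, or a header chunk plus field chunks."""
    header = f"{table_name}: {description}"
    full = "\n".join([header] + fields)
    if len(full) <= max_len:
        # Not wide: name, description, and fields stay in the same chunk.
        return [{"table": table_name, "type": "table", "content": full}]
    docs = [{"table": table_name, "type": "table", "content": header}]
    for chunk in split_fields_into_chunks(fields, max_len):
        docs.append({"table": table_name, "type": "field", "content": chunk})
    return docs


def assemble_result(docs: List[Dict[str, str]], table_name: str) -> str:
    """Filter chunks by table name and join the header with its field chunks."""
    table_doc = next(
        d for d in docs if d["table"] == table_name and d["type"] == "table"
    )
    field_docs = [
        d for d in docs if d["table"] == table_name and d["type"] == "field"
    ]
    return "\n".join([table_doc["content"]] + [d["content"] for d in field_docs])
```

In an actual retriever the field chunks passing the table-name filter would additionally be ranked by vector similarity to the query, so only the most relevant fields would be assembled with the table name and description.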

How Has This Been Tested?

Wide-table summarization and retrieval are tested in dbgpt/rag/assembler/tests/test_db_struct_assembler.py and dbgpt/rag/assembler/tests/test_embedding_assembler.py, respectively.

Snapshots:

Checklist:

[x] My code follows the style guidelines of this project
[x] I have already rebased the commits and made the commit messages conform to the project standard.
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[x] Any dependent changes have been merged and published in downstream modules

eosphoros-ai / DB-GPT

Feat rdb summary wide table #2035
