Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models
We have published the package to PyPI: Docs2KG.
You can install it via:
pip install Docs2KG
We have a demonstration to walk through the components of Docs2KG.
Downstream usage examples are also included.
A video is available at Demo Docs2KG.
Tutorial details are available at Tutorial Docs2KG, which includes:
We also provide example code in Example Codes Docs2KG.
The source code documentation is available at Docs2KG Documentation.
In our opinion, there are three pillars of LLM applications:
Most tools on the market today focus on Retrieval-Augmented Generation (RAG) pipelines or on running Large Language Models (LLMs) locally.
Typical tools include Ollama, LangChain, LlamaIndex, etc.
However, to ensure the wider community can benefit from the latest research, we first need to solve the data problem.
The wider community includes individual users, small businesses, and even large enterprises. Some of them may have well-developed databases, but most have a lot of data that is unstructured and distributed across different places.
So the first challenges are:
This package is a proposed solution to the above challenges.
Given the nature of unstructured and heterogeneous data, information extraction and knowledge representation pose significant challenges. In this package, we introduce Docs2KG, a novel framework designed to extract multi-modal information from diverse and heterogeneous unstructured data sources, including emails, web pages, PDF files, and Excel files. Docs2KG dynamically generates a unified knowledge graph that represents the extracted information, enabling efficient querying and exploration of the data. Unlike existing approaches that focus on specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework not only simplifies data processing but also improves the interpretability of models across diverse domains.
The overall architecture design is shown below:
Data from multiple sources is processed by the Dual-Path Data Processing module. Some data, for example exported PDF files and Excel files, can be handled by programmatic parsers: it is converted into Markdown and then transformed into the unified knowledge graph. For data such as scanned PDFs and images, document layout analysis and OCR are needed to extract the information, which is then converted into Markdown and transformed into the unified knowledge graph.
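To make the routing idea concrete, here is a minimal, illustrative sketch of how one might decide whether a PDF can go down the programmatic-parser path or needs the layout-analysis + OCR path. This is not the Docs2KG implementation: the function names and the text-length heuristic are hypothetical, and it assumes the third-party pypdf package is installed (pip install pypdf).

```python
# Illustrative sketch only, not the Docs2KG API.
# Route a PDF to "programmatic parser" or "layout analysis + OCR"
# based on how much text can be extracted directly.
from pypdf import PdfReader


def needs_ocr(pdf_path: str, min_chars_per_page: int = 20) -> bool:
    """Heuristic: if the PDF exposes almost no extractable text,
    treat it as a scanned document that needs OCR."""
    reader = PdfReader(pdf_path)
    extracted = sum(
        len((page.extract_text() or "").strip()) for page in reader.pages
    )
    return extracted < min_chars_per_page * len(reader.pages)


def route(pdf_path: str) -> str:
    """Return which processing path a document would take."""
    if needs_ocr(pdf_path):
        return "layout analysis + OCR path"  # scanned PDFs, images
    return "programmatic parser path"        # exported/digital PDFs
```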
The unified multimodal knowledge graph is then generated from these outputs:
The unified multimodal knowledge graph has two main aspects:
The overall steps include:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements.dev.txt
pip install -e .
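As a companion to the markdown-to-graph step described above, the following is a minimal, illustrative sketch (again, not the actual Docs2KG API) that turns Markdown headings into a small document-structure graph. The function name, edge label, and the networkx dependency are assumptions made only for illustration.

```python
# Illustrative sketch only, not the Docs2KG implementation.
# Build a tiny document-structure graph from Markdown headings.
import re

import networkx as nx


def markdown_to_structure_graph(markdown: str) -> nx.DiGraph:
    graph = nx.DiGraph()
    graph.add_node("document", label="Document")
    # Stack of (heading_level, node_id) used to attach each heading to its parent.
    stack = [(0, "document")]
    for i, line in enumerate(markdown.splitlines()):
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2).strip()
        node_id = f"h{i}"
        graph.add_node(node_id, label=title, level=level)
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else "document"
        graph.add_edge(parent, node_id, relation="HAS_SECTION")
        stack.append((level, node_id))
    return graph


if __name__ == "__main__":
    md = "# Report\n## Introduction\n## Results\n### Table 1\n"
    g = markdown_to_structure_graph(md)
    print(g.nodes(data=True))
    print(list(g.edges(data=True)))
```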
If you find this package useful, please consider citing our work:
@misc{sun2024docs2kg,
title={Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models},
author={Qiang Sun and Yuanyi Luo and Wenxiao Zhang and Sirui Li and Jichunyang Li and Kai Niu and Xiangrui Kong and Wei Liu},
year={2024},
eprint={2406.02962},
archivePrefix={arXiv},
primaryClass={cs.CL}
}