autogluon / autogluon-rag

Retrieval-Augmented Generation in 3 Lines of Code!
https://auto.gluon.ai/rag/dev/index.html
Apache License 2.0

Data Ingestion Module #1

Closed. shreyash2106 closed this 5 months ago

shreyash2106 commented 5 months ago

This PR contains the initial work on the data ingestion module of the RAG pipeline. The current version supports reading from a directory containing multiple PDFs. The text of each document is extracted and split into chunks according to the provided chunk_size. Multi-threading is also supported, so multiple files can be processed in parallel.
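For readers who want a rough picture of that flow, here is a minimal sketch and not the code in this PR. It assumes character-based chunking (the actual module may chunk differently, for example by tokens), and the names `chunk_text`, `ingest_directory`, `extract_text`, and `num_threads` are hypothetical. Text extraction is passed in as a callable so the sketch stays independent of any particular PDF library.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunk_size counts characters; the real module may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def ingest_directory(
    data_dir: str,
    chunk_size: int,
    extract_text: Callable[[Path], str],
    num_threads: int = 4,
) -> List[str]:
    """Read every *.pdf file in data_dir and return all chunks in file order.

    extract_text is a hypothetical hook: any function that takes a file path
    and returns the document text (for example, a wrapper around a PDF library).
    """
    paths = sorted(Path(data_dir).glob("*.pdf"))

    def process(path: Path) -> List[str]:
        return chunk_text(extract_text(path), chunk_size)

    # Each file is handled by a worker thread; map() keeps results in input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_file = list(pool.map(process, paths))

    # Flatten the per-file chunk lists into one list.
    return [chunk for chunks in per_file for chunk in chunks]
```

Threads suit this workload because reading files is mostly I/O-bound; a CPU-heavy extractor might benefit more from processes.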

This PR also sets up the usage of the AutoGluon RAG package with the agrag command. Further details are outlined in the README.

I have run black and isort on the codebase.

All unit tests for the data ingestion module pass.

cheungdaven commented 5 months ago

LGTM, thanks! @shreyash2106