This PR is for initial work on the data ingestion module in the RAG pipeline.
The current version supports reading from a directory containing multiple PDFs. The text from each document is extracted and split into chunks according to the provided chunk_size. Multi-threading is also supported, so multiple files can be processed in parallel.
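As a rough illustration of the chunking and multi-threaded processing described above, here is a minimal sketch. It is not the code in this PR: the names `chunk_text`, `ingest`, `read_text`, and `max_workers` are hypothetical, chunk_size is assumed to count characters, and PDF text extraction is left to a caller-supplied function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunk_size counts characters; the real module may count
    tokens or words instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def ingest(
    paths: List[str],
    read_text: Callable[[str], str],
    chunk_size: int,
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Read and chunk every file on a pool of worker threads.

    read_text is a hypothetical callable that returns the extracted text
    of one file (for example, a wrapper around a PDF reader).
    """
    def process(path: str) -> List[str]:
        return chunk_text(read_text(path), chunk_size)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        return dict(zip(paths, pool.map(process, paths)))
```

Passing the reader in as a function keeps the sketch independent of any particular PDF library and makes it easy to exercise with in-memory strings.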
This PR also sets up the usage of the AutoGluon RAG package with the agrag command. Further details are outlined in the README.
I have run black and isort on the codebase. All unit tests pass for the data ingestion module.