InfoSecInnovations / concierge

Repo for Concierge AI dev work
Apache License 2.0
149 stars 28 forks

Adding support for document ingestion #31

Open shipley-c opened 1 month ago

shipley-c commented 1 month ago

The system may already support some of these, but I wanted to see what could be done with adding support, either natively or by conversion, for more documents. These are things I might be able to assist with.

A process that might check URLs or source documents for updates that can remove the old version and ingest the newer version.
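
A rough sketch of how such an update check could work for local source documents, comparing a content hash against the one recorded at the last ingestion. Everything here is an assumption for illustration: `content_hash`, `plan_updates` and the `known_hashes` mapping are hypothetical names, and nothing below touches Concierge's actual storage or ingestion code.

```python
import hashlib
from pathlib import Path


def content_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in chunks so large documents don't have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_updates(paths, known_hashes):
    """Decide which sources need (re)ingesting.

    known_hashes maps str(path) -> hash recorded at the last ingestion
    (hypothetical bookkeeping, not an existing Concierge structure).
    Returns (to_ingest, to_remove) as lists of path strings.
    """
    to_ingest, to_remove = [], []
    for path in paths:
        key = str(path)
        previous = known_hashes.get(key)
        if previous is None:
            to_ingest.append(key)      # never ingested before
        elif previous != content_hash(path):
            to_remove.append(key)      # old version should be dropped
            to_ingest.append(key)      # new version replaces it
    return to_ingest, to_remove
```

The caller would delete everything in `to_remove` from the index, ingest everything in `to_ingest`, and then store the new hashes. For URLs the same comparison could be made on the downloaded bytes.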

sebovzeoueb commented 1 month ago

Hey, that would be very helpful. I've actually been browsing some of the ingestion options this week and I'm in the process of adding a plain text one to catch anything that doesn't fit the other loaders.

> A process that might check URLs or source documents for updates that can remove the old version and ingest the newer version.

I would wait a bit on this, because we've moved to OpenSearch (which will be released very soon) and we're looking at improving our schemas to make this type of operation easier (for the release after next).

The CLI already supports ingesting a whole directory, although it's not recursive (yet). We also want to add this to the GUI app. One of the issues is that, going forward, we're going to have to distinguish between running on localhost and running over the network, because the networked version requires uploading the files (which we currently do) whereas the localhost version shouldn't need to.
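
For reference, making the directory walk recursive could look roughly like this. It's a minimal sketch using only the standard library; `collect_files` and its `recursive` flag are hypothetical names and not the current CLI code.

```python
from pathlib import Path


def collect_files(directory, recursive=False):
    """Return the files in a directory, optionally descending into subdirectories."""
    root = Path(directory)
    # "*" matches direct children only; "**/*" also matches everything
    # inside nested subdirectories.
    pattern = "**/*" if recursive else "*"
    # Sort so the ingestion order is stable between runs.
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

The localhost version could pass these paths straight to the loaders, while the networked version would upload each file first.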

If you're interested in helping out with some of these issues, I would advise pulling the development branch to see how Concierge has evolved since the last release; a lot of things are very different. This branch gets quite regular updates.

I'm looking at unifying the file loading a little because I found myself having to duplicate some code between CLI and GUI, so I want to sort that out.

A good starting point would be to try to implement one of your loaders following the format of the existing ones in the loaders directory. We've been using the LangChain ones because they are generally quite easy to implement and they cover a lot of document types already.
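
To give an idea of the general shape, here is a sketch of picking a loader by file extension with a plain text fallback for anything that doesn't fit. It deliberately uses only the standard library rather than LangChain, and it does not mirror the actual format in the loaders directory: `load_plain_text`, `load_csv`, `LOADERS` and `load_document` are all hypothetical names.

```python
import csv
from pathlib import Path


def load_plain_text(path):
    """Fallback loader: treat the whole file as a single text chunk."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [{"text": text, "source": str(path)}]


def load_csv(path):
    """Example format-specific loader: one chunk per CSV row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [
            {"text": ", ".join(row), "source": str(path), "row": i}
            for i, row in enumerate(csv.reader(f))
        ]


# Hypothetical registry mapping file extensions to loader functions.
LOADERS = {".csv": load_csv}


def load_document(path):
    """Use the loader registered for the extension, else plain text."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower(), load_plain_text)
    return loader(path)
```

Adding support for a new document type would then mean writing one function and registering its extension.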

sebovzeoueb commented 1 month ago

I've made some fairly significant changes to loader implementation on the development branch. Please let me know if you see any further improvements we could make, and by all means have a bash at developing a loader!