Abraxas-365 / langchain-rust

🦜️🔗LangChain for Rust, the easiest way to write LLM-based programs in Rust
MIT License

Document Loader Options #196

Open alexanderjophus opened 1 month ago

alexanderjophus commented 1 month ago

Is your feature request related to a problem? Please describe.

My use case is a cron job that trains on a git repo. I'm using the git commit loader rather than the file loader (I'm not entirely sure this is the best choice for me).

The main thing for me is iteratively adding documents to a vector store, rather than once and done.

Describe the solution you'd like

I'd love to filter which commits are loaded, similar to how I can filter for only files of certain extensions. In my specific scenario I'd only like to load commits I've not seen before.
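To make the "only commits I've not seen before" idea concrete, here is a minimal sketch using only the standard library. `SeenCommits` and `take_unseen` are hypothetical names for illustration and are not part of langchain-rust; commit ids are modelled as plain strings.

```rust
use std::collections::HashSet;

/// Hypothetical helper that remembers which commit ids were already loaded.
#[derive(Default)]
pub struct SeenCommits {
    ids: HashSet<String>,
}

impl SeenCommits {
    /// Returns only the ids that were not seen before, in input order,
    /// and records them so that later calls skip them.
    pub fn take_unseen<I>(&mut self, ids: I) -> Vec<String>
    where
        I: IntoIterator<Item = String>,
    {
        ids.into_iter()
            // `HashSet::insert` returns true only for newly inserted values.
            .filter(|id| self.ids.insert(id.clone()))
            .collect()
    }
}
```

For a cron job, the set would presumably have to be persisted between runs (for example in a file or alongside the vector store); that part is left out here.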

Describe alternatives you've considered

A generic filter/lambda-type function that allows developers to plug in their own conditions on whether a commit should be loaded.

alexanderjophus commented 1 month ago

I think being able to provide our own filter, which we could plug in here, would be ideal: https://github.com/Abraxas-365/langchain-rust/blob/main/src/document_loaders/git_commit_loader/git_commit_loader.rs#L45

alexanderjophus commented 1 month ago

Also letting the user do their own mapping might be beneficial. I'm working on a PR that does both.
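As a rough illustration of what "both" could look like, the sketch below takes a user-supplied filter and a user-supplied mapping as closures. The `Commit` and `Doc` types and the `load_commits` function are hypothetical stand-ins, not the actual langchain-rust or PR API.

```rust
/// Hypothetical stand-in for the commit data a loader would see.
#[derive(Debug, Clone)]
pub struct Commit {
    pub id: String,
    pub message: String,
}

/// Hypothetical stand-in for the document type handed to the vector store.
#[derive(Debug, Clone, PartialEq)]
pub struct Doc {
    pub page_content: String,
}

/// Keeps the commits accepted by `filter` and converts each one with `map`.
pub fn load_commits<F, M>(commits: &[Commit], filter: F, map: M) -> Vec<Doc>
where
    F: Fn(&Commit) -> bool,
    M: Fn(&Commit) -> Doc,
{
    commits
        .iter()
        .filter(|c| filter(*c))
        .map(|c| map(c))
        .collect()
}
```

A caller could, for instance, pass `|c| c.message.starts_with("fix")` as the filter and build `page_content` from the id and message in the mapping closure.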

alexanderjophus commented 1 month ago

I've updated the PR (it's still a work in progress as I struggle my way through Rust's compiler).

The PR/ticket now is to;

The current issue I'm facing (along with many similar ones, all stemming from `gix::Repository` fields being neither `Send` nor `Sync`) is:

`RefCell<Vec<Vec<u8>>>` cannot be shared between threads safely
within `gix::Repository`, the trait `Sync` is not implemented for `RefCell<Vec<Vec<u8>>>`, which is required by `gix::revision::walk::Info<'_>: std::marker::Send`
if you want to do aliasing and mutation between multiple threads, use `std::sync::RwLock` instead
required for `&gix::Repository` to implement `std::marker::Send`
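One common way around this class of error is to avoid sharing the non-`Sync` handle at all: move only plain owned data (such as the path) into the worker, create the handle there, and send owned results back. The sketch below shows that pattern with the standard library only. `LocalRepo` is a hypothetical stand-in that, like the type in the error, contains a `RefCell`; it is not the gix API, and the commit ids it returns are fake. Whether this fits the loader's async code is an assumption. gix also ships a `ThreadSafeRepository` type that may be worth looking at for this purpose.

```rust
use std::cell::RefCell;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a repository handle with interior mutability.
/// Because of the `RefCell`, it is `Send` but not `Sync`, so `&LocalRepo`
/// cannot be sent to another thread.
pub struct LocalRepo {
    path: String,
    buffers: RefCell<Vec<Vec<u8>>>,
}

impl LocalRepo {
    pub fn open(path: &str) -> Self {
        LocalRepo {
            path: path.to_string(),
            buffers: RefCell::new(Vec::new()),
        }
    }

    /// Pretends to walk the history; mutates the internal buffer through `&self`.
    pub fn commit_ids(&self) -> Vec<String> {
        self.buffers.borrow_mut().push(self.path.as_bytes().to_vec());
        vec![format!("{}@c1", self.path), format!("{}@c2", self.path)]
    }
}

/// Opens the handle inside the worker thread, so no reference to it ever
/// crosses a thread boundary; only owned `String`s travel over the channel.
pub fn load_ids_in_background(path: String) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let repo = LocalRepo::open(&path);
        for id in repo.commit_ids() {
            tx.send(id).expect("receiver should still be alive");
        }
        // `tx` is dropped here, which lets the receiving iterator finish.
    });
    worker.join().expect("worker thread panicked");
    rx.iter().collect()
}
```

The design choice is that everything crossing the thread boundary is owned and `Send`, so the compiler never has to prove that the handle itself is `Sync`.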