Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License
25.31k stars, 2.57k forks

[FEAT]: filter data connector for only selected filetypes #1959

Closed: DipFlip closed this 1 month ago

DipFlip commented 3 months ago

What would you like to see?

Just like we can currently write a list of files to IGNORE for the GitHub/GitLab data connectors, it would be nice to be able to collect ONLY files of certain types. A typical use case is a user who wants to collect only PDF or text files scattered among lots of other filetypes.

This feature could look like a toggle next to the current file filter. A user could switch the filter between ignoring files that match the given patterns and selecting only files that match them.

timothycarambat commented 3 months ago

You should be able to use glob syntax, the way .ignore files allow you to. By that logic, inverting a pattern should work as well. So to get only the scattered PDFs:

*, !*.pdf

Untested. If it does not work that way, it should.

IMO the only reason it is like that is because of LangChain's loader, but in my head it should be a selector, not an omission control, and by default it omits nothing.

DipFlip commented 3 months ago

That's a good idea too. I've tested the GitHub data connector and can't get it to collect all files of a type. I set the ignores to * and !**/*.txt and tried to collect this test repo. It only finds files in the root folder, not in subfolders.

DipFlip commented 3 months ago

Opened an issue on langchain: https://github.com/langchain-ai/langchainjs/issues/6214

DipFlip commented 3 months ago

Fixed the issue in https://github.com/langchain-ai/langchainjs/commit/36d8479166645f2bc66e4888fb70d969b1a3c51a, so the pattern below should work once a new langchain release comes out and anything-llm is updated to use it.

This is the correct pattern to select only pdf files, including ones in subfolders:

*, !*/, !**/*.pdf
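
For readers wondering why the extra !*/ entry matters, here is a minimal Python sketch of gitignore-style matching. It is an illustration only, not the langchain loader's code: it assumes the loader follows the usual .gitignore rules (the last matching pattern wins, a leading ! re-includes, a trailing / matches directories only, and an ignored directory is never descended into). The names is_ignored, collect, select_only, and REPO are hypothetical, and the matcher is simplified to compare patterns against the last path segment only.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, is_dir: bool, patterns: list[str]) -> bool:
    """Simplified gitignore-style check: the last matching pattern wins."""
    name = path.rsplit("/", 1)[-1]  # simplification: match the basename only
    ignored = False
    for raw in patterns:
        negated = raw.startswith("!")
        pat = raw[1:] if negated else raw
        dir_only = pat.endswith("/")  # "foo/" applies to directories only
        pat = pat.rstrip("/")
        if pat.startswith("**/"):  # "**/x" means x at any depth
            pat = pat[3:]
        if dir_only and not is_dir:
            continue
        if fnmatchcase(name, pat):
            ignored = not negated
    return ignored


def collect(tree: dict, patterns: list[str], prefix: str = "") -> list[str]:
    """Walk a nested dict (dict = directory, None = file).

    An ignored directory is pruned, so nothing inside it can be re-included.
    """
    found = []
    for name, child in sorted(tree.items()):
        path = prefix + name
        is_dir = isinstance(child, dict)
        if is_ignored(path, is_dir, patterns):
            continue
        if is_dir:
            found += collect(child, patterns, path + "/")
        else:
            found.append(path)
    return found


def select_only(extensions: list[str]) -> list[str]:
    """Hypothetical helper: turn a 'select only' list into ignore patterns."""
    return ["*", "!*/"] + [f"!**/*.{ext}" for ext in extensions]


# Hypothetical repository layout used only for this illustration.
REPO = {
    "README.md": None,
    "a.pdf": None,
    "docs": {"b.pdf": None, "notes.txt": None, "deep": {"c.pdf": None}},
}

print(collect(REPO, ["*", "!**/*.pdf"]))         # root-level PDF only
print(collect(REPO, ["*", "!*/", "!**/*.pdf"]))  # PDFs in subfolders too
print(collect(REPO, select_only(["pdf", "txt"])))
```

Under these assumptions, the first pattern list returns only the root-level PDF, because * also ignores the docs directory and its contents are never visited. Adding !*/ re-includes directories so the walk can reach the nested files. The select_only helper shows one way the requested toggle could be built on top of the existing ignore list rather than as a separate mechanism.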