Closed: DipFlip closed this issue 1 month ago
You should be able to use glob syntax, the way .ignore files allow you to. By that logic, inverting a pattern should work as well.
So to get only the scattered PDFs:
*, !*.pdf
Untested. If it does not work that way, it should.
IMO the only reason it works like that is LangChain's loader, but in my head it should be a selector, not an omission control, and by default it omits nothing.
That's a good idea too. I've tested the GitHub data collector and can't get it to collect all files of a type. I set the ignores to * and !**/*.txt and tried to collect this test repo. It only finds files in the root folder, not in subfolders.
Opened an issue on LangChain: https://github.com/langchain-ai/langchainjs/issues/6214
Fixed the issue in https://github.com/langchain-ai/langchainjs/commit/36d8479166645f2bc66e4888fb70d969b1a3c51a, so the pattern below should work once a new LangChain release comes out and the dependency is updated here in anything-llm.
This is the correct pattern to select only PDF files, including ones in subfolders:
*, !*/, !**/*.pdf
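For anyone wondering why the extra !*/ entry matters, here is a rough sketch of gitignore-style matching in Python. It is not LangChain's code and not the library it uses; the function names (to_regex, parse, is_ignored, select) and the sample file list are made up for illustration, and only a small subset of the glob syntax is handled. It assumes the usual gitignore rules: the last matching pattern wins, and a file cannot be re-included if one of its parent directories is ignored.

```python
import re

def to_regex(pattern):
    """Translate one simplified gitignore-style glob into a compiled regex."""
    anchored = "/" in pattern            # a slash ties the pattern to the root
    pattern = pattern.lstrip("/")
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")       # zero or more leading directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")          # anything except a path separator
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    prefix = "" if anchored else "(?:.*/)?"   # unanchored: match at any depth
    return re.compile(prefix + "".join(out) + r"\Z")

def parse(spec):
    """Turn pattern strings into (regex, negated, dir_only) rules."""
    rules = []
    for raw in spec:
        negated = raw.startswith("!")
        pat = raw[1:] if negated else raw
        dir_only = pat.endswith("/")     # trailing slash: directories only
        rules.append((to_regex(pat.rstrip("/")), negated, dir_only))
    return rules

def entry_ignored(path, is_dir, rules):
    """Last matching rule wins; a negated rule re-includes the entry."""
    ignored = False
    for regex, negated, dir_only in rules:
        if dir_only and not is_dir:
            continue
        if regex.match(path):
            ignored = not negated
    return ignored

def is_ignored(path, rules):
    """A file is ignored if any parent directory is, or if the file itself is."""
    parts = path.split("/")
    for depth in range(1, len(parts)):
        if entry_ignored("/".join(parts[:depth]), True, rules):
            return True
    return entry_ignored(path, False, rules)

def select(files, spec):
    """Return the files that survive the ignore patterns."""
    rules = parse(spec)
    return [f for f in files if not is_ignored(f, rules)]

# Hypothetical repository listing, for illustration only.
FILES = ["a.pdf", "notes.txt", "docs/b.pdf", "docs/readme.md", "docs/deep/c.pdf"]

print(select(FILES, ["*", "!*.pdf"]))
# ['a.pdf']
print(select(FILES, ["*", "!*/", "!**/*.pdf"]))
# ['a.pdf', 'docs/b.pdf', 'docs/deep/c.pdf']
```

Under these assumed rules, * ignores the docs directory itself, so nothing inside it can be re-included until !*/ un-ignores directories again. That would explain seeing only root-level files with the shorter pattern.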
What would you like to see?
Just like we can currently write a list of files to IGNORE for the GitHub/GitLab data connectors, it would be nice to be able to collect files of ONLY certain types. A typical use case could be when a user wants to collect only PDF or text files scattered among lots of different file types.
This feature could look like a toggle next to the current file filter. A user could switch the file filter between ignoring the given patterns and selecting only files that match them.
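One way such a toggle could work without touching the loader is to rewrite the "select only" patterns into an equivalent ignore list before handing them over. The sketch below is only an illustration in Python: to_ignore_patterns and the mode values are hypothetical names, not part of anything-llm or LangChain, and it assumes the loader follows gitignore-style negation.

```python
def to_ignore_patterns(patterns, mode="ignore"):
    """Map the user's filter patterns to an ignore list for the loader.

    mode="ignore": patterns are passed through unchanged.
    mode="select": ignore everything, re-include directories so they are
                   still traversed, then re-include each selected pattern.
    """
    if mode == "ignore":
        return list(patterns)
    if mode == "select":
        return ["*", "!*/"] + ["!" + p for p in patterns]
    raise ValueError("mode must be 'ignore' or 'select'")

print(to_ignore_patterns(["**/*.pdf"], mode="select"))
# ['*', '!*/', '!**/*.pdf']
```

With this approach the existing ignore field keeps its meaning, and the toggle only changes how the patterns are expanded.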