Open iuliaturc opened 2 months ago
What file types are most prominently problematic @iuliaturc?
This can be implemented by just having a set of default exclusions that we augment with the exclusion-files parameter, correct?
To provide more context for new contributors:
When indexing the codebase, we allow the user to specify "inclusion" and "exclusion" files. A sample exclusion file is sample-exclude.txt. Each line starts with one of these directives: ext for extensions, dir for directories, and file for files. For instance, ext:.png instructs the indexing script to not include .png files in the vector database. The method that filters files based on inclusion/exclusion arguments is _should_include.
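To make the directive format concrete, here is a minimal sketch of how such an exclusion file could be parsed and applied. The names parse_exclusions and is_excluded are hypothetical and do not reflect the actual _should_include code; exact-match semantics for dir and file, and skipping lines that start with #, are assumptions.

```python
import os
from collections import defaultdict


def parse_exclusions(lines):
    """Group 'directive:value' lines by directive (ext, dir, file)."""
    rules = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks; treating '#' as a comment is an assumption
        directive, sep, value = line.partition(":")
        if sep and directive in {"ext", "dir", "file"}:
            rules[directive].add(value.strip())
    return rules


def is_excluded(path, rules):
    """Return True if path matches any ext, dir, or file rule."""
    parts = path.replace("\\", "/").split("/")
    filename = parts[-1]
    extension = os.path.splitext(filename)[1]
    if extension in rules["ext"]:
        return True
    if filename in rules["file"]:
        return True
    # Exclude when any parent directory name is listed under dir.
    return any(part in rules["dir"] for part in parts[:-1])
```

With the rule ext:.png, a call like is_excluded("assets/logo.png", rules) returns True, while a path such as src/main.py passes through.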
To exclude auto-generated files, we could do either or both of the following:
1. Use a filename pattern. For instance, file:*_auto*.* would exclude a file like scraper_auto_generated.py.
2. Add a content directive. For instance, content:THIS FILE WAS AUTOGENERATED would exclude any file that contains this string.
I would like to work on this. As per my understanding, I just need to update the _should_include() function to handle content and file:*_auto*.*, right?
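For anyone picking this up, the two proposed checks could look roughly like the sketch below. It uses Python's standard fnmatch module for the wildcard match. The helper names are hypothetical, and matching the content string case-insensitively is an assumption rather than something decided in this thread.

```python
import fnmatch
import os


def matches_file_pattern(path, patterns):
    """True if the file's base name matches any shell-style wildcard pattern."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def contains_marker(text, markers):
    """True if any marker string occurs in the text (case-insensitive, by assumption)."""
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in markers)


def is_autogenerated(path, text, file_patterns, content_markers):
    """Combine the filename check and the content check."""
    return matches_file_pattern(path, file_patterns) or contains_marker(text, content_markers)
```

The filename check is cheap because it never opens the file. The content check requires the file text, so it is usually worth running only on files that survive the ext, dir, and file filters.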
@iuliaturc Can I open a PR for this issue?
@kanakOS01 You're welcome to work on this, but the solution should be a bit more involved than simply matching "auto" in the filename. Oftentimes, you can't tell if a file is auto-generated without reading its contents and looking for specific phrases there.
@iuliaturc Right now I have implemented the functionality to exclude files matching a pattern like *_auto_*.* and files with specific content like This file was autogenerated. Should I extend the functionality so that, when checking for a pattern in the filename, it also checks for certain content/phrases in the file? These phrases can be hardcoded.
Some repos have hundreds of auto-generated files, with limited utility in actually understanding the repo.
We'll have to implement some heuristics to detect such files (based on filename and content), and offer an option to exclude them from indexing. When they dominate the repo, they can really damage retrieval quality.
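One possible shape for the content heuristic and the opt-out option is sketched below. Everything here is an assumption for illustration: the marker phrases, the 2048-byte read limit, and the names looks_autogenerated and filter_paths are hypothetical, and a phrase like "do not edit" can produce false positives.

```python
# Illustrative default phrases only; a real list would need tuning per repo.
DEFAULT_MARKERS = (
    "auto-generated",
    "autogenerated",
    "do not edit",
)


def looks_autogenerated(path, head_bytes=2048, markers=DEFAULT_MARKERS):
    """Check only the first head_bytes of the file for a marker phrase."""
    try:
        with open(path, "rb") as handle:
            head = handle.read(head_bytes)
    except OSError:
        return False  # unreadable files are left to other filters
    text = head.decode("utf-8", errors="ignore").lower()
    return any(marker in text for marker in markers)


def filter_paths(paths, exclude_autogenerated=True):
    """Drop files that look auto-generated unless the option is turned off."""
    if not exclude_autogenerated:
        return list(paths)
    return [p for p in paths if not looks_autogenerated(p)]
```

Reading only the head of each file keeps the scan cheap on large repos, on the assumption that generators place their banner near the top of the file.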