Storia-AI / sage

Chat with any codebase in under two minutes | Fully local or via third-party APIs
https://sage.storia.ai
Apache License 2.0
1.08k stars 91 forks

Feature request: Exclude auto-generated files #47

Open iuliaturc opened 2 months ago

iuliaturc commented 2 months ago

Some repos have hundreds of auto-generated files, with limited utility in actually understanding the repo.

We'll have to implement some heuristics to detect such files (based on filename and content), and offer an option to exclude them from indexing. When they dominate the repo, they can really damage retrieval quality.

mihail911 commented 2 months ago

What file types are most prominently problematic @iuliaturc?

This can be implemented by just having a set of default exclusions that we augment with the exclusion-files parameter, correct?

iuliaturc commented 1 month ago

To provide more context for new contributors:

When indexing the codebase, we allow the user to specify "inclusion" and "exclusion" files. A sample exclusion file is sample-exclude.txt. Each line starts with one of these directives: ext for extensions, dir for directories and file for files. For instance, ext:.png instructs the indexing script to not include .png files in the vector database. The method that filters files based on inclusion/exclusion arguments is _should_include.
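
To make the directive format concrete, here is a minimal sketch of how such an exclusion file could be parsed and applied with exact string matching. The names `parse_exclusions` and `is_excluded` are hypothetical and only illustrate the idea; the actual `_should_include` method in the repo may be structured differently.

```python
import os
from collections import defaultdict


def parse_exclusions(lines):
    """Group 'directive:value' lines by directive (ext, dir or file)."""
    rules = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        directive, _, value = line.partition(":")
        if directive in ("ext", "dir", "file") and value:
            rules[directive].add(value)
    return rules


def is_excluded(path, rules):
    """Exact-match check of a relative path against the parsed rules."""
    parts = path.replace("\\", "/").split("/")
    filename = parts[-1]
    _, ext = os.path.splitext(filename)
    if ext in rules["ext"]:
        return True
    if filename in rules["file"]:
        return True
    # Excluded if any parent directory name is listed under "dir".
    return any(d in rules["dir"] for d in parts[:-1])


rules = parse_exclusions(["ext:.png", "dir:node_modules", "file:package-lock.json"])
print(is_excluded("assets/logo.png", rules))
print(is_excluded("src/main.py", rules))
```

With exact matching like this, every auto-generated file would need its own `file:` line, which is why the options below are being discussed.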

To exclude auto-generated files, we could do either or both of the following:

  1. Allow patterns for these directives (not just exact string matching). For instance, file:*_auto*.* would exclude a file like scraper_auto_generated.py.
  2. Allow exclusions based on file content. We could add a content directive. For instance, content:THIS FILE WAS AUTOGENERATED would exclude any file that contains this string.
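
The two options above could be sketched roughly as follows. This assumes shell-style glob semantics via Python's standard `fnmatch` module for the `file` directive, and a case-insensitive substring test for the proposed `content` directive; both choices, and the helper names `matches_file_pattern` and `matches_content`, are assumptions rather than anything decided in this thread.

```python
import fnmatch


def matches_file_pattern(filename, patterns):
    """Option 1: treat 'file:' values as glob patterns instead of exact names."""
    return any(fnmatch.fnmatchcase(filename, p) for p in patterns)


def matches_content(text, phrases):
    """Option 2: a 'content:' value excludes any file whose text contains it.

    Case-insensitive here (an assumption), so that
    'THIS FILE WAS AUTOGENERATED' also catches 'This file was autogenerated'.
    """
    lowered = text.lower()
    return any(p.lower() in lowered for p in phrases)


print(matches_file_pattern("scraper_auto_generated.py", ["*_auto*.*"]))
print(matches_content("# This file was autogenerated.\nx = 1\n",
                      ["THIS FILE WAS AUTOGENERATED"]))
```

Note that exact filenames still work under glob matching, since a pattern without wildcards only matches itself.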

kanakOS01 commented 1 month ago

I would like to work on this. As I understand it, I just need to update the _should_include() function to handle the content directive and patterns like file:*_auto*.*, right?

kanakOS01 commented 1 month ago

@iuliaturc Can I open a PR for this issue?

iuliaturc commented 1 month ago

@kanakOS01 You're welcome to work on this, but the solution should be a bit more involved than simply matching "auto" in the filename. Oftentimes, you can't tell if a file is auto-generated without reading its contents and looking for specific phrases there.

kanakOS01 commented 1 month ago

@iuliaturc Right now I have implemented the functionality to exclude files matching a pattern like *_auto_*.* and files containing specific content like "This file was autogenerated". Should I extend it so that, in addition to checking the filename pattern, it also checks the file for certain phrases? These phrases could be hardcoded.
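
For illustration, one possible shape for a combined check is sketched below: a file is treated as auto-generated if its name matches a pattern or if the beginning of the file contains a marker phrase. The function name `looks_autogenerated`, the default patterns and phrases, and the decision to read only the first 2048 characters are all hypothetical choices made for this sketch, not something agreed on in this thread.

```python
import fnmatch
import os

# Illustrative defaults only (assumptions, not a list taken from the repo).
DEFAULT_PATTERNS = ("*_auto_*.*",)
DEFAULT_PHRASES = ("this file was autogenerated", "code generated by", "@generated")


def looks_autogenerated(path, patterns=DEFAULT_PATTERNS,
                        phrases=DEFAULT_PHRASES, head_chars=2048):
    """Return True if the filename matches a pattern or the file head has a marker."""
    filename = os.path.basename(path)
    if any(fnmatch.fnmatchcase(filename, p) for p in patterns):
        return True
    try:
        # Generated-file banners usually sit near the top, so only the head is read.
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            head = f.read(head_chars).lower()
    except OSError:
        return False  # unreadable files are not flagged by this heuristic
    return any(phrase in head for phrase in phrases)
```

Reading only the head keeps indexing cheap on large repos, at the cost of missing markers placed further down the file.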