getappmap / appmap-js

Client libraries for AppMap

Use git attributes to identify binary files #2042

Open kgilpin opened 4 weeks ago

kgilpin commented 4 weeks ago

Git metadata can indicate which files are binary. Use git attributes information as well as file extension in the context search / lookup / collector to determine which files to treat as binary.

github-actions[bot] commented 4 weeks ago

Title: Integrate Git Attributes for Binary File Classification in Context Search

Problem: The system currently relies on file extensions alone to identify binary files during context searches and lookups. This is imprecise, because an extension does not always reflect whether a file's content is binary. Git attributes let a repository explicitly declare files as binary, which gives a more accurate classification. The task is to consult git attributes alongside file extensions in the context search process.

Analysis: To identify binary files accurately, the system should combine the existing file extension check with Git's metadata. Git attributes can declare specific files as binary, and that declaration can override the default behavior based on file extensions. The git check-attr command reports the attributes applied to a file, so the system can query it to determine whether a file is declared binary. Adding this check to the current mechanism will filter out binary files more reliably during context search and lookup.
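As a rough illustration of the query described above, the sketch below parses the machine-readable output of `git check-attr -z binary -- <paths>`, which emits NUL-delimited `<path>`, `<attribute>`, `<info>` records. The function name `parseBinaryAttr` is hypothetical and not part of the appmap-js code base. It assumes that "declared as binary" means the `binary` attribute is reported as `set`.

```typescript
// Hypothetical helper: parse the output of `git check-attr -z binary -- <paths>`.
// With -z, each record is <path> NUL <attribute> NUL <info> NUL.
function parseBinaryAttr(output: string): Set<string> {
  const fields = output.split("\0");
  const binaryPaths = new Set<string>();
  // The trailing NUL leaves one empty field at the end, so only
  // complete groups of three fields are read.
  for (let i = 0; i + 2 < fields.length; i += 3) {
    const path = fields[i];
    const attribute = fields[i + 1];
    const info = fields[i + 2];
    // Assumption: only "set" counts as binary; "unset", "unspecified",
    // or any other value leaves the decision to other checks.
    if (attribute === "binary" && info === "set") {
      binaryPaths.add(path);
    }
  }
  return binaryPaths;
}
```

Note that this only reflects what .gitattributes declares. Git's content-based binary detection for diffs is not reported by check-attr, so a file with no matching attribute still needs the extension check.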

Proposed Changes:

  1. File: packages/cli/src/fulltext/FileIndex.ts

    • Modify the filterFiles function to include logic that also checks git attributes to determine if a file should be considered binary.
    • Utilize the git check-attr command within the try-catch block of the function where file filtering is performed.
    • If git attributes mark a file as binary, skip adding it to result, regardless of its extension.
  2. File: packages/cli/src/fulltext/listGitProjectFiles.ts

    • Extend the functions that collect file lists to optionally query git attribute status using git check-attr, so that each file in the returned list is annotated with its git-determined binary status where applicable.
  3. File: packages/scanner/src/lastGitOrFSModifiedDate.ts

    • Introduce a utility function that encapsulates the logic of determining a file's binary status by calling git check-attr and interpreting the result.
    • Ensure that this utility can be reused across different modules that need to identify binary files.
  4. Integration Points:

    • Update any code that relies on file extension checks for binary files to use the new binary determination mechanism based on Git attributes.
    • Use the binary determination results in the processes that collect context or conduct searches, so that they follow the improved binary file recognition.
  5. Testing:

    • Create or modify unit tests in relevant test files, such as packages/cli/tests/unit/fulltext/listGitProjectFiles.spec.ts, to verify that the binary file identification process now accurately incorporates git attributes.
    • Test scenarios should cover files flagged as binary by extension only, files flagged by git attributes only, and files where the two signals conflict.
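The combined check proposed in the list above might look roughly like the following sketch. All names here (`BINARY_EXTENSIONS`, `hasBinaryExtension`, `isBinaryFile`, `filterTextFiles`) and the extension list are hypothetical, since the issue does not show the existing filterFiles implementation. The sketch assumes that either signal is enough to treat a file as binary, which matches the rule that a git-declared binary file is skipped regardless of its extension.

```typescript
// Hypothetical extension list for illustration only; the project's
// actual list is not shown in this issue.
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".gif", ".zip", ".pdf"]);

function hasBinaryExtension(path: string): boolean {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  // dot > 0 skips dotfiles such as ".gitignore", which have no extension.
  return dot > 0 && BINARY_EXTENSIONS.has(name.slice(dot).toLowerCase());
}

// gitBinaryPaths: paths that git attributes declare as binary,
// e.g. collected from `git check-attr` output.
function isBinaryFile(path: string, gitBinaryPaths: ReadonlySet<string>): boolean {
  // Assumption: a file is binary if either git attributes or the
  // extension says so.
  return gitBinaryPaths.has(path) || hasBinaryExtension(path);
}

function filterTextFiles(paths: string[], gitBinaryPaths: ReadonlySet<string>): string[] {
  return paths.filter((path) => !isBinaryFile(path, gitBinaryPaths));
}
```

A test along the lines of the scenarios above could pass a list such as `src/index.ts`, `logo.png`, and a git-declared binary `fixtures/model.dat`, then check that only the source file is kept. Whether an explicitly unset git attribute should override a binary extension is not specified in this issue, so the sketch does not model that case.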

With these changes, context searches and file indexing classify binary files more precisely by taking Git's own binary file declarations into account.