getappmap / appmap-js

Client libraries for AppMap

File index can miss defined constants #2038

Open kgilpin opened 2 months ago

kgilpin commented 2 months ago
2024-10-03T17:03:39.011Z appmap:file-index Indexing file path src/services/chatCompletion.ts with terms src services chat completion chatcompletion ts ai ai aichat aichat availability await chat chat chat chat chat chat chat chatcompletion chatcompletion chatcompletion chatcompletion chatcompletion chatcompletion chatcompletion check checkavailability chunk chunk codemessages completion completion completion completion completion completion completion completionchunk completionchunk completionresponse dispose handle handlerequest initialize key make makechat message messages models open open openai openai port prepare preparechat random randomkey ready refresh refreshmodels request response send sendchat stream streamchat tovs url vs vscode

This file contains the following code:

  get env(): Record<string, string> {
    const pref = ChatCompletion.preferredModel;
    return {
      OPENAI_API_KEY: this.key,
      OPENAI_BASE_URL: this.url,
      APPMAP_NAVIE_TOKEN_LIMIT: String(pref?.maxInputTokens ?? 3926),
      APPMAP_NAVIE_MODEL: pref?.family ?? 'gpt-4o',
    };
  }

Note that none of the terms appmap, navie, token, or limit is present in the index.

--

The querySymbols function should identify all symbols present in the code.

We should consider simply indexing the entire file as text. We will pull in some additional matches based on code comments, but that may be a net gain anyway.
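As a rough illustration of what indexing the file as text could look like, here is a minimal sketch. The helper name textTerms is hypothetical and not part of appmap-js. It assumes we want lower-cased terms, with snake_case and camelCase identifiers split into their parts and the joined form kept as well.

```typescript
// Hypothetical helper (not part of appmap-js): turn raw file text into
// lower-cased index terms without relying on symbol extraction.
function textTerms(content: string): string[] {
  const terms: string[] = [];
  // Identifier-like runs: a letter or underscore, then letters, digits, underscores.
  const words: string[] = content.match(/[A-Za-z_][A-Za-z0-9_]*/g) || [];
  for (const word of words) {
    // Mark camelCase boundaries with "_" and then split on "_", so that
    // snake_case, SCREAMING_SNAKE_CASE and camelCase are all broken apart.
    const parts = word
      .replace(/([a-z0-9])([A-Z])/g, '$1_$2')
      .split('_')
      .filter((part) => part.length > 0);
    // Keep the joined form too when the identifier has several parts.
    if (parts.length > 1) terms.push(word.toLowerCase());
    for (const part of parts) terms.push(part.toLowerCase());
  }
  return terms;
}

const sampleLine =
  'APPMAP_NAVIE_TOKEN_LIMIT: String(pref?.maxInputTokens ?? 3926),';
console.log(textTerms(sampleLine));
```

With this approach the line from the example above would contribute appmap, navie, token and limit, at the cost of also pulling in words from comments and string literals.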

github-actions[bot] commented 2 months ago

Title: Enhance File Indexing to Include Constants in Files

Problem: The current file indexing process can miss defined constants in files. For example, constants like 'APPMAP_NAVIE_TOKEN_LIMIT' and others are not present in the index for certain files, which can lead to incomplete or inaccurate search results when trying to locate symbols and terms.

Analysis: The root cause appears to be that the querySymbols function, which identifies and extracts symbols from the code, does not capture every symbol, constants in particular. This is likely because querySymbols focuses on specific programming constructs such as classes and functions and overlooks others, such as constants or static properties.

Additionally, the suggestion has been made to consider indexing the entire file as raw text instead of relying solely on symbol extraction. This approach might extract additional useful tokens that would otherwise be ignored, especially in cases where comments or non-standard code structures define valuable information.

Proposed Changes:

  1. Enhance querySymbols function:

    • Update the querySymbols function to ensure it identifies constants and static variables along with other symbols. This might involve refining the regex patterns used for symbol extraction to capture a broader range of constructs.

  2. Index Entire File as Text:

    • Modify the file indexing process to index entire files as plain text. This can be an additional step after symbol extraction, ensuring that even if some symbols are missed by querySymbols, the raw tokens are still available in the index. This change requires updating the function responsible for reading and preparing file content for indexing.

  3. Update indexFile Method in FileIndex:

    • Incorporate the new indexing strategy into the indexFile method within the FileIndex class. This will involve processing the entire file content and including it into the terms before inserting into the database. This should be done in addition to the existing behavior rather than replacing it, to retain backward compatibility.

  4. Re-index Existing Files:

    • After the changes have been implemented, there might be a need to re-index existing files to ensure all relevant symbols and terms are captured according to the newly improved strategy.

These changes aim to improve the coverage and accuracy of the file index, thus enhancing the capabilities of the search and retrieval system across the application.
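The "in addition to the existing behavior" part of the proposal could be sketched roughly as follows. This is not the actual FileIndex code: buildTerms, SymbolExtractor and the toy onlyGetters extractor are made-up names standing in for whatever the index really uses, and the raw-text tokenizer is deliberately simplistic.

```typescript
// Stand-in type for the existing symbol extractor (an assumption, not the real API).
type SymbolExtractor = (content: string) => string[];

// Hypothetical: build the term string for a file from symbol terms plus
// raw-text terms, so tokens missed by symbol extraction still get indexed.
function buildTerms(content: string, extractSymbols: SymbolExtractor): string {
  const symbolTerms = extractSymbols(content).map((s) => s.toLowerCase());
  // Raw-text terms: lower-case, split on anything that is not a letter or digit.
  const rawTerms = content
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term.length > 0);
  return symbolTerms.concat(rawTerms).join(' ');
}

// Toy extractor that only sees getter names, to mimic a symbol query
// that misses object keys such as APPMAP_NAVIE_TOKEN_LIMIT.
const onlyGetters: SymbolExtractor = (content) => {
  const found: string[] = content.match(/\bget\s+\w+(?=\s*\()/g) || [];
  return found.map((m) => m.replace(/^get\s+/, ''));
};

const exampleSource = "get env() { return { APPMAP_NAVIE_TOKEN_LIMIT: '3926' }; }";
console.log(buildTerms(exampleSource, onlyGetters));
// → "env get env return appmap navie token limit 3926"
```

Symbol terms end up repeated because they also appear in the raw text. Whether that repetition is useful for ranking or should be de-duplicated is left open here.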

dustinbyrne commented 1 month ago

We could try tokenizing with a simple split(/[^\w\d]/), then maybe drop some language-specific keywords?
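A quick sketch of that idea, with a made-up keyword list (KEYWORDS and simpleTokens are illustrative names only; a real list would be chosen per language):

```typescript
// Illustrative keyword list only; not taken from appmap-js.
const KEYWORDS: string[] = ['const', 'return', 'get', 'this'];

// Split on non-word characters, lower-case, and drop keywords.
function simpleTokens(content: string): string[] {
  return content
    .split(/[^\w\d]/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase())
    .filter((token) => KEYWORDS.indexOf(token) === -1);
}

const exampleLine = 'const limit = { APPMAP_NAVIE_TOKEN_LIMIT: this.key };';
console.log(simpleTokens(exampleLine));
// → [ 'limit', 'appmap_navie_token_limit', 'key' ]
```

One thing to watch: \w already covers digits and the underscore, so the \d is redundant and APPMAP_NAVIE_TOKEN_LIMIT stays a single token. To get appmap, navie, token and limit as separate terms, the split would also need to break on underscores, for example with /[^a-zA-Z0-9]/.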