Open kgilpin opened 2 months ago
Title: Enhance File Indexing to Include Constants in Files
Problem: The current file indexing process can miss defined constants in files. For example, constants like 'APPMAP_NAVIE_TOKEN_LIMIT' and others are not present in the index for certain files, which can lead to incomplete or inaccurate search results when trying to locate symbols and terms.
Analysis:
The root cause of this issue seems to be that the querySymbols function, which is responsible for identifying and extracting symbols from the code, may not be capturing all the necessary symbols effectively, particularly constants defined in the code. This is likely because querySymbols focuses on specific programming constructs like classes and functions but might overlook others like constants or static properties.
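To illustrate the suspected gap, here is a minimal sketch, assuming regex-based extraction. The patterns, the extractNames helper, and the sample source (including its values) are hypothetical and are not taken from the actual querySymbols implementation.

```typescript
// Hypothetical patterns; the real querySymbols patterns may differ.
const CLASS_OR_FUNCTION = /\b(?:class|function)\s+([A-Za-z_$][\w$]*)/g;
const CONSTANT = /\b(?:const|let|var|static(?:\s+readonly)?)\s+([A-Za-z_$][\w$]*)/g;

// Collect the first capture group of every match, de-duplicated, in match order.
function extractNames(source: string, patterns: RegExp[]): string[] {
  const names = new Set<string>();
  for (const pattern of patterns) {
    pattern.lastIndex = 0; // global regexes keep state between exec() calls
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      names.add(match[1]);
    }
  }
  return Array.from(names);
}

// Illustrative sample; the values are made up.
const sample = `
export const APPMAP_NAVIE_TOKEN_LIMIT = 8000;
export class FileIndex {
  static readonly BATCH_SIZE = 100;
  indexFile() {}
}
`;

console.log(extractNames(sample, [CLASS_OR_FUNCTION]));
// → [ 'FileIndex' ]
console.log(extractNames(sample, [CLASS_OR_FUNCTION, CONSTANT]));
// → [ 'FileIndex', 'APPMAP_NAVIE_TOKEN_LIMIT', 'BATCH_SIZE' ]
```

With only the class/function pattern, the constant never reaches the index, which matches the behavior described above.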
Additionally, the suggestion has been made to consider indexing the entire file as raw text instead of relying solely on symbol extraction. This approach might extract additional useful tokens that would otherwise be ignored, especially in cases where comments or non-standard code structures define valuable information.
Proposed Changes:
1. Enhance querySymbols function: Update the querySymbols function to ensure it identifies constants and static variables along with other symbols. This might involve refining the regex patterns used for symbol extraction to capture a broader range of constructs.
2. Index Entire File as Text: Index the full file content as text so that, even when a symbol is missed by querySymbols, the raw tokens are still available in the index. This change requires updating the function responsible for reading and preparing file content for indexing.
3. Update indexFile Method in FileIndex: Modify the indexFile method within the FileIndex class. This will involve processing the entire file content and including it in the terms before inserting into the database. This should be done in addition to the existing behavior rather than replacing it, to retain backward compatibility.
4. Re-index Existing Files: Existing files need to be re-indexed so that their entries pick up the new terms.
These changes aim to improve the coverage and accuracy of the file index, thus enhancing the capabilities of the search and retrieval system across the application.
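The "in addition to the existing behavior" point could look roughly like the sketch below. It assumes the terms are stored as a single space-separated string. The buildTerms function and the SymbolExtractor type are hypothetical names, not part of FileIndex.

```typescript
// Hypothetical stand-in for the existing symbol extraction (e.g. querySymbols).
type SymbolExtractor = (content: string) => string[];

// Union of extracted symbols and raw text tokens, so a missed symbol is still indexed.
function buildTerms(content: string, extractSymbols: SymbolExtractor): string {
  // Existing behavior: start from whatever the symbol extractor finds.
  const terms = new Set<string>(extractSymbols(content));
  // Additional behavior: add every raw word-like token from the file.
  for (const token of content.split(/\W+/)) {
    if (token !== '') terms.add(token);
  }
  return Array.from(terms).join(' ');
}

// Usage with a stub extractor that only knows about classes.
const stubExtractor: SymbolExtractor = () => ['FileIndex'];
console.log(buildTerms('class FileIndex { limit = 1 }', stubExtractor));
// → "FileIndex class limit 1"
```

Because the symbols are added first and the set removes duplicates, everything the extractor already produced is still present, and the raw tokens only add to it.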
We could try tokenizing with a simple split(/[^\w\d]/), then maybe we drop some language-specific keywords?
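A rough sketch of that idea follows. The keyword list is a hypothetical placeholder. The extra split on underscores and the lowercasing are assumptions, added because the split above keeps an identifier like APPMAP_NAVIE_TOKEN_LIMIT as one token.

```typescript
// Hypothetical stoplist; which keywords to drop is still an open question.
const KEYWORDS = new Set<string>([
  'const', 'let', 'var', 'function', 'class', 'export', 'import', 'return',
]);

function tokenize(content: string, stopwords: Set<string> = KEYWORDS): string[] {
  const terms = new Set<string>();
  // The split suggested above: break on anything that is not a word character.
  for (const token of content.split(/[^\w\d]/)) {
    if (token === '' || stopwords.has(token)) continue;
    // Assumption: also break SNAKE_CASE identifiers into lowercase words.
    for (const word of token.split('_')) {
      if (word !== '') terms.add(word.toLowerCase());
    }
  }
  return Array.from(terms);
}

// The numeric value here is illustrative only.
console.log(tokenize('export const APPMAP_NAVIE_TOKEN_LIMIT = 8000;'));
// → [ 'appmap', 'navie', 'token', 'limit', '8000' ]
```

Note that \w already matches digits, so [^\w\d] behaves the same as [^\w].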
This file contains the following code:
Note that none of appmap, navie, token, or limit are present in the index.
The querySymbols function should identify all symbols present in the code. We should consider simply indexing the entire file as text. We will pull in some additional matches based on code comments, but that may be a net gain anyway.