l7-ehumes opened this issue 6 months ago
@l7-ehumes thanks for sharing this! We do index by line, so I know exactly where the error is coming from. Are these files that you would actually like to be indexed? If not, we can easily clean this up by ignoring such files or truncating the lines. Otherwise, if you do want them indexed, then I see the point about giving a proper warning.
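As a minimal sketch of the "truncating the lines" idea mentioned above: cap each line at a safe width before chunking, so the embedder never sees a single line larger than its budget. The helper name and the MAX_LINE_CHARS threshold are assumptions for illustration, not Continue's actual code:

```typescript
// Hypothetical pre-chunking pass: truncate lines that exceed a safe width.
// MAX_LINE_CHARS is an assumed threshold, not a real Continue constant.
const MAX_LINE_CHARS = 1000;

function truncateLongLines(contents: string): string {
  return contents
    .split("\n")
    .map((line) =>
      line.length > MAX_LINE_CHARS ? line.slice(0, MAX_LINE_CHARS) : line,
    )
    .join("\n");
}
```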
Personally, I wouldn't care for things like the license file to be indexed.
A bit of an issue on my codebase, and I assume many others, is that we use a monorepo approach, so it's difficult to just "look" at the repo and figure out what's needed; that's a lot of setup to get it right. I've simply excluded things like the front end due to the number of files that are an issue (.svg image files, imported .js files, etc.). I don't see much of a reason to index image files and the like, so it isn't an issue there either.
That said, I am not sure how useful the system is as I have it set up. The sheer number of indexed files might be an antipattern here. Asking @codebase doesn't often return useful things. For example, it will grab a bunch of unimplemented doc files when asking certain things.
As I mentioned, I think a good feature would be to add error catching during the indexing phase so that users have an idea of which files are an issue. That's what took me the longest time. I suppose the better solution would be a toast message for those users not willing to look in the Chromium debug tool, but that can wait for when the extension is more mature, I would guess.
So, I guess the options are:
-- Ignore such files by adding them to the .continueignore file.
-- Allow an override mechanism to force chunking of those files (a long import line in the middle of a normal file).

I also had a few files that seem to make the initial indexing phase, before we get a % or file readout, take forever. That's a different issue I will raise, I guess, but as part of this ask: I had to delete directories a few at a time to track down what was causing that phase to take forever. Turns out it was a few ~3k-line .js files. I am not sure what's causing the issue with them... That's an issue from a usability standpoint: if I don't know what the issue is, I can't fix it, or help you fix it!
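A sketch of what per-file diagnostics could look like, so that both failures and pathologically slow files get named in the log. The wrapper and its 5-second threshold are hypothetical, not part of Continue:

```typescript
// Illustrative wrapper (not Continue's actual API): time and error-guard
// the per-file indexing step so problem files are named in the log.
async function indexFileWithDiagnostics(
  path: string,
  indexFile: (path: string) => Promise<void>,
): Promise<void> {
  const start = Date.now();
  try {
    await indexFile(path);
  } catch (err) {
    console.warn(`Indexing failed for ${path}: ${err}`);
    return; // skip the file instead of aborting the whole index
  }
  const elapsed = Date.now() - start;
  if (elapsed > 5_000) {
    console.warn(`Indexing ${path} took ${elapsed} ms - consider ignoring it`);
  }
}
```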
I like the toast idea. It could get busy if many files fail, so my thought is to show a warning on the indexing indicator, so that when you hover it can show something like "3 files failed to index. Click to show more", and then we can use something like a popup to help the user add them to a .gitignore. That way it can probably turn into a nicer management screen that allows selection of entire files/folders to include/leave out.
And I hear you on the .js problem! Time to get to debugging the .wasm parsers probably : )
How do I locate the file that is causing the issue?
It would be super useful if it could say the name of the file when it errors out so that we can act on it.
My index error toast prompt is here. I'm not sure if a certain file is too long, and I can't see the log:
Error indexing codebase: Error: Invalid argument error: Values length 8448 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)
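One plausible reading of the arithmetic in that message: 8448 / 768 = 11, so the flattened values buffer holds exactly 11 embedding vectors of 768 dimensions each, while the FixedSizeList expects 768 × 768 = 589,824 values. That would mean most chunks in the batch came back without an embedding at all. This is an inference from the error text, not a confirmed diagnosis.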
me too
Also getting it despite adding to .continueignore
Error indexing codebase: Error: Invalid argument error: Values length 768 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)
Same issue here:
Error updating the vectordb::nomic-embed-text:latest index: Error: Invalid argument error: Values length 13824 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)
Between this, "Unable to load language for file", "Chunk with more than 500 tokens constructed", binary files getting indexed, and the assumption that binary files can be identified by extension (or that users will depend on .continueignore) to control indexing for complicated Unity/C# projects... I'm starting to lose hope. I really want to use Continue and deploy it in my company, but not being able to index my own files is a non-starter. Sadly, Tabby can't pull their cool RAG implementation together either.
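If the root cause is embeddings of the wrong (or zero) length reaching the vector store, a defensive filter before the write would at least name the offending files. A minimal sketch under that assumption; the types and names are illustrative, not Continue's actual code:

```typescript
interface EmbeddedChunk {
  filepath: string;
  vector: number[];
}

const EXPECTED_DIM = 768; // nomic-embed-text output size

// Keep only chunks whose vectors match the expected dimension,
// logging the rest so the user can see which files are affected.
function filterValidEmbeddings(chunks: EmbeddedChunk[]): EmbeddedChunk[] {
  return chunks.filter((chunk) => {
    if (chunk.vector.length !== EXPECTED_DIM) {
      console.warn(
        `Dropping chunk from ${chunk.filepath}: got ${chunk.vector.length} dims, expected ${EXPECTED_DIM}`,
      );
      return false;
    }
    return true;
  });
}
```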
I have a fix for this in the dev branch and I'll be pushing it to a new release in just a couple of hours here
@sestinj There isn't a commit connected to this issue, so I can't tell if it has been released in a stable build yet. I can't test for a few weeks, so feel free to close this yourself if you are confident in the fix and have released it.
Chunking is also freezing on me; some of the files have lines with over 3k characters. I managed to get past the issue temporarily by ignoring the long-line .js files entirely with the .continueignore file. Maybe a quick fix to get people over the issue could be to just ignore lines that are too long for the embedder (over 500 characters?) and leave a warning in the log?
I prevented binary files from being indexed in my C# solution (even though the .gitignore already ignores them) by using the following lines in the .continueignore file:
**/bin/**/*
**/obj/**/*
**/build/**/*
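For anyone copying these: a pattern like **/bin/**/* matches every file at any depth under any directory named bin, so the same three lines should cover build output wherever the solution nests it; the obj and build patterns work the same way.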
I think my issue is another side of this problem (although why it's ignoring my .gitignore and indexing my bundled files in the first place is another problem).
Failing at char 32769
[2024-08-08T07:39:04] Error parsing line: {"messageId":"c1eb7ad0-f97b-4137-b908-d87583738c1c","messageType":"readFile","data":"{\r\n \"format...u003c/PkgMicrosoft_SourceLink_GitHub\u003e\r\n \u003c/PropertyGroup\u003e\r\n\u003c/Project\u003e"} SyntaxError: Unexpected token { in JSON at position 32769
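For what it's worth, position 32769 is one past 32,768 = 2^15, i.e., exactly a 32 KiB boundary, so one guess is that the message is being read in fixed 32 KiB pieces and parsed before it is complete. That's an inference from the numbers only, not a confirmed cause.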
I got the "indexing error" too, but I'm not sure where to check the logs
Description
Chunking doesn't appear to work on single large lines in a file. On my system the issue seems to happen at around 900 characters wide.
The error itself:
I can work around the issue by adding things to the .continueignore file, but that's difficult because I don't get any logging on which file the index process fails on. I used rg -l '.{900,}' . to attempt to find files with long/wide lines and manually add them to the ignore file (mostly js/vue front end files).

So, what I think would be a good solution would be to be able to chunk single lines if they also cross chunk boundaries (if that's possible), but probably the most important thing would be to add logging to the console or logs so that the failing files can be manually added to the .continueignore file or fixed to avoid such wide lines.

Related to: https://github.com/continuedev/continue/issues/1163
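As a sketch of the "chunk single lines" suggestion above: when a line exceeds the chunk budget, slice it into pieces that each fit, rather than passing the whole line through. The function and the MAX_CHUNK_CHARS constant are illustrative assumptions, not from the Continue codebase:

```typescript
// Illustrative only: break a single over-long line into chunk-sized slices.
// MAX_CHUNK_CHARS is an assumed budget, not a real Continue constant.
const MAX_CHUNK_CHARS = 900;

function splitLongLine(line: string): string[] {
  const pieces: string[] = [];
  for (let i = 0; i < line.length; i += MAX_CHUNK_CHARS) {
    pieces.push(line.slice(i, i + MAX_CHUNK_CHARS));
  }
  return pieces.length > 0 ? pieces : [line];
}
```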
To reproduce
An example license file that triggers the issue: