continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Indexing fails when parsing a file with single lines that are very long/wide #1290

Open l7-ehumes opened 5 months ago

l7-ehumes commented 5 months ago


Relevant environment info

- OS: MacOS 14.5
- Continue: v0.8.27
- IDE: 1.89.1 (Universal)

Description

Chunking doesn't appear to work on files that contain single very long lines. On my system the failure seems to start at around 900 characters per line.

The error itself:

console.ts:137 [Extension Host] Error refreshing index:  Error: Invalid argument error: Values length 0 is less than the length (4096) multiplied by the value size (4096) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 4096)
    at LocalTable.add (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:89739:25)
    at async addComputedLanceDbRows (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:90033:15)
    at async _LanceDbIndex.update (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:90112:13)
    at async CodebaseIndexer.refresh (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:188535:47)
    at async _VsCodeExtension.refreshCodebaseIndex (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:331710:26)
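
One way to read this error (an assumption, not confirmed by the maintainers): the stack trace points at LanceDB's table add, and "Values length 0" suggests the embedding provider returned an empty vector for the oversized chunk while the table schema expects 4096 floats per row. A defensive shape check before inserting rows would at least make the failure name the offending input; a minimal sketch, with placeholder names:

```ts
// Hedged sketch: verify every embedding row has the expected dimensionality
// before writing the batch to the vector DB, so a single empty/short vector
// fails with a readable message instead of a FixedSizeList length error.
// The dimension (4096) matches the error above; names are placeholders.
function assertEmbeddingShapes(rows: number[][], dim = 4096): void {
  rows.forEach((vec, i) => {
    if (vec.length !== dim) {
      throw new Error(
        `Embedding row ${i} has length ${vec.length}, expected ${dim}`
      );
    }
  });
}
```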

I can work around the issue by adding entries to the .continueignore file, but that's difficult because there's no logging that tells you which file the indexing process fails on. I used rg -l '.{900,}' . to find files with long/wide lines and manually added them to the ignore file (mostly .js/.vue front-end files).

So, I think a good solution would be to split single lines when they cross chunk boundaries (if that's possible), but the most important thing would be to log the offending files to the console so they can be manually added to .continueignore or fixed to avoid such wide lines.
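
A minimal sketch of what that line-level splitting could look like, assuming a per-chunk character budget (the function and the 1,000-character limit are illustrative, not Continue's actual chunker):

```ts
// Hypothetical sketch: break any line longer than maxChars into fixed-size
// pieces before chunking/embedding, so a single very wide line can never
// exceed the embedder's input budget. Not Continue's actual implementation.
function splitLongLines(content: string, maxChars = 1000): string[] {
  const out: string[] = [];
  for (const line of content.split("\n")) {
    if (line.length <= maxChars) {
      out.push(line);
      continue;
    }
    for (let i = 0; i < line.length; i += maxChars) {
      out.push(line.slice(i, i + maxChars));
    }
  }
  return out;
}
```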

Related to: https://github.com/continuedev/continue/issues/1163

To reproduce

An example license file that triggers the issue:

{"token": "XXXXXXXXXXXXXXXX.eyJfX0FETUlOX18iOnRydWUsIl9fQU5BTFlTSVNfXyI6dHJ1XXXXXXXXXXXXXXnRydWUsIl9fQlVJTERFUlNfXyI6dHJ1ZSwiX19DT05GSUdVUkFUSU9OX18iOnXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXPQVJEX18iOnRydWUsIl9fREFUQV9fIjp0cXXXXXXXXXXXXXXXXXXXXXXXXXXXXwiX19FTlRJVElFU19fIjp0cnVlLCJfX0lBTV9fIjp0cnVlLCJfX0lOR0VTVF9fIjp0cnVlLCJfX0lOVkVOVE9SWV9fIjp0cnVlLCJfX0tOT1dMRURHRV9XXXXXXXXXXXXXXXXXXXXXXXJ1ZSwiX19MT0NBVElPTl9fIjp0cnVlLCJfX01VTFRJX0VOVElUWV9fIjpXXXXXXXXXXXXXk9KRUNUU19fIjp0cnVlLCJfX1NBTVBMRVNfXyI6dHJ1ZSwiX19TRXXXXXXXXXXXXXXXX19TVE9SRV9fIjp0XXXXXXXXXXXBTElEQVRJT05fU1RBVFVTX18iOiJERVZFTE9QTUVOVCBCVUlMRC4gTkXXXXXXXXXXXXXXXXXXX9EVUNUSU9OIFVTRS4ifQ.xrkyE5X2CJMG2TbNwlXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXqitjXXcqWjLMh68bQp5Rj3L6hGNCNILCmLmQSDU9DLKNSH2P4HtFSIrbJAhvdRUnf3uvr_XXXXXXXXXXXXXXXXXXX", "jwks": {"crv": "P-384", "y": "XXXXXXXXXXXXXgO5WRIoN5EgKEEV-JEHQmrs_Hczp_krtXXXXXXXXXXkbwF333Z", "kty": "EC", "x": "-TXXXXXXXXXXXXXXXXXXfgjqsmjhhENdfUMnZLCRReXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}}

Log output

console.ts:137 [Extension Host] Error refreshing index:  Error: Invalid argument error: Values length 0 is less than the length (4096) multiplied by the value size (4096) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 4096)
    at LocalTable.add (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:89739:25)
    at async addComputedLanceDbRows (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:90033:15)
    at async _LanceDbIndex.update (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:90112:13)
    at async CodebaseIndexer.refresh (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:188535:47)
    at async _VsCodeExtension.refreshCodebaseIndex (/Users/erichumes/.vscode/extensions/continue.continue-0.8.25-darwin-arm64/out/extension.js:331710:26)
sestinj commented 5 months ago

@l7-ehumes thanks for sharing this! We index by line, so I know exactly where the error is coming from. Are these files that you would actually like to be indexed? If not, we can easily clean this up by ignoring such files or truncating the lines. Otherwise, if you do want them indexed, I see the point about giving a proper warning.

l7-ehumes commented 5 months ago

Personally, I wouldn't care to have things like the license file indexed.

A bit of an issue with my codebase, and I assume many others, is that we use a monorepo, so it's difficult to just "look" at the repo and figure out what needs indexing; that takes a lot of setup to get right. I've simply excluded things like the front end due to the number of problem files (.svg images, imported .js files, etc.). I don't see much reason to index image files and the like, so excluding them isn't a loss either.

That said, I'm not sure how useful the system is as I have it set up. The sheer number of indexed files might be an antipattern here. Asking @codebase often doesn't return useful results; for example, it will grab a bunch of unimplemented doc files for certain questions.

As I mentioned, I think a good feature would be error catching during the indexing phase so that users have an idea of which files are a problem; figuring that out is what took me the longest. I suppose an even better solution would be a toast message for users not willing to dig through the Chromium debug tools, but that can probably wait until the extension is more mature.
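
A hedged sketch of what that per-file error catching could look like (indexFile and the surrounding names are placeholders, not Continue's internals):

```ts
// Hypothetical sketch: collect per-file indexing failures instead of letting
// one bad file abort the whole refresh, so the paths can be logged or shown
// to the user as .continueignore candidates.
async function indexWithReporting(
  files: string[],
  indexFile: (path: string) => Promise<void>
): Promise<string[]> {
  const failed: string[] = [];
  for (const path of files) {
    try {
      await indexFile(path);
    } catch (err) {
      console.error(`Indexing failed for ${path}:`, err);
      failed.push(path);
    }
  }
  return failed;
}
```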

So, I guess options are:

l7-ehumes commented 5 months ago

I also had a few files that seemed to make the initial indexing phase (before the % or file readout appears) take forever. That's a different issue I'll raise separately, but it's part of this ask: I had to delete directories a few at a time to track down what was causing that phase to stall, and it turned out to be a few ~3k-line .js files. I'm not sure what's causing the issue with them. That's a problem from a usability standpoint: if I don't know what the issue is, I can't fix it, or help you fix it!

sestinj commented 5 months ago

I like the toast idea. It could get busy if many files fail, so my thought is to show a warning on the indexing indicator, so that when you hover it can show something like "3 files failed to index. Click to show more", and then use something like a popup to help the user add them to a .gitignore. That way it can probably grow into a nicer management screen that allows selecting entire files/folders to include or leave out.

And I hear you on the .js problem! Time to get to debugging the .wasm parsers probably : )
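
A hedged sketch of that hover/warning flow using the VS Code API (illustrative only, not the extension's actual UI code):

```ts
import * as vscode from "vscode";

// Hypothetical sketch: after a refresh, warn about failed files and offer to
// open the list so the user can copy paths into .continueignore.
async function reportFailedFiles(failed: string[]): Promise<void> {
  if (failed.length === 0) return;
  const choice = await vscode.window.showWarningMessage(
    `${failed.length} files failed to index. Click to show more.`,
    "Show files"
  );
  if (choice === "Show files") {
    const doc = await vscode.workspace.openTextDocument({
      content: failed.join("\n"),
    });
    await vscode.window.showTextDocument(doc);
  }
}
```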

matbee-eth commented 5 months ago

How do I locate the file that is causing the issue?

pskl commented 5 months ago

It would be super useful if it could say the name of the file when it errors out so that we can act on it.

anrgct commented 5 months ago

My indexing error toast is shown below. I'm not sure whether a particular file is too long, and I can't see the log:

Error indexing codebase: Error: Invalid argument error: Values length 8448 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768). Click to retry
trathailoi commented 4 months ago

me too

(screenshot attached, 2024-06-23)
eyaltoledano commented 4 months ago

Also getting it despite adding to .continueignore

Error indexing codebase: Error: Invalid argument error: Values length 768 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)

JohnSmithToYou commented 4 months ago

Same issue here:

Error updating the vectordb::nomic-embed-text:latest index: Error: Invalid argument error: Values length 13824 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)

Between this, "Unable to load language for file", "Chunk with more than 500 tokens constructed", binary files getting indexed, and assuming you can identify binary files by extension or depend on .continueignore to control indexing for complicated Unity/C# projects... I'm starting to lose hope. I really want to use Continue and deploy it in my company but not being able to index my own files is a non-starter. Sadly, Tabby can't pull their cool RAG implementation together either.

sestinj commented 4 months ago

I have a fix for this in the dev branch and I'll be pushing it to a new release in just a couple of hours here

l7-ehumes commented 3 months ago

@sestinj There isn't a commit linked to this issue, so I can't tell whether the fix has been released in a stable build yet. I won't be able to test for a few weeks, so feel free to close this yourself if you're confident in the fix and have released it.

pellet commented 3 months ago

Chunking is also freezing on me; some of the files have lines with over 3k characters. I managed to get past the issue temporarily by ignoring the long-line .js files entirely with the .continueignore file. Maybe a quick fix to get people over the issue would be to simply ignore lines that are too long for the embedder (over 500 characters?) and leave a warning in the log?
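
A hedged sketch of that quick fix, using the 500-character threshold floated above (names and threshold are illustrative, not Continue's code):

```ts
// Hypothetical sketch: drop lines that exceed the embedder's practical input
// size and leave a warning in the log, rather than failing the whole index.
function filterOversizedLines(lines: string[], maxChars = 500): string[] {
  return lines.filter((line, i) => {
    if (line.length > maxChars) {
      console.warn(
        `Skipping line ${i + 1}: ${line.length} chars exceeds ${maxChars}`
      );
      return false;
    }
    return true;
  });
}
```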

pellet commented 3 months ago

Same issue here:

Error updating the vectordb::nomic-embed-text:latest index: Error: Invalid argument error: Values length 13824 is less than the length (768) multiplied by the value size (768) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 768)

Between this, "Unable to load language for file", "Chunk with more than 500 tokens constructed", binary files getting indexed, and assuming you can identify binary files by extension or depend on .continueignore to control indexing for complicated Unity/C# projects... I'm starting to lose hope. I really want to use Continue and deploy it in my company but not being able to index my own files is a non-starter. Sadly, Tabby can't pull their cool RAG implementation together either.

I prevented binary files from being indexed in my C# solution (even though the .gitignore already ignores them) by adding the following lines to the .continueignore file:

**/bin/**/*
**/obj/**/*
**/build/**/*
lilith commented 3 months ago

I think my issue is another side of this problem (although why it's ignoring my .gitignore and indexing my bundled files in the first place is another problem).

Failing at char 32769


[2024-08-08T07:39:04] Error parsing line:  {"messageId":"c1eb7ad0-f97b-4137-b908-d87583738c1c","messageType":"readFile","data":"{\r\n  \"format...u003c/PkgMicrosoft_SourceLink_GitHub\u003e\r\n  \u003c/PropertyGroup\u003e\r\n\u003c/Project\u003e"} SyntaxError: Unexpected token { in JSON at position 32769
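
One possible reading of that position (purely an assumption): 32,769 is exactly 32 KiB + 1, which would be consistent with the message being cut off at a 32,768-byte read before JSON.parse runs. A sketch of accumulating the full payload before parsing (illustrative only, not Continue's actual messaging code):

```ts
import { Readable } from "node:stream";

// Hypothetical sketch: buffer all chunks of a message before parsing, so a
// payload larger than a single read is never parsed while still truncated.
async function readJsonMessage(stream: Readable): Promise<unknown> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.from(chunk));
  }
  return JSON.parse(Buffer.concat(chunks).toString("utf8"));
}
```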
mrgoonie commented 2 months ago

I got the "indexing error" too, but I'm not sure where to check the logs. (screenshot attached)