context-labs / autodoc

Experimental toolkit for auto-generating codebase documentation using LLMs
MIT License
1.93k stars, 113 forks

Error during traversal: The text contains a special token that is not allowed #22

Open slavakurilyak opened 1 year ago

slavakurilyak commented 1 year ago

When I run doc index on the langchain repository, I receive the following error:

⠇ Processing 494 files...Error during traversal: The text contains a special token that is not allowed: <|endoftext|>
Failed to find `autodoc.config.json` file. Did you run `doc init`?
Error: The text contains a special token that is not allowed: <|endoftext|>
    at module.exports.__wbindgen_error_new (/usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:398:17)
    at wasm://wasm/00b63e2e:wasm-function[15]:0xebb8
    at wasm://wasm/00b63e2e:wasm-function[154]:0x48af5
    at Tiktoken.encode (/usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:257:18)
    at processFile (file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/commands/index/processRepository.js:24:40)
    at async file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:42:21
    at async Promise.all (index 2)
    at async dfs (file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:38:13)
    at async file:///usr/local/Cellar/node/19.8.1/lib/node_modules/@context-labs/autodoc/dist/cli/utils/traverseFileSystem.js:25:21
    at async Promise.all (index 0)

I believe this is an issue with autodoc, rather than the langchain repository, as I have followed the instructions in the README file and run doc init in the langchain repository before running doc index.

Here is some information about my environment:

Please let me know if there is any additional information I can provide or steps I can take to resolve this issue.

dahifi commented 1 year ago

I get the same problem when trying to process the microsoft/semantic-kernel repo. I managed to get things working by catching the error, but it's a hack, as I don't understand what's throwing it. The change is in src/cli/commands/index/processRepository.ts:

    let summaryLength: number;
    try {
      summaryLength = encoding.encode(summaryPrompt).length;
    } catch (error) {
      console.error(
        `Error during encoding of summary prompt: ${(error as Error).message}`,
      );
      // set summaryLength to a default value
      summaryLength = 0;
    }

    let questionLength: number;
    try {
      questionLength = encoding.encode(questionsPrompt).length;
    } catch (error) {
      console.error(
        `Error during encoding of question prompt: ${(error as Error).message}`,
      );
      // set questionLength to a default value
      questionLength = 0;
    }
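
The two try/catch blocks above could be folded into one helper. A minimal sketch, assuming only that the tokenizer exposes an `encode` method returning an array-like of tokens; the `Encoder` interface and the `safeTokenCount` name are hypothetical, not part of autodoc:

```typescript
// Hypothetical minimal shape of the tokenizer used above; only `encode` is assumed.
interface Encoder {
  encode(text: string): ArrayLike<number>;
}

// Returns the token count of `text`, or `fallback` when encoding throws
// (for example on text that contains <|endoftext|>).
function safeTokenCount(
  encoding: Encoder,
  text: string,
  label: string,
  fallback = 0,
): number {
  try {
    return encoding.encode(text).length;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Error during encoding of ${label}: ${message}`);
    return fallback;
  }
}
```

With this, the two blocks reduce to `summaryLength = safeTokenCount(encoding, summaryPrompt, "summary prompt")` and the same call for `questionsPrompt`. It is still a workaround: a fallback of 0 hides the real prompt length from whatever logic consumes it.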

slavakurilyak commented 1 year ago

For langchain, I resolved the issue by deleting docs/modules/agents/toolkits/examples/openai_openapi.yml.

For semantic-kernel, I resolved the issue by deleting dotnet/src/SemanticKernel/Connectors/OpenAI/Tokenizers/Settings/encoder.json.

This issue is related to <|endoftext|>, a special token used when interacting with OpenAI models. Since langchain and semantic-kernel contain this special token as literal text in their repos, the doc index command fails.
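
Rather than deleting files, the offending text could be detected or neutralized before it reaches the tokenizer. A minimal sketch, assuming the only problematic token is the <|endoftext|> named in the error; `findSpecialTokens` and `escapeSpecialTokens` are hypothetical helpers, not part of autodoc or tiktoken:

```typescript
// Assumption: <|endoftext|> is the only special token named in this thread;
// extend the list for the encoding you actually use.
const SPECIAL_TOKENS: readonly string[] = ["<|endoftext|>"];

// Returns the special-token strings that occur literally in `text`.
function findSpecialTokens(
  text: string,
  tokens: readonly string[] = SPECIAL_TOKENS,
): string[] {
  return tokens.filter((token) => text.includes(token));
}

// Breaks each special token apart by inserting a space after its leading "<",
// so "<|endoftext|>" becomes "< |endoftext|>" and no longer matches exactly.
// Tokens that do not contain "<|" are left unchanged.
function escapeSpecialTokens(
  text: string,
  tokens: readonly string[] = SPECIAL_TOKENS,
): string {
  return tokens.reduce(
    (result, token) => result.split(token).join(token.replace("<|", "< |")),
    text,
  );
}
```

Escaping alters the text that gets tokenized, so the resulting count is approximate for those files. Skipping any file for which `findSpecialTokens` returns a non-empty list is the more conservative option.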

Here's a possible fix: https://github.com/hwchase17/langchain/issues/923

@samheutmaker can you patch this?

samheutmaker commented 1 year ago

Sorry, have been swamped. I'll take a look at this when I get a second.