continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Skipping the URL of documents with a 404 error. #2448

Closed: loss-and-quick closed this issue 2 weeks ago

loss-and-quick commented 4 weeks ago

Relevant environment info

- OS: NixOS 24.05
- Continue: v0.8.52
- IDE: VSCodium v1.93.1
- Model: Ollama v0.3.10
- config.json:

{
  "models": [
    {
      "title": "Qwen2.5 3B",
      "provider": "ollama",
      "model": "qwen2.5:3b"
    },
    {
      "title": "SambaNova Llama 3.1 405B",
      "provider": "sambanova",
      "model": "llama3.1-405b",
      "apiKey": "xxxxxxxxxxx"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Starcoder 3b",
    "provider": "ollama",
    "model": "starcoder2:latest"
  },
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    },
    {
      "name": "issue",
      "params": {
        "repos": [
          {
            "owner": "continuedev",
            "repo": "continue"
          }
        ],
        "githubToken": "xxxxxxxx"
      }
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ],
  "embeddingsProvider": {
    // "provider": "transformers.js"
    "provider": "ollama",
    "model": "nomic-embed-text"
  },
  "docs": [
    {
      "title": "Telethon",
      "startUrl": "https://tl.telethon.dev/methods/index.html",
      "rootUrl": "https://tl.telethon.dev/index.html",
      "favicon": "https://docs.telethon.dev/favicon.ico",
      "maxDepth": 5
    }
  ],
  "disableIndexing": false,
  "allowAnonymousTelemetry": false,
  "experimental": {
    "useChromiumForDocsCrawling": false
  }
}

Description

When you use @docs, the text of 404 error pages is indexed and passed to the model as context instead of being skipped.
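
To illustrate the behavior I would expect, here is a minimal sketch of filtering crawl results by HTTP status before they are embedded. It is not Continue's actual crawler code: the `CrawlResult` type, the `isIndexable` and `dropErrorPages` helpers, and the `example.com` URL are hypothetical names made up for this example. It also assumes the site answers with a real 404 status code rather than a 200 response carrying an error page.

```typescript
// Hypothetical shape of one crawled page; not Continue's real type.
interface CrawlResult {
  url: string;
  status: number; // HTTP status code returned for this URL
  content: string; // extracted page text
}

// Treat only successful (2xx) responses as documentation worth indexing.
function isIndexable(result: CrawlResult): boolean {
  return result.status >= 200 && result.status < 300;
}

// Drop error pages (404 and other non-2xx responses) before embedding.
function dropErrorPages(results: CrawlResult[]): CrawlResult[] {
  return results.filter(isIndexable);
}

// Usage with made-up crawl results.
const crawled: CrawlResult[] = [
  {
    url: "https://tl.telethon.dev/methods/index.html",
    status: 200,
    content: "Methods index",
  },
  {
    url: "https://example.com/missing.html", // hypothetical broken link
    status: 404,
    content: "You seem to be lost!",
  },
];

const indexable = dropErrorPages(crawled);
console.log(indexable.map((page) => page.url));
```

With a filter like this, the "You seem to be lost!" text would never reach the embeddings step, so it could not end up in the prompt.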

To reproduce

  1. Add Telethon API to @docs
  2. Re-index docs
  3. Ask a question using the Telethon @docs context

Log output

Continue logs:

[Extension Host] Indexing new doc: https://tl.telethon.dev/methods/index.html
workbench.desktop.main.js:146 [Extension Host] [CheerioCrawler] Starting crawl from: https://tl.telethon.dev/methods/index.html - Max Depth: 3
workbench.desktop.main.js:146 [Extension Host] Crawl completed
workbench.desktop.main.js:146 [Extension Host] Creating embeddings for 1243 articles
workbench.desktop.main.js:146 [Extension Host] Adding 1241 embeddings to db
workbench.desktop.main.js:146 [Extension Host] Successfully indexed: https://tl.telethon.dev/methods/index.html

Prompt logs:

You seem to be lost! Don't worry, that's just Telegram's API being
    itself. Shall we go back to the Main Page?

You seem to be lost! Don't worry, that's just Telegram's API being
    itself. Shall we go back to the Main Page?

You seem to be lost! Don't worry, that's just Telegram's API being
    itself. Shall we go back to the Main Page?

You seem to be lost! Don't worry, that's just Telegram's API being
    itself. Shall we go back to the Main Page?

You seem to be lost! Don't worry, that's just Telegram's API being
    itself. Shall we go back to the Main Page?

...

Use the above documentation to answer the following question. You should not reference anything outside of what is shown, unless it is a commonly known concept. Reference URLs whenever possible using markdown formatting. If there isn't enough information to answer the question, suggest where the user might look to learn more.

sestinj commented 3 weeks ago

Thanks for pointing this out! Definitely not ideal. We are planning a final revamp of the docs indexer in the next 1-2 weeks, which I expect to solve this and a handful of other remaining small problems. I'll update here again once we've done this.

loss-and-quick commented 3 weeks ago

> Thanks for pointing this out! Definitely not ideal. We are planning a final revamp of the docs indexer in the next 1-2 weeks, which I expect to solve this and a handful of other remaining small problems. I'll update here again once we've done this.

I found a fix for the 404 error, and I also found a very strange solution in the CheerioCrawler code. I will try to open a pull request now, and I really hope I don't mess it up.