BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License

Does the gpt-crawler server always return the same site? #147

Closed kaibadash closed 4 months ago

kaibadash commented 4 months ago

I am trying gpt-crawler in server mode. The first request works, but every later request returns the same response as the first one, even when it asks for a different URL. Is there a problem with my configuration or my request?

config and server command

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://google.com",
  match: "**",
  maxPagesToCrawl: 1,
  selector: "html",
  outputFileName: "output.json",
  maxTokens: 100000,
};

$ npm run start:server

request

$ cat tmp/crawl_example.json
{
  "url": "https://example.com/",
  "match": "https://example.com/**",
  "outputFileName": "output_example.json",
  "maxPagesToCrawl": 1,
  "maxTokens": 2000
}

$ curl -XPOST http://localhost:3000/crawl  -H "Content-Type: application/json" -d@tmp/crawl_example.json
[
  {
    "title": "Example Domain",
    "url": "https://example.com/",
    "html": "Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\nMore information..."
  }
]

# The first request returns the expected result.

$ cat tmp/crawl_wikipedia.json
{
  "url": "https://wikipedia.org/",
  "match": "https://wikipedia.org/**",
  "outputFileName": "output_wikipedia.json",
  "maxPagesToCrawl": 1,
  "maxTokens": 2000
}

$ curl -XPOST http://localhost:3000/crawl  -H "Content-Type: application/json" -d@tmp/crawl_wikipedia.json
[
  {
    "title": "Example Domain",
    "url": "https://example.com/",
    "html": "Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\nMore information..."
  }
]

# The second request returns the same result as the first one, even though it asked for wikipedia.org.
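
The symptom above can also be checked in code rather than by eye. The following is a minimal sketch, not part of gpt-crawler: `CrawledPage` mirrors the fields seen in the responses above, and `pagesFromOtherHosts` is a hypothetical helper that lists returned pages whose hostname differs from the requested one. The `html` value is shortened for readability. A plain hostname comparison is an assumption and would also flag legitimate redirects (for example to a `www.` subdomain), so treat a non-empty result as a hint, not proof.

```typescript
// Shape of one entry in the JSON array returned by POST /crawl,
// based on the responses shown in this report.
interface CrawledPage {
  title: string;
  url: string;
  html: string;
}

// Hypothetical helper (not part of gpt-crawler): returns the URLs of pages
// whose hostname differs from the hostname of the requested URL.
function pagesFromOtherHosts(
  requestedUrl: string,
  pages: CrawledPage[],
): string[] {
  const expectedHost = new URL(requestedUrl).hostname;
  return pages
    .map((page) => page.url)
    .filter((pageUrl) => new URL(pageUrl).hostname !== expectedHost);
}

// The response body both curl calls returned (html shortened here).
const returnedPages: CrawledPage[] = [
  { title: "Example Domain", url: "https://example.com/", html: "Example Domain..." },
];

// First request: the returned page belongs to the requested host.
console.log(pagesFromOtherHosts("https://example.com/", returnedPages)); // []

// Second request: the returned page is left over from the first crawl.
console.log(pagesFromOtherHosts("https://wikipedia.org/", returnedPages)); // [ 'https://example.com/' ]
```

Running this check on each response from the server makes the repeated result easy to spot in a script or test.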