BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.15k stars 1.88k forks

Help, why can I only crawl the first page of a GitBook site? #91

Open wt195799611 opened 7 months ago

wt195799611 commented 7 months ago

I tried to crawl this site, but only one page was crawled.

wt195799611 commented 7 months ago

https://layerzero.gitbook.io/docs/

isarikaya commented 6 months ago

The ** pattern covers all subfolders and files below the specified point. The config should look like this:

export const defaultConfig: Config = {
  url: "https://layerzero.gitbook.io/docs",
  match: "https://layerzero.gitbook.io/docs/**",
  maxPagesToCrawl: 10,
  outputFileName: "output.json",
};
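
To illustrate what the ** pattern in match is meant to cover, here is a minimal sketch of glob matching. The globToRegExp helper is hypothetical and written only for this example; it is not the crawler's actual matching code, and it assumes that * and ** are the only special characters in the pattern.

```typescript
// Hypothetical helper for illustration only, NOT gpt-crawler's real matcher.
// Assumption: "**" matches any characters (including "/"),
// while a single "*" matches any characters except "/".
function globToRegExp(glob: string): RegExp {
  const source = glob
    .split("**")
    .map((part) =>
      part
        .split("*")
        // Escape regex metacharacters in the literal pieces.
        .map((literal) => literal.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join("[^/]*")
    )
    .join(".*");
  return new RegExp(`^${source}$`);
}

const deep = globToRegExp("https://layerzero.gitbook.io/docs/**");
const shallow = globToRegExp("https://layerzero.gitbook.io/docs/*");

// "**" follows nested paths under /docs/.
console.log(deep.test("https://layerzero.gitbook.io/docs/faq")); // true
console.log(deep.test("https://layerzero.gitbook.io/docs/a/b/c")); // true
console.log(deep.test("https://layerzero.gitbook.io/other/page")); // false

// A single "*" stops at the next "/".
console.log(shallow.test("https://layerzero.gitbook.io/docs/faq")); // true
console.log(shallow.test("https://layerzero.gitbook.io/docs/a/b/c")); // false
```

Under these assumptions, only links whose URL falls under /docs/ would be queued, which is why match needs the trailing /** to reach nested pages.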

BTNGaming commented 6 months ago

So I tried this also:

export const defaultConfig: Config = {
  url: "https://overkillgaming.com",
  match: "https://overkillgaming.com/**",
  maxPagesToCrawl: 10,
  outputFileName: "output.json",
};

The problem is that it crawls the first page and stops (it's a WordPress site).

Any resolution for this?

isarikaya commented 6 months ago

I ran it with your config and got the following result: output-1.json. Are you sure you followed all the steps correctly?

BTNGaming commented 6 months ago

100%. Too bad the output isn't at least 100 KB in size, though, haha. It's too small to upload to ChatGPT/OpenAI.