BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License

Trying to Crawl site nothing working #139

Open upup666 opened 5 months ago

upup666 commented 5 months ago

Hello there. I am trying to crawl this site: https://help.puzzlebot.top

Here is my config file:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://help.puzzlebot.top",
  match: "https://help.puzzlebot.top/article**",
  maxPagesToCrawl: 300,
  outputFileName: "output.json",
  maxTokens: 2000000,
};

But it crawls only the site's name. What should I do?

Thank you

ashkkr commented 4 months ago

This is because the Playwright-based crawler that gpt-crawler uses looks, by default, for anchor tags to identify other links to follow. The website you mentioned does not use <a> tags to link to other pages; it uses event handlers to navigate instead.

In short, it is a shortcoming of the underlying crawler, not of gpt-crawler itself.
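
To illustrate the difference, here is a minimal sketch of anchor-based link discovery. It is not the actual code of gpt-crawler or of the crawler library; `extractAnchorHrefs` is a hypothetical helper, and both HTML snippets (including the `goTo` handler name) are made up for the example. It only shows why a page that navigates through event handlers yields no links to follow.

```typescript
// Hypothetical helper (not part of gpt-crawler or Playwright): collects the
// href values of <a> tags in an HTML string, which is roughly the information
// that anchor-based link discovery depends on.
function extractAnchorHrefs(html: string): string[] {
  const hrefs: string[] = [];
  // Matches <a ... href="..."> or <a ... href='...'>; the \b after "a"
  // keeps tags such as <article> from matching.
  const anchorPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["'][^>]*>/gi;
  let match: RegExpExecArray | null;
  while ((match = anchorPattern.exec(html)) !== null) {
    hrefs.push(match[1]);
  }
  return hrefs;
}

// A page that links with anchor tags: both targets are discoverable.
const withAnchors =
  '<a href="/article/1">One</a> <a href="/article/2">Two</a>';

// A page that navigates with a click handler: there is no href to find.
const withHandlers = '<div onclick="goTo(\'/article/1\')">One</div>';

console.log(extractAnchorHrefs(withAnchors)); // [ '/article/1', '/article/2' ]
console.log(extractAnchorHrefs(withHandlers)); // []
```

With the second kind of page, the crawler finds nothing matching the `match` pattern in the config, so it stops after the start URL.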