BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.16k stars 1.88k forks source link

Adds for autoScroll for crawling the multi pages? #30

Open SOSONAGI opened 7 months ago

SOSONAGI commented 7 months ago

I just worked for our platform pages with origin code and that couldn't provide me full information on pages.

Therefore, i added autoScroll code in main.ts for this and it worked perfectly. (I think it is better than increasing the numbers of waitForSelectorTimeout.)

async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve, reject) => {
      var totalHeight = 0;
      var distance = 100;
      var timer = setInterval(() => {
        var scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}

if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl, 
          };
          await page.context().addCookies([cookie]);
        }

        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);

        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });

        await autoScroll(page);  

        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });

        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }

        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });

  await crawler.run([config.url]);
}

If you think this is good enough for crawling, hope this will be helpful for other users.

Thank you for your work btw!

I really appreciate for that!

Thank you.

franklili3 commented 6 months ago

Thank your code. How can I add this code to repo? Can you share completed code?

SOSONAGI commented 6 months ago

Just go to main.ts and copy and add above code! And run it!

franklili3 commented 6 months ago

Thanks.

---Original--- From: "SUN YOUNG @.> Date: Sat, Dec 16, 2023 18:10 PM To: @.>; Cc: @.**@.>; Subject: Re: [BuilderIO/gpt-crawler] Adds for autoScroll for crawling themulti pages? (Issue #30)

Just go to main.ts and copy and add above code! And run it!

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>