coder-hxl / x-crawl

Flexible Node.js AI-assisted crawler library
https://coder-hxl.github.io/x-crawl/
MIT License
1.54k stars 95 forks source link
ai ai-crawl chromium crawl crawler fingerprint flexible javascript multifunction nodejs puppeteer spider typescript

x-crawl · npm NPM Downloads GitHub license

English | 简体中文

x-crawl is a flexible Node.js AI-assisted crawler library. Flexible usage and powerful AI assistance functions make crawler work more efficient, intelligent and convenient.

It consists of two parts:

If you find x-crawl helpful, or you like x-crawl, you can give x-crawl repository a like on GitHub A star. Your support is the driving force for our continuous improvement! thank you for your support!

Features

Sponsors

capsolver

🎁Discover seamless automatic captcha solving with our AI-powered Auto Web Unblock technology!

123proxy

🎁企业级 HTTP 代理 IP ! 现免费测试 2-4 小时 + 15% 返现活动

AI assisted crawler

With the rapid development of network technology, website updates have become more frequent, and changes in class names or structures often bring considerable challenges to crawlers that rely on these elements. Against this background, crawlers combined with AI technology have become a powerful weapon to meet this challenge.

First of all, changes in class names or structures after website updates may cause traditional crawler strategies to fail. This is because crawlers often rely on fixed class names or structures to locate and extract the required information. Once these elements change, the crawler may not be able to accurately find the required data, thus affecting the effectiveness and accuracy of data crawling.

However, crawlers combined with AI technology are better able to cope with this change. AI can also understand and parse the semantic information of web pages through natural language processing and other technologies to more accurately extract the required data.

To sum up, crawlers combined with AI technology can better cope with the problem of class name or structure changes after website updates.

Example

The combination of crawler and AI allows the crawler and AI to obtain pictures of high-rated vacation rentals according to our instructions:

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Create AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data

  // Wait for the element to appear on the page and get the HTML
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

  // Let AI obtain image links and remove duplicates
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    `Get the image link, don't source it inside, and de-duplicate it`
  )

  browser.close()

  // crawlFile is used to crawl file resources
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  })
})

[!TIP] Can even send the whole HTML to the AI to help us operate, because the website content is more complex you also need to describe the location to get more accurately, and will consume a lot of Tokens.

Even if the subsequent update of the website causes the class name or structure to change, it can climb to the data normally, because we no longer rely on the fixed class name or structure to locate and extract the required information, but let the AI understand and parse the semantic information of the web page, so as to extract the required data more efficiently, intelligently and conveniently.

See the HTML that the AI needs to process For ease of viewing, it is formatted here ```html
第 1 页,共 3 页
第 1 页,共 3 页
```
View the srcResult (img url) returned by AI after parsing the HTML according to our instructions ```json { "elements": [ { "src": "https://z1.muscache.cn/im/pictures/miso/Hosting-45937791/original/c67d32ed-21eb-4066-8cef-650dcd45bada.jpeg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/df3493cf-39b2-46cc-9e85-7ef186980f25.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/52d375d3-5e54-444b-8186-15e61a592d9a.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/4ce87a7c-cbce-4e6e-97ea-38840518e1c4.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/miso/Hosting-661881998531696630/original/c7f7769f-e56c-4d55-8e74-06fdaf3e048d.jpeg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/miso/Hosting-50620715/original/650ba8af-3f77-41ce-8c93-0cf502a8656d.jpeg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/b899a44f-e5dd-4ee8-9116-13a5c79fb3d6.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/a2820abe-20bc-4898-a0ee-17f3c974158b.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/1f55a7c1-021f-4eb5-8e35-6473e16d7fef.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/5205dac7-dd2a-4f91-8027-a4c0e52b4fae.jpg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/miso/Hosting-792178978933830608/original/75a7613c-e435-45fb-9db4-e4163921254b.jpeg?im_w=720" }, { "src": "https://z1.muscache.cn/im/pictures/bafaacfa-1644-4a3b-9165-bcd831924cc6.jpg?im_w=720" } ], "type": "multiple" } ```

[!WARNING] x-crawl is for legal use only. Any illegal activity using this tool is prohibited. Please be sure to comply with the robots.txt file regulations of the target website. This example is only used to demonstrate the use of x-crawl and is not targeted at a specific website.

Getting Started

Documentations: English | 简体中文 | V9 English | V9 简体中文

Community: Discord

Issues

If you have questions, requirements, and good suggestions, you can raise Issues in GitHub Issues.

License

MIT License Copyright (c) 2023-present CoderHXL