coder-hxl / x-crawl

Flexible Node.js AI-assisted crawler library
https://coder-hxl.github.io/x-crawl/
MIT License
1.57k stars 95 forks

Feature that works on Windows returns no result and no error on Linux #97

Closed allmors closed 8 months ago

allmors commented 8 months ago

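Expected behavior

I expect the data to be returned normally.

Minimal reproducible example

A feature that works fine for me on Windows returns no result on Linux and reports no error either. According to the log output, the crawl completes. I am exposing it as an API through fastify, on Debian 11.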

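fastify.post('/api/screenshoot', async function handler(request, reply) {
    const { url } = request.body
    if (!url) {
        return reply.send({ code: 0, msg: "url is required" })
    }
    try {
        const buffer = await screenshoot({ url })
        // When sending non-standard data, always set the response content type explicitly.
        // The screenshot is saved with a .png path, so the buffer holds PNG data.
        reply.type("image/png")
        return buffer
    } catch (error) {
        // Log and report the failure instead of silently swallowing it
        request.log.error(error)
        return reply.code(500).send({ code: 0, msg: error.message })
    }
})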

// screenshoot model
const res = await x.crawlPage({
            url,
            maxRetry: 10,
            viewport: { width: 1920, height: 1080 },
            // Set fingerprints uniformly for the targets of this crawl
            fingerprints: [
                // Device fingerprint 1
                {
                    maxWidth: 1024,
                    maxHeight: 800,
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
                        versions: [
                            {
                                name: 'Chrome',
                                // Browser version
                                maxMajorVersion: 112,
                                minMajorVersion: 100,
                                maxMinorVersion: 20,
                                maxPatchVersion: 5000
                            },
                            {
                                name: 'Safari',
                                maxMajorVersion: 537,
                                minMajorVersion: 500,
                                maxMinorVersion: 36,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                },
                // Device fingerprint 2
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
                        versions: [
                            {
                                name: 'Chrome',
                                maxMajorVersion: 91,
                                minMajorVersion: 88,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5615
                            },
                            { name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
                            { name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
                        ]
                    }
                },
                // Device fingerprint 3
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
                        versions: [
                            {
                                name: 'Firefox',
                                maxMajorVersion: 47,
                                minMajorVersion: 43,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                }
            ]
        })

        const { browser, page } = res.data
        // Get a screenshot of the rendered page
        const buffer = await page.screenshot({ path: `../uploads/${host}_${Date.now()}.png` })
        console.log('Screen capture is complete')

        if (buffer) {
            await page.close()
        }
        // close browser
        // browser.close()

        return buffer

Error message

No error, and no result is returned.

x-crawl version

latest

Node version

20.9.0

Package manager

pnpm

Package manager version

latest

github-actions[bot] commented 8 months ago

Welcome to submit an issue for x-crawl for the first time

coder-hxl commented 8 months ago

Does const { browser, page } = res.data here actually get any data?

allmors commented 8 months ago

> Does const { browser, page } = res.data here actually get any data?

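On Linux it does not. I surfaced the exception and now roughly know where the problem is: x-crawl depends on puppeteer, and Chrome was not installed during installation, even though the puppeteer docs say it is installed automatically. As a workaround I also ran pnpm i puppeteer so that it would install Chrome automatically, but x-crawl still does not work.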

I am now going to try installing chrome-linux64 manually.
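
For the manual-install route, the following is a minimal sketch of locating a Chrome binary before launching. The helper name findChrome and the candidate paths are assumptions for illustration only and are not part of x-crawl or puppeteer; executablePath is the puppeteer launch option for pointing at a specific browser binary.

```javascript
const fs = require('node:fs')

// Hypothetical candidate locations; adjust to wherever Chrome was actually unpacked.
const CANDIDATES = [
  '/usr/bin/google-chrome',
  '/usr/bin/chromium',
  '/opt/chrome-linux64/chrome'
]

// Return the first candidate that exists on disk, or null if none does.
// `exists` is injectable so the lookup can be exercised without touching the filesystem.
function findChrome(candidates = CANDIDATES, exists = fs.existsSync) {
  for (const candidate of candidates) {
    if (exists(candidate)) return candidate
  }
  return null
}

const chromePath = findChrome()
// Only set executablePath when a binary was found; otherwise keep puppeteer's default lookup.
const launchOptions = chromePath ? { executablePath: chromePath } : {}
console.log(launchOptions)
```

If a path is found, the resulting object could be merged into whatever launch options are passed to puppeteer.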

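Update at 14:29: (already running as root) I manually installed a series of dependencies, and now I get this error: Running as root without --no-sandbox is not supported. Source of the problem: https://crbug.com/638180. With pptr the usage is const browser = await puppeteer.launch({ args: ['--no-sandbox'] });. Does x-crawl support setting this? I did not see it mentioned in the API docs.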
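
As a rough sketch of the situation described in this update, the flag can be added only when the process really runs as root. The helper name sandboxArgs is hypothetical; process.getuid is the Node.js call used to detect root, and it is not available on Windows.

```javascript
// Hypothetical helper: Chrome refuses to start as root unless --no-sandbox is passed,
// so add the flag only in that situation.
function sandboxArgs(uid = typeof process.getuid === 'function' ? process.getuid() : undefined) {
  // uid 0 is root on POSIX systems; on Windows process.getuid does not exist.
  return uid === 0 ? ['--no-sandbox'] : []
}

// Same shape as the `args` option that puppeteer.launch() accepts.
const launchOptions = { args: sandboxArgs() }
console.log(launchOptions)
```

Disabling the sandbox reduces Chrome's isolation, so running the service as a non-root user is the alternative worth considering where that is possible.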

coder-hxl commented 8 months ago

See the puppeteerLaunch option here: https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md#xcrawlconfig

coder-hxl commented 8 months ago

import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ crawlPage: { puppeteerLaunch: { args: ['--no-sandbox'] } } })

allmors commented 8 months ago

> import xCrawl from 'x-crawl'
>
> const myXCrawl = xCrawl({ crawlPage: { puppeteerLaunch: { args: ['--no-sandbox'] } } })

Yes, that is exactly what I did. Thanks for the answer, it works now. [image]

coder-hxl commented 8 months ago

@allmors OK, since it works now I will close this issue.

coder-hxl commented 8 months ago

Note that in the next version puppeteerLaunch will be renamed to puppeteerLaunchOptions. A lot of things will change once AI is added.
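
A minimal sketch of adapting an existing config object to that rename. The helper renameLaunchKey is hypothetical, and it assumes the option stays nested under crawlPage and that the key name is the only change; the thread confirms only the rename itself.

```javascript
// Hypothetical migration helper: returns a copy of a config object whose
// crawlPage.puppeteerLaunch key is renamed to puppeteerLaunchOptions.
function renameLaunchKey(config) {
  const crawlPage = config.crawlPage
  // Nothing to do when the old key is absent.
  if (!crawlPage || !('puppeteerLaunch' in crawlPage)) return config
  const { puppeteerLaunch, ...rest } = crawlPage
  return { ...config, crawlPage: { ...rest, puppeteerLaunchOptions: puppeteerLaunch } }
}

const oldConfig = { crawlPage: { puppeteerLaunch: { args: ['--no-sandbox'] } } }
console.log(JSON.stringify(renameLaunchKey(oldConfig)))
// prints {"crawlPage":{"puppeteerLaunchOptions":{"args":["--no-sandbox"]}}}
```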

allmors commented 8 months ago

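> Note that in the next version puppeteerLaunch will be renamed to puppeteerLaunchOptions. A lot of things will change once AI is added.

Will the added AI be a paid model, or is it open, meaning we call the API of an AI platform ourselves?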

coder-hxl commented 8 months ago

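It requires an OpenAI API key; under the hood it is a wrapper around openai. There are also free channels for getting an OpenAI API key, and I will post them in the docs when the time comes.

[image]

These methods are already implemented, and more may be added later.

For details, see https://github.com/coder-hxl/x-crawl/tree/embracingAI/packages/ai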

coder-hxl commented 8 months ago

@allmors Here is a small example: let the AI quickly extract the content you want.

[image] [image]

HTML passed to the AI: [image]

Result:

{
  elements: [
    {
      src: 'https://z1.muscache.cn/im/pictures/miso/Hosting-45937791/original/c67d32ed-21eb-4066-8cef-650dcd45bada.jpeg?aki_policy=large'
    },
    {
      src: 'https://z1.muscache.cn/im/pictures/df3493cf-39b2-46cc-9e85-7ef186980f25.jpg?aki_policy=large'
    },
    {
      src: 'https://z1.muscache.cn/im/pictures/52d375d3-5e54-444b-8186-15e61a592d9a.jpg?aki_policy=large'
    }
  ],
  type: 'multiple'
}

You can also pass the entire HTML to the AI and let it do the work, but that consumes more tokens.
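
A small sketch of consuming a result shaped like the one shown above. The helper collectSrcs and the example.com URLs are hypothetical; the sketch assumes only that elements is an array of objects that may carry a src string.

```javascript
// Hypothetical helper: collect the src values from a result of the shape
// { elements: [{ src }, ...], type }.
function collectSrcs(result) {
  const elements = Array.isArray(result?.elements) ? result.elements : []
  // Skip entries that have no usable src string.
  return elements.map((el) => el.src).filter((src) => typeof src === 'string')
}

const sample = {
  elements: [
    { src: 'https://example.com/a.jpg' },
    { src: 'https://example.com/b.jpg' }
  ],
  type: 'multiple'
}
console.log(collectSrcs(sample))
```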