NaiboWang / EasySpider

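A visual no-code/code-free web crawler/spider. EasySpider (易采集) is a visual browser-automation testing, data-collection, and crawler application that lets you design and run crawler tasks graphically, without writing code. Alias: ServiceWrapper, an intelligent service-wrapping system for web applications.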
https://www.easyspider.net

Guide to building the program on CentOS #236

Closed fcityyyyy closed 9 months ago

fcityyyyy commented 9 months ago

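Following the build instructions in the source code, I first built the main program, ElectronJS. I downloaded and installed the latest Chrome on CentOS; the command google-chrome-stable -version reports Google Chrome 119.0.6045.159.

As instructed, I also copied everything under /opt/google/chrome/ into ElectronJS and renamed the directory to chrome_linux64.

I also downloaded the matching version of chromedriver_linux64 and put it in chrome_linux64.

Both npm install and npm install @electron-forge/cli -g completed successfully (I switched to the taobao registry; during installation npm reported that python3 was required, so I installed Python 3.8.15, after which the commands succeeded).

However, the final step, npm run start_direct, always fails.

Running it as root reports: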

easy-spider@0.3.5 start_direct
electron .

[1120/000559.944607:FATAL:electron_main_delegate.cc(294)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.
/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/electron/dist/electron exited with signal SIGTRAP

After switching to a regular user, it reports:

easy-spider@0.3.5 start_direct
electron .

[13824:1120/000541.120354:FATAL:setuid_sandbox_host.cc(158)] The SUID sandbox helper binary was found, but is not configured correctly. Rather than run without sandboxing I'm aborting now. You need to make sure that /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/electron/dist/chrome-sandbox is owned by root and has mode 4755.
/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/electron/dist/electron exited with signal SIGTRAP

Could you please help me work out what went wrong? Many thanks!

NaiboWang commented 9 months ago

See: https://stackoverflow.com/questions/59739113/running-hello-world-electron-app-in-linux
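
The second error message above names the required state itself: chrome-sandbox must be owned by root and have mode 4755. A minimal Python sketch for checking that condition follows. The function names are hypothetical and not part of EasySpider or Electron; the path to pass in would be the chrome-sandbox path printed in the log.

```python
import os
import stat


def sandbox_bits_ok(st_uid: int, st_mode: int) -> bool:
    """Return True if the owner is root (uid 0) and the permission bits are exactly 4755."""
    return st_uid == 0 and stat.S_IMODE(st_mode) == 0o4755


def check_sandbox_helper(path: str) -> bool:
    """Stat the helper binary at `path` and apply the check above."""
    info = os.stat(path)
    return sandbox_bits_ok(info.st_uid, info.st_mode)
```

If the check fails, the usual remedy matching the error text is to run sudo chown root on the file first and sudo chmod 4755 second, since changing the owner typically clears the setuid bit.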

fcityyyyy commented 9 months ago

OK, I'll go and take a look first. Thank you very much for the reply.

fcityyyyy commented 9 months ago

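Following those instructions I fixed the permissions, and npm run start_direct now starts the main program.

[Screenshot: Snipaste_2023-11-21 18-27-213]

I can also browse tasks.

[Screenshot: Snipaste_2023-11-21 18-58-662]

But after clicking to design a task, the following errors are reported: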

GET A MESSAGE: { type: 0, message: { id: 1 } }
set socket_start
(node:18384) UnhandledPromiseRejectionWarning: Error: spawn /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chromedriver_linux64 EACCES
    at /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/selenium-webdriver/remote/index.js:260:24
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
(Use electron --trace-warnings ... to show where the warning was created)
(node:18384) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:18384) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 1)
(node:18384) UnhandledPromiseRejectionWarning: Error: spawn /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chromedriver_linux64 EACCES
    at /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/selenium-webdriver/remote/index.js:260:24
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
(node:18384) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 3)
(node:18384) UnhandledPromiseRejectionWarning: Error: spawn /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chromedriver_linux64 EACCES
    at /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/node_modules/selenium-webdriver/remote/index.js:260:24
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
(node:18384) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 4)

GET A MESSAGE: { type: 0, message: { id: 2 } }
set socket_flowchart

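Could you take another look at what is going wrong?

Running the Chrome browser on its own works fine. [Screenshot: Snipaste_2023-11-21 17-14-125]

Also, after npm run start_direct brings up the main program, the following errors appear in the terminal. I don't know whether they matter.

[user1@cent11 ElectronJS]$ npm run start_direct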

easy-spider@0.3.5 start_direct
electron .

Server has started.
server_address: http://localhost:8074
x64
/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chromedriver_linux64
/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chrome
/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/execute.sh
linux
A JavaScript error occurred in the main process
Uncaught Exception:
Error: EACCES: permission denied, open 'info.log'
[18384:1121/111727.823623:ERROR:bus.cc(399)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[18384:1121/111727.823658:ERROR:bus.cc(399)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[18384:1121/111727.846200:ERROR:bus.cc(399)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[18384:1121/111727.912438:ERROR:bus.cc(399)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")

Thank you very much for your help with all of the above.

NaiboWang commented 9 months ago

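The error message UnhandledPromiseRejectionWarning: Error: spawn [...] EACCES usually points to two main problems:

Permission problem: EACCES (permission denied) means that the chromedriver_linux64 binary does not have the execute permission set, or that the user running the Electron application lacks the necessary permissions.

Unhandled promise rejection: a promise in the code was rejected, and the rejection was not caught by a .catch handler or by a try/catch block inside an async function.

To resolve these problems, follow the steps below.

Resolving the EACCES error

Ensure execute permission: make sure the chromedriver_linux64 file is executable. You can set this by running the following command in a terminal:

   chmod +x /mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/chrome_linux64/chromedriver_linux64

Check ownership: verify that the current user is allowed to access the file. If not, use chown or sudo to change the owner or to grant the current user access to the file.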
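
For anyone who prefers to script the permission fix, here is a minimal Python sketch of roughly the same operation as chmod a+x. The function name ensure_executable is hypothetical; the caller must own the file (or be root) for os.chmod to succeed.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Add execute permission for user, group, and others if any of it is missing.

    Returns True if the mode was changed, False if all execute bits were already set.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    wanted = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if wanted == mode:
        return False
    os.chmod(path, wanted)
    return True
```

In this thread the path to pass would be the chromedriver_linux64 path shown in the EACCES message.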

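Resolving the unhandled promise rejection

Check every promise in the code: look for promises that may produce the UnhandledPromiseRejectionWarning. For each promise or asynchronous operation, make sure there is appropriate error handling, such as a .catch handler or an enclosing try/catch block.

   someAsyncFunction()
       .then((result) => {
           // handle the result
       })
       .catch((error) => {
           // handle the error
           console.error(error);
       });

Or, inside an async function:

   async function asyncCall() {
       try {
           let result = await someAsyncFunction();
           // handle the result
       } catch (error) {
           // handle the error
           console.error(error);
       }
   }

Make sure every asynchronous task in the application is properly managed and has its errors caught; this prevents unhandled promise rejection warnings.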

fcityyyyy commented 9 months ago

OK, thank you very much. I'll go through this and compare.

fcityyyyy commented 9 months ago

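Following the reply, I changed the permissions on chromedriver_linux64. Adding execute permission fixed it: the main program runs, and clicking to design a new task now works as well. Thank you very much.

[Screenshot: Snipaste_2023-11-22 51-35-582]

Following the build instructions, I then moved on to building the execution-stage program. I ran pip3 install -r requirements.txt, and everything reported success. The first time I ran python3 easyspider_executestage.py --id [1], it reported that the lxml module could not be found. pip3 list confirmed that it really was not installed in my environment, so I ran pip3 install lxml, after which pip3 list showed the library. Running python3 easyspider_executestage.py --id [1] again produced the following output: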

[user1@cent11 ExecuteStage]$ python3 easyspider_executestage.py --id [1]

Configurations:
+------------------+------+-----------------------+
| Key              | Type | Value                 |
+------------------+------+-----------------------+
| id               | list | [1]                   |
| saved_file_name  | str  |                       |
| user_data        | bool | False                 |
| config_folder    | str  |                       |
| config_file_name | str  | config.json           |
| read_type        | str  | remote                |
| headless         | bool | False                 |
| server_address   | str  | http://localhost:8074 |
| version          | str  | 0.3.5                 |
+------------------+------+-----------------------+

linux ('64bit', 'ELF')
Finding chromedriver in EasySpider /mysofts/crawler/EasySpider-0.3.5-c/ExecuteStage/ElectronJS

Absolute_user_data_folder: D:\Documents\Projects\EasySpider\ElectronJS\user_data

<selenium.webdriver.chrome.options.Options object at 0x7fb099ac03a0>
id: 1
Save Name for task ID 1 is: 2023_11_22_20_57_20_236066
任务ID 1 的保存文件名为: 2023_11_22_20_57_20_236066
remote

Cannot automatically check new version, please use the following command to check whether a new version avaliable and upgrade by pip:
pip index versions commandline_config
pip install commandline --upgrade
Task Name: 中国知网
任务名称: 中国知网
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 71, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "/usr/local/python3/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/python3/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '../ElectronJS/chrome_win64/chromedriver_win64.exe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "easyspider_executestage.py", line 1395, in <module>
    browser_t = MyChrome(
  File "/mysofts/crawler/EasySpider-0.3.5-c/ExecuteStage/myChrome.py", line 25, in __init__
    super().__init__(*args, **kwargs) # 调用父类的 __init__
  File "/usr/local/python3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/usr/local/python3/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 89, in __init__
    self.service.start()
  File "/usr/local/python3/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'chromedriver_win64.exe' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home

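The error seems to say FileNotFoundError: [Errno 2] No such file or directory: '../ElectronJS/chrome_win64/chromedriver_win64.exe', meaning the file chromedriver_win64.exe was not found. But I'm on Linux, so it should be looking for chromedriver_linux64 instead.

Did I run something incorrectly?

Could you take another look? Many thanks.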

NaiboWang commented 9 months ago

Just edit the line in the code that contains '../ElectronJS/chrome_win64/chromedriver_win64.exe' and change the path to your Linux chromedriver path.
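
As an alternative to hard-coding a single path, the choice could be keyed on the platform. The following is only an illustrative Python sketch, not the project's actual code: DRIVER_PATHS and select_driver_path are hypothetical names, and the two paths are the ones that appear in this thread.

```python
import sys

# Paths as they appear in this thread; the mapping itself is an illustrative assumption.
DRIVER_PATHS = {
    "win32": "../ElectronJS/chrome_win64/chromedriver_win64.exe",
    "linux": "../ElectronJS/chrome_linux64/chromedriver_linux64",
}


def select_driver_path(platform_name: str = sys.platform) -> str:
    """Return the chromedriver path for the given sys.platform-style name."""
    key = "linux" if platform_name.startswith("linux") else platform_name
    try:
        return DRIVER_PATHS[key]
    except KeyError:
        raise RuntimeError(f"no chromedriver path configured for {platform_name!r}") from None
```

Platforms that are not listed raise an error instead of silently falling back to the Windows path, which is the failure mode seen in the traceback above.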

fcityyyyy commented 9 months ago

OK, thank you very much. I'll go through this and compare.

fcityyyyy commented 9 months ago

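Following your reply, I changed the names and paths of chrome and chromedriver in easyspider_executestage.py.

[Screenshot: Snipaste_2023-11-23 30-14-295]

Running python3 easyspider_executestage.py --id [1] now brings up a browser window like this: [Screenshot: Snipaste_2023-11-23 30-36-678]

However, the terminal still reports that a file cannot be found:

[user1@cent11 ExecuteStage]$ python3 easyspider_executestage.py --id [1]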

Configurations:
+------------------+------+-----------------------+
| Key              | Type | Value                 |
+------------------+------+-----------------------+
| id               | list | [1]                   |
| saved_file_name  | str  |                       |
| user_data        | bool | False                 |
| config_folder    | str  |                       |
| config_file_name | str  | config.json           |
| read_type        | str  | remote                |
| headless         | bool | False                 |
| server_address   | str  | http://localhost:8074 |
| version          | str  | 0.3.5                 |
+------------------+------+-----------------------+

Cannot automatically check new version, please use the following command to check whether a new version avaliable and upgrade by pip:
pip index versions commandline_config
pip install commandline --upgrade
linux ('64bit', 'ELF')
Finding chromedriver in EasySpider /mysofts/crawler/EasySpider-0.3.5-c/ExecuteStage/ElectronJS

Absolute_user_data_folder: D:\Documents\Projects\EasySpider\ElectronJS\user_data

<selenium.webdriver.chrome.options.Options object at 0x7f36023b13a0>
id: 1
Save Name for task ID 1 is: 2023_11_23_08_25_29_830135
任务ID 1 的保存文件名为: 2023_11_23_08_25_29_830135
remote
Task Name: 中国知网
任务名称: 中国知网
Traceback (most recent call last):
  File "easyspider_executestage.py", line 1404, in <module>
    thread = BrowserThread(browser_t, i, service,
  File "easyspider_executestage.py", line 63, in __init__
    with open(stealth_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../ElectronJS/chrome_linux64/stealth.min.js'

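I checked, and that directory really does not contain this JS file, but I don't know where to get it. Could you take another look? Many thanks.

I also wondered whether packaging it directly into the main program would get around the problem, so I ran generateExecutable_Linux64.sh as described in the build instructions. It reports the following errors:

[user1@cent11 ExecuteStage]$ ./generateExecutable_Linux64.sh
rm: cannot remove 'build': No such file or directory
rm: cannot remove 'dist': No such file or directory
./generateExecutable_Linux64.sh: line 3: pyinstaller: command not found
rm: cannot remove '../ElectronJS/chrome_linux64/easyspider_executestage': No such file or directory
cp: cannot stat 'dist/easyspider_executestage': No such file or directory

I'd appreciate your help with this as well. Many thanks.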

NaiboWang commented 9 months ago

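The file is in the ElectronJS folder; just copy it to the directory the error refers to.

The packaging script further down is for Ubuntu; it can't be used interchangeably on other systems.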
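
The copy step suggested for stealth.min.js can be sketched in Python as below. copy_if_missing is a hypothetical helper, and the exact location of stealth.min.js inside the ElectronJS folder is an assumption; adjust the source path to wherever the file actually sits.

```python
import shutil
from pathlib import Path


def copy_if_missing(source: Path, target_dir: Path) -> Path:
    """Copy `source` into `target_dir` unless a file of that name is already there."""
    target = target_dir / source.name
    if not target.exists():
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return target


# Example call (paths assumed, relative to the ExecuteStage directory):
# copy_if_missing(Path("../ElectronJS/stealth.min.js"), Path("../ElectronJS/chrome_linux64"))
```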

fcityyyyy commented 9 months ago

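OK, I'll copy it over and see.

As for the packaging script: if it is meant for Ubuntu, how should it be modified for CentOS? The generateExecutable_Linux64.sh packaging script looks like this:

rm -r build
rm -r dist
pyinstaller -F --icon=favicon.ico easyspider_executestage.py
rm ../ElectronJS/chrome_linux64/easyspider_executestage
cp dist/easyspider_executestage ../ElectronJS/chrome_linux64/easyspider_executestage

Apart from the third line, these are all commands that delete or copy files, so I don't know where to start changing it. Could you give me some more guidance? Many thanks.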

NaiboWang commented 9 months ago

You don't need to package it; it is enough that it runs. If you do insist on packaging, the script can be used without changes.

fcityyyyy commented 9 months ago

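After copying stealth.min.js into chrome_linux64, I can design and save tasks normally.

[Screenshot: Snipaste_2023-11-24 43-40-675] [Screenshot: Snipaste_2023-11-24 43-59-947]

However, when I click to invoke a task, [Screenshot: Snipaste_2023-11-24 44-29-783]

it reports that execute.sh cannot be found. [Screenshot: Snipaste_2023-11-24 43-09-730]

Following the earlier advice, I looked in the ElectronJS directory but could not find this file either; I only found execute_macos.sh and execute.bat.

I tried adapting the execute_macos.sh file: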

#!/bin/bash

echo "Executing EasySpider on MacOS"

./easyspider_executestage $1 $2 $3 $4 $5 $6 $7 $8 $9

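But I found that the easyspider_executestage file does not exist either. According to the build instructions, it appears to be the file produced by building and packaging the execution stage.

I tried running the packaging command:

[user1@cent11 ExecuteStage]$ ./generateExecutable_Linux64.sh
rm: cannot remove 'build': No such file or directory
rm: cannot remove 'dist': No such file or directory
./generateExecutable_Linux64.sh: line 3: pyinstaller: command not found
rm: cannot remove '../ElectronJS/chrome_linux64/easyspider_executestage': No such file or directory
cp: cannot stat 'dist/easyspider_executestage': No such file or directory

It still reports the errors above, and I do in fact want to package the program and deploy it on a server.

Could you take another look at where my problem lies? Many thanks!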

NaiboWang commented 9 months ago

https://github.com/NaiboWang/EasySpider/releases/download/v0.3.5/EasySpider_0.3.5_Linux_x64.tar.xz

Download this package, then search it for the files you need.

fcityyyyy commented 9 months ago

OK, I'll give it a try. Many thanks.

fcityyyyy commented 9 months ago

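I searched the package as suggested and copied the two files into the corresponding directories, but it still didn't work. I then looked at execute.sh, found that the path to the executable was wrong, and changed its contents to:

#!/bin/bash

./easyspider_executestage $1 $2 $3 $4 $5 $6 $7 $8 $9

Invoking a task still doesn't work: the main program doesn't react, no browser window appears, and no data is recorded. [Screenshot: Snipaste_2023-11-28 29-05-331]

So I wondered whether I still needed to build and package the execution-stage program on CentOS itself. I went back to generateExecutable_Linux64.sh to track down the problem and found that pyinstaller could not be found. After specifying the absolute path to pyinstaller in the script, and then resolving the error about the python3 --enable-shared build option, packaging succeeded. [Screenshot: Snipaste_2023-11-28 29-48-611]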
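
The "pyinstaller: command not found" problem described above can also be sidestepped without hard-coding an absolute path. Below is a hedged Python sketch that builds the same command as line 3 of the packaging script; pyinstaller_command is a hypothetical helper, and the fallback assumes PyInstaller is installed for the interpreter running the sketch.

```python
import shutil
import sys
from typing import List


def pyinstaller_command(script: str, icon: str = "favicon.ico") -> List[str]:
    """Build the PyInstaller command line used by the packaging script.

    Uses the `pyinstaller` executable if it is on PATH; otherwise falls back
    to running the PyInstaller module with the current Python interpreter.
    """
    options = ["-F", f"--icon={icon}", script]
    if shutil.which("pyinstaller"):
        return ["pyinstaller", *options]
    return [sys.executable, "-m", "PyInstaller", *options]
```

The resulting list can be passed to subprocess.run; it does not address the --enable-shared issue, which concerns how Python itself was built.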

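The easyspider_executestage binary in the dist directory was also copied automatically into chrome_linux64. I then ran the task again, but it still didn't work; I designed a new task and ran that, and it still didn't work. [Screenshot: Snipaste_2023-11-28 40-23-706]

I also tried running python3 easyspider_executestage.py --id [2] in the ExecuteStage directory, after changing the data file location in config.json, but that didn't work either. The output is below, and no data file was generated in the directory.

[user1@cent11 ExecuteStage]$ python3 easyspider_executestage.py --id [2]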

Configurations:
+------------------+------+-----------------------+
| Key              | Type | Value                 |
+------------------+------+-----------------------+
| id               | list | [2]                   |
| saved_file_name  | str  |                       |
| user_data        | bool | False                 |
| config_folder    | str  |                       |
| config_file_name | str  | config.json           |
| read_type        | str  | remote                |
| headless         | bool | False                 |
| server_address   | str  | http://localhost:8074 |
| version          | str  | 0.3.5                 |
+------------------+------+-----------------------+

linux ('64bit', 'ELF')
Finding chromedriver in EasySpider /mysofts/crawler/EasySpider-0.3.5-c/ExecuteStage/ElectronJS

Absolute_user_data_folder: /home/user1/crawler_data

<selenium.webdriver.chrome.options.Options object at 0x7f072c6863a0>
id: 2
Save Name for task ID 2 is: 2023_11_28_12_19_34_045771
任务ID 2 的保存文件名为: 2023_11_28_12_19_34_045771
remote

Cannot automatically check new version, please use the following command to check whether a new version avaliable and upgrade by pip:
pip index versions commandline_config
pip install commandline --upgrade
Traceback (most recent call last):
  File "easyspider_executestage.py", line 1362, in <module>
    print("Task Name:", service["name"])
KeyError: 'name'


At this point I don't know where to start looking. Could you take another look? Many thanks.

NaiboWang commented 9 months ago

See: https://github.com/NaiboWang/EasySpider/issues/239

fcityyyyy commented 9 months ago

OK, I'll read it and compare.

fcityyyyy commented 9 months ago

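It turns out I had the task ID wrong: my execution_instances directory only contains 0.json and 1.json.

Once I passed the correct value, python3 easyspider_executestage.py --id [0], it worked: the data was scraped, and I could see it in the console.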
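
Since the root cause was an ID with no matching file, a quick way to see which IDs can be passed to --id is to list the numbered JSON files. This is a minimal sketch that assumes, as described above, that each execution instance is stored as N.json in the execution_instances directory; available_task_ids is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import List


def available_task_ids(instances_dir: str) -> List[int]:
    """Return the sorted integer IDs of files named <number>.json in `instances_dir`."""
    pattern = re.compile(r"^(\d+)\.json$")
    ids = []
    for entry in Path(instances_dir).iterdir():
        match = pattern.match(entry.name)
        if match and entry.is_file():
            ids.append(int(match.group(1)))
    return sorted(ids)
```

With only 0.json and 1.json present, the function returns [0, 1], so --id [2] has no instance to load.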

Running it from the command line also scrapes the data:

./chrome_linux64/easyspider_executestage --id '[0]' --user_data 0 --server_address http://localhost:8074 --config_folder "/mysofts/crawler/EasySpider-0.3.5-c/ElectronJS/" --headless 0 --read_type remote --config_file_name config.json --saved_file_name

I'm really happy. Thank you so much for your guidance and help.


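The only thing that still doesn't work is clicking the [本地直接执行] (direct local execution) button on the task page: nothing happens, and the terminal shows no errors, only the normal informational messages: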

GET A MESSAGE: { type: 5, message: { id: 2, user_data_folder: '', execute_type: 0 } }
{ id: 2, user_data_folder: '', execute_type: 0 }

GET A MESSAGE: { type: 5, message: { id: 2, user_data_folder: '', execute_type: 0 } }
{ id: 2, user_data_folder: '', execute_type: 0 }
0.json
1.json
2.json

GET A MESSAGE: { type: 5, message: { id: 3, user_data_folder: '', execute_type: 1 } }
{ id: 3, user_data_folder: '', execute_type: 1 }

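No data shows up in the data directory either.

Could this be related to my using X11 forwarding to open the application? Designing tasks works and they save normally, so I don't understand why running them doesn't. Could you please take another look? Many thanks!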

NaiboWang commented 9 months ago

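Direct local execution (本地直接执行) depends on the chrome_linux64/execute.sh file in the directory and has nothing to do with the task-design flow; at its core it is still a command-line call to that script. I haven't tested it on CentOS myself. The core code is at lines 76-78 and 341-347 of main.js in the ElectronJS folder. You can try debugging it yourself; if that doesn't work, just run tasks from the command line: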

// main.js, lines 76-78: paths to the bundled driver, browser, and launcher script
driverPath = path.join(__dirname, "chrome_linux64/chromedriver_linux64");
chromeBinaryPath = path.join(__dirname, "chrome_linux64/chrome");
execute_path = path.join(__dirname, "chrome_linux64/execute.sh");
// main.js, lines 341-347: spawn execute.sh with the task parameters and print its stdout
let spawn = require("child_process").spawn;
if (process.platform != "darwin" && msg.message.execute_type == 1 && msg.message.id != -1) {
    let child_process = spawn(execute_path, parameters);
    child_process.stdout.on('data', function (data) {
        console.log(data.toString());
    });
}

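
Before debugging main.js itself, it may help to rule out the two simplest causes on the script side: execute.sh missing or not executable. The sketch below is a hypothetical Python helper, not part of EasySpider. Note also that the excerpt above only prints the child's stdout, so anything the script writes to stderr would not appear in the terminal, and that the spawn branch in the excerpt runs only when execute_type is 1.

```python
import os
from typing import List


def diagnose_launcher(path: str) -> List[str]:
    """Return a list of obvious problems that would stop `path` from being spawned."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]
    problems = []
    if not os.access(path, os.X_OK):
        problems.append(f"{path} is not executable by the current user (try chmod +x)")
    return problems
```

An empty list means the launcher passes these basic checks, and the next step would be to run execute.sh by hand with the same parameters to see its full output.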
fcityyyyy commented 9 months ago

OK, understood. I'll try again. Many thanks.