Ciyfly / Argo

Argo is an automated general-purpose crawler for collecting website URLs. It is built on go-rod and combines static and dynamic crawling.
GNU General Public License v3.0
211 stars 23 forks

When will batch URL lists be supported? #1

Closed. hosolom closed this issue 1 year ago

Ciyfly commented 1 year ago

Got it. I've added this to the to-do list. Thanks for the feedback.

Ciyfly commented 1 year ago

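https://github.com/Ciyfly/Argo/pull/3 adds batch URL support. Targets are processed sequentially; concurrency is not supported yet. You can download the latest release to test it: https://github.com/Ciyfly/Argo/releases/tag/v1.1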

hosolom commented 1 year ago

It reports this error:

/root/argo -f url.txt --format txt
painc err: runtime error: invalid memory address or nil pointer dereference

Ciyfly commented 1 year ago

Sorry, I didn't test it properly. This bug is fixed and I've built a new release: https://github.com/Ciyfly/Argo/releases/tag/v1.2

hosolom commented 1 year ago

It hangs indefinitely once it reaches the second task. Also, /root/result/dl01.dd.com/dd.dd.com.txt is empty; could the file be skipped when there is no content?

/root/argo  -f url.txt --format txt
[2023-03-16 13:31:50] [info]  [argo start]
[2023-03-16 13:31:50] [info]  target: https://dd.dd.com
[2023-03-16 13:31:54] [info]  [GET] https://dd.dd.com/
[2023-03-16 13:31:59] [info]  [tab  count] 2
[2023-03-16 13:31:59] [info]  [  result  ] 1
[2023-03-16 13:31:59] [info]  [   save   ] /root/result/dl01.dd.com/dd.dd.com.txt
[2023-03-16 13:31:59] [info]  target: https://aa.dd.com
[2023-03-16 13:32:00] [info]  [GET] https://aa.dd.com/
[2023-03-16 14:31:59] [error]  page https://aa.dd.com error: context deadline exceeded  sourceType:  sourceUrl:

Ciyfly commented 1 year ago

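The hang happened because URLs were not checked for accessibility beforehand, so the task only ended once the default page timeout was reached. While fixing the bug above I also added the accessibility check. The build is here, give it a try: https://github.com/Ciyfly/Argo/releases/tag/v1.2.1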
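For readers who want to pre-filter a URL list themselves before handing it to a crawler, the idea of checking accessibility first can be sketched as below. This is not Argo's implementation (which is not shown in this thread); `is_reachable` and `split_targets` are hypothetical helper names, and the 5-second timeout is an arbitrary assumption.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers at all within `timeout` seconds."""
    try:
        request = urllib.request.Request(url, method="GET")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A 4xx/5xx status still proves the host is up and answering.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return False


def split_targets(
    urls: Iterable[str],
    check: Callable[[str], bool] = is_reachable,
) -> Tuple[List[str], List[str]]:
    """Split raw lines into (reachable, skipped), ignoring blank lines."""
    reachable: List[str] = []
    skipped: List[str] = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue
        (reachable if check(url) else skipped).append(url)
    return reachable, skipped
```

Passing the lines of `url.txt` to `split_targets` and writing only the reachable ones back out keeps unreachable hosts from ever reaching the browser. The `check` parameter exists so the probe can be swapped out or stubbed.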

hosolom commented 1 year ago

  1. I left it running last night and it hung again.
[2023-03-16 21:39:26] [info]  [argo start]
[2023-03-16 21:39:26] [info]  target: https://dl.baidu.com
[2023-03-16 21:39:26] [error]  The target is inaccessible https://dl01.baidu.com
[2023-03-16 21:39:26] [info]  target: https://app.baidu.com
[2023-03-16 21:39:27] [error]  The target is inaccessible https://appinfo22.baidu.com
[2023-03-16 21:39:27] [info]  target: https://big.baidu.com
[2023-03-16 21:39:28] [error]  The target is inaccessible https://bigapp-test.baidu.com
[2023-03-16 21:39:28] [info]  target: https://open.baidu.com
[2023-03-16 21:39:31] [info]  [GET] https://open.baidu.com/
[2023-03-16 21:39:34] [info]  [tab  count] 2
[2023-03-16 21:39:34] [info]  [  result  ] 1
[2023-03-16 21:39:34] [info]  [   save   ] /root/result/open.baidu.com/open.baidu.com.txt
[2023-03-16 21:39:34] [info]  target: https://appinfo.baidu.com
[2023-03-16 21:39:37] [info]  [GET] https://appinfo.baidu.com/
[2023-03-16 21:39:37] [error]  page https://appinfo.baidu.com error: navigation failed: net::ERR_ABORTED  sourceType:  sourceUrl:
  2. When the crawl result is empty, could the file be skipped? Otherwise it makes counting results inconvenient.

Ciyfly commented 1 year ago

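I just built a new release. When the result is empty, no file is generated, and some bugs are fixed. For the hang, a temporary workaround is to pass something like --tabtimeout 100 --browsertimeout 300 to cap the browser's maximum run time and force it to close. I'll look into a better approach later, and I'll also look at the pages that can't be crawled when I have time. https://github.com/Ciyfly/Argo/releases/tag/v1.2.2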
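Independently of the crawler's own timeout flags, a hard wall-clock limit can be enforced from outside the process. The sketch below is one possible way to do that and is not part of Argo; `run_with_deadline` is a hypothetical helper, and the deadline value is whatever the caller chooses.

```python
import subprocess
from typing import Optional, Sequence


def run_with_deadline(command: Sequence[str], deadline_seconds: float) -> Optional[int]:
    """Run `command`; return its exit code, or None if the deadline was hit."""
    try:
        completed = subprocess.run(command, timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None
    return completed.returncode


# Example call, using only the flags quoted in this thread:
# run_with_deadline(
#     ["/root/argo", "-f", "url.txt", "--format", "txt",
#      "--tabtimeout", "100", "--browsertimeout", "300"],
#     deadline_seconds=3600,
# )
```

Note that `subprocess.run` only kills the direct child. Browser processes started by that child may survive and need separate cleanup.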

hosolom commented 1 year ago

It still hangs. I checked, and the site is the kind that downloads a file as soon as you open it in a browser.

The curl output is:

<!DOCTYPE html>
<html>
<head>
<title>Welcome to OpenResty!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to OpenResty!</h1>
<p>If you see this page, the OpenResty web platform is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="https://openresty.org/">openresty.org</a>.<br/>
Commercial support is available at
<a href="https://openresty.com/">openresty.com</a>.</p>

<p><em>Thank you for flying OpenResty.</em></p>
</body>
</html>

Ciyfly commented 1 year ago

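The hang on download was most likely caused by a popup alert dialog. I fixed it in local testing and built a new release; you can download it and try: https://github.com/Ciyfly/Argo/releases/tag/v1.2.3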
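When diagnosing targets like this, it can help to check whether a response is a file download before pointing a browser at it. The sketch below is a rough heuristic over response headers, not what Argo does; `looks_like_download` and the `DOWNLOAD_TYPES` set are assumptions made for illustration, and whether it would have flagged the site in this thread is unknown.

```python
from typing import Mapping

# Content types that browsers usually save to disk instead of rendering.
# This list is an illustrative assumption, not an exhaustive one.
DOWNLOAD_TYPES = {
    "application/octet-stream",
    "application/zip",
    "application/x-msdownload",
    "application/vnd.android.package-archive",
}


def looks_like_download(headers: Mapping[str, str]) -> bool:
    """Guess from response headers whether a browser would download the body."""
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    disposition = normalized.get("content-disposition", "")
    if disposition.strip().startswith("attachment"):
        return True
    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = normalized.get("content-type", "").split(";")[0].strip()
    return content_type in DOWNLOAD_TYPES
```

The headers can come from any HTTP client, for example the output of `curl -I` parsed into a dict.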

Ciyfly commented 1 year ago

You're welcome to join the group chat to discuss: Argo discussion group