CrawlScript/WebCollector
WebCollector is an open-source web crawler framework written in Java. It provides simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
https://github.com/CrawlScript/WebCollector
GNU General Public License v3.0
3.07k
stars
1.45k
forks
issues
BerkeleyDBReader reads the Berkeley seed history file, but some seed information is missing and the execution count is one lower than expected
#84
haixingmu
closed
6 years ago
1
Exception when updating db, java.lang.InterruptedException,org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
#83
mdzz9527
closed
6 years ago
3
Bug with depth
#82
Aki1996
closed
6 years ago
1
Question about IP proxies
#81
zhangzhengk
closed
6 years ago
4
Config.MAX_EXECUTE_COUNT is set, but seeds that failed due to a timeout do not seem to be crawled again; why is that?
#80
haixingmu
closed
6 years ago
6
Too large a depth may cause OOM
#79
carryxyh
closed
6 years ago
1
Excessive depth causes an out-of-memory error
#78
carryxyh
closed
6 years ago
9
Getting 403 when crawling continuously; how can this be resolved?
#77
df8305909
closed
6 years ago
1
Is it possible to crawl the same URL repeatedly?
#76
weiyinfu
closed
6 years ago
1
Main content extraction problem
#75
Kaneki-x
closed
6 years ago
1
Multiple crawlers crawling simultaneously
#74
ljc930611
closed
6 years ago
5
Why can image data no longer be retrieved with WebCollector version 2.7.1?
#73
kongbb1
closed
6 years ago
1
The setUserAgent() method in HttpRequest has no effect in version 2.70
#72
yanzuo1992
closed
6 years ago
6
Remove unused variables & optimize code
#71
feifeiiiiiiiiiii
closed
6 years ago
0
Can the number of allocated threads be configured according to the actual needs of different seed types?
#70
Janus-Xu
closed
6 years ago
1
Source code comments in the packaged release are garbled
#69
Janus-Xu
closed
6 years ago
1
Is there a cluster solution and demo? Crawl speed is unsatisfactory in single-machine, single-thread testing
#68
Janus-Xu
closed
6 years ago
4
Crawl all subpages and sub-subpages of a given URL
#67
yaoyuanyy
closed
7 years ago
3
Enhance time extraction and add a gitignore file
#66
iimokay
closed
6 years ago
0
CrawlDatumFormater.class bug
#65
ljc930611
closed
7 years ago
0
Problem with adding follow-up tasks to CrawlDatums next
#64
ljc930611
closed
6 years ago
5
Missing JAR files problem
#63
ljc930611
closed
7 years ago
4
Adding a URL to CrawlDatums has no effect
#62
wuxiongliu1
closed
6 years ago
2
Can multiple regex rules be added?
#61
wuxiongliu1
closed
7 years ago
1
How can a concatenated URL be put into the URL queue from within the visited method?
#60
wuxiongliu1
closed
7 years ago
1
Which method should be used to get the returned data when calling an API endpoint?
#59
wuxiongliu1
closed
7 years ago
1
What is the current version in the Maven repository?
#58
wuxiongliu1
closed
6 years ago
1
Question about regular expressions
#57
wuxiongliu1
closed
6 years ago
1
Add serialization
#56
LinuxSuRen
closed
6 years ago
0
Fix broken headings in Markdown files
#55
bryant1410
opened
7 years ago
0
What format is the README in? It looks garbled
#54
LinuxSuRen
closed
7 years ago
0
delete the readme
#53
Junjiu
opened
7 years ago
0
Can the internal links of crawled pages be modified?
#52
zbcdj2008
closed
6 years ago
2
Request for documentation on the field attributes of crawled URL information stored in BDB
#51
newCheng
opened
7 years ago
1
The tutorial link is dead; can it be fixed?
#50
ahd763810566
closed
7 years ago
1
Setting the timeout and the number of reconnection attempts
#49
loseyou
closed
6 years ago
1
Why is Mongo no longer under plugin in the new version?
#48
bulletmarker
opened
8 years ago
0
Update ContentExtractor time parser
#47
sundy-li
opened
8 years ago
0
ContentExtractor time parsing is inaccurate
#46
sundy-li
opened
8 years ago
0
WebCollector 2.31 selector (select bug)
#45
stoneJava
closed
8 years ago
5
java.lang.NoClassDefFoundError: org/openqa/selenium/htmlunit/HtmlUnitDriver
#44
huangwenyan1225
closed
6 years ago
4
Deep crawl: error storing to berkeleydb, and memory is not released after the crawl completes
#43
123yxp123
closed
6 years ago
2
Exception in thread "main" java.lang.NoClassDefFoundError: com/sleepycat/je/EnvironmentConfig
#42
AceLee39
opened
8 years ago
1
Execution does not continue after crawlDatums.add(datum)
#41
chen28683
closed
8 years ago
1
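The crawler's start represents the crawl depth; the name is a bit misleading, since without reading the code one would assume it starts a thread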
#40
91demo
closed
8 years ago
1
A very practical problem: crawling is limited by the site's access rate limits
#39
91demo
opened
8 years ago
5
Parsing certain pages causes a deadlock, e.g. the pages listed in the issue body
#38
titibaba
opened
8 years ago
0
DemoDepthCrawler does not seem to work properly
#37
lewiswu1209
closed
8 years ago
3
WebCollector help documentation problem
#36
leiyz
closed
6 years ago
2
Does the latest version set cookies on redirect?
#35
zemochen
closed
8 years ago
1