Google1234/Information_retrieva_Projectl-
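News retrieval: a crawler performs targeted collection from 3-4 sites and implements extraction, retrieval, and indexing of web page information. At least 10 web pages are collected; results can be sorted by time, relevance, popularity, and other attributes, and similar topics are clustered automatically. Optional features: related-search recommendations, snippet generation, and result preview (hovering the mouse over a result shows a preview).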
MIT License
128 stars
36 forks
issues
Running python main.py fails on distance.so with OSError: invalid ELF header
#17
hatleon
opened
6 years ago
2
Unable to crawl news
#16
FengliLin
opened
6 years ago
0
OSError on Mac OS X 10.11
#15
QilinGu
closed
7 years ago
2
The cloud-drive link for downloading the data has expired
#14
ybxgood
closed
8 years ago
2
Need to build a forward index
#13
Google1234
opened
8 years ago
0
Adopt encoding-based compression
#12
Google1234
opened
8 years ago
0
Adopt encoding-based compression
#11
Google1234
closed
8 years ago
0
Support navigation when a keyword is clicked
#10
Google1234
opened
8 years ago
0
Overall framework: needs simplification, code standardization, and memory optimization
#9
Google1234
opened
8 years ago
0
Strip all kinds of punctuation
#8
Google1234
opened
8 years ago
0
Bug: with buff_size set to a large value (1024*1024*10) in merge_file(), no file output appears for a long time
#7
Google1234
opened
8 years ago
1
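Simplification: when merging index files, there is no need to compare doc_id when the words are equal, because the index files are generated in order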
#6
Google1234
opened
8 years ago
0
Index file bug
#5
Google1234
closed
8 years ago
1
Crawler does not stop promptly once enough pages have been crawled
#4
Google1234
opened
8 years ago
1
Unable to fetch news links from the Toutiao home page
#3
Google1234
opened
8 years ago
0
Crawling only works when run from cmd; it cannot be started by running the .py file directly
#2
Google1234
opened
8 years ago
0
Content matching sometimes misses content
#1
Google1234
closed
8 years ago
1
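The simplification proposed in issue #6 can be illustrated with a minimal sketch. It is not the project's merge_file() code: the function name `merge_runs`, the in-memory run layout (lists of `(word, postings)` pairs sorted by word), and the premise that every doc_id in an earlier run is smaller than every doc_id in a later run are all assumptions made for illustration. Under that premise, equal words can be merged by concatenating postings in run order, without ever comparing doc_id values.

```python
import heapq


def merge_runs(runs):
    """Merge sorted index runs, given in the order they were produced.

    Hypothetical layout: each run is a list of (word, [doc_id, ...]) pairs
    sorted by word. Assumption: doc_ids in an earlier run are all smaller
    than doc_ids in any later run, so postings never need re-sorting.
    """
    # Tag every entry with its run number so ties on `word` are broken
    # by run order rather than by inspecting doc_ids.
    streams = [
        [(word, run_no, postings) for word, postings in run]
        for run_no, run in enumerate(runs)
    ]

    merged = []
    for word, _run_no, postings in heapq.merge(
        *streams, key=lambda entry: (entry[0], entry[1])
    ):
        if merged and merged[-1][0] == word:
            # Same word as the previous entry: append the later run's postings.
            merged[-1][1].extend(postings)
        else:
            merged.append((word, list(postings)))
    return merged


if __name__ == "__main__":
    run_a = [("apple", [1, 3]), ("cat", [2])]
    run_b = [("apple", [5]), ("bird", [4]), ("cat", [6, 7])]
    for word, postings in merge_runs([run_a, run_b]):
        print(word, postings)
```

If the runs were not produced in doc_id order, this shortcut would no longer hold and the postings for each word would need an explicit sort or a doc_id comparison during the merge.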