Google1234 / Information_retrieva_Projectl-

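News retrieval: a focused crawler collects pages from 3-4 websites and implements extraction, indexing, and retrieval of page content. At least 10 pages are collected, results can be sorted by attributes such as time, relevance, and popularity, and similar topics are clustered automatically. Additional features that can be implemented: related-search suggestions, snippet generation, and result preview (hovering the mouse over a result shows a preview).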

MIT License

128 stars 36 forks

issues


Running python main.py fails on distance.so with OSError: invalid ELF header

#17 hatleon opened 6 years ago
2
Unable to crawl news

#16 FengliLin opened 6 years ago
0
OSError on Mac OS X 10.11

#15 QilinGu closed 7 years ago
2
The cloud-drive link for downloading the data has expired

#14 ybxgood closed 8 years ago
2
A forward index needs to be built

#13 Google1234 opened 8 years ago
0
Adopt encoding-based compression techniques

#12 Google1234 opened 8 years ago
0
Adopt encoding-based compression techniques

#11 Google1234 closed 8 years ago
0
Support navigation when a keyword is clicked

#10 Google1234 opened 8 years ago
0
The overall framework needs simplification, code standardization, and memory optimization

#9 Google1234 opened 8 years ago
0
Remove all kinds of punctuation

#8 Google1234 opened 8 years ago
0
Bug: with a large buff_size in merge_file() (1024*1024*10), no file output is produced for a long time

#7 Google1234 opened 8 years ago
1
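Simplification: when merging index files, there is no need to compare doc_id when the words are equal, because the index files are generated in order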

#6 Google1234 opened 8 years ago
0
Index file bug

#5 Google1234 closed 8 years ago
1
The crawler does not stop promptly once enough pages have been crawled

#4 Google1234 opened 8 years ago
1
Unable to fetch news links from the Toutiao homepage

#3 Google1234 opened 8 years ago
0
Crawling only works when run from cmd; it cannot be started by running the .py file directly

#2 Google1234 opened 8 years ago
0
Content matching sometimes misses content

#1 Google1234 closed 8 years ago
1
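
The simplification proposed in issue #6 can be illustrated with a minimal sketch. It is not the repository's merge_file() code: the function name merge_runs, the in-memory (word, doc_id list) layout, and the assumption that every doc_id in the older run is smaller than every doc_id in the newer run are all hypothetical, chosen only to show why equal words need no doc_id comparison when index files are produced in order.

```python
from typing import Iterator, List, Tuple

# Hypothetical in-memory layout: one (word, sorted doc_id list) pair per entry,
# with entries sorted by word. The real index files may be laid out differently.
Posting = Tuple[str, List[int]]


def merge_runs(older: List[Posting], newer: List[Posting]) -> Iterator[Posting]:
    """Merge two word-sorted index runs.

    Assumes the runs were produced in order, so every doc_id in `older`
    is smaller than every doc_id in `newer`. Under that assumption, equal
    words are merged by plain concatenation, with no doc_id comparison.
    """
    i = j = 0
    while i < len(older) and j < len(newer):
        word_a, docs_a = older[i]
        word_b, docs_b = newer[j]
        if word_a < word_b:
            yield word_a, docs_a
            i += 1
        elif word_a > word_b:
            yield word_b, docs_b
            j += 1
        else:
            # Same word: older doc_ids come first, newer ones are appended.
            yield word_a, docs_a + docs_b
            i += 1
            j += 1
    # At most one of the two runs still has entries left.
    yield from older[i:]
    yield from newer[j:]


older_run = [("apple", [1, 3]), ("cat", [2])]
newer_run = [("apple", [5]), ("bird", [4]), ("cat", [6, 7])]
merged = list(merge_runs(older_run, newer_run))
```

If the ordering assumption does not hold for a given pair of files, the concatenation step would have to be replaced by a proper merge of the two doc_id lists.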