coderLMN / AutomatedDataCollectionWithR

Reader discussion forum for "Automated Data Collection with R" (Chinese edition)

Certificate verification error on page 244 #9

Open dayushan opened 7 years ago

dayushan commented 7 years ago
.libPaths("D:/R/library")
library(RCurl)
library(XML)
library(stringr)
library(plyr)
#### The content is stored on an HTTPS server, so the location of the CA signatures must be specified
all_links <- character()
new_results <- 'government/announcements?keywords=&announcement_filter_option=all&topics%5B%5D=all&departments%5B%5D=all&world_locations%5B%5D=all&from_date=&to_date=1%2F7%2F2010'
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
while (length(new_results) > 0) {
    new_results <- str_c("https://www.gov.uk/", new_results)
    results <- getURL(new_results, cainfo = signatures)  # with ssl.verifypeer = FALSE (no verification), the announcement links still cannot be retrieved
    ## With verification enabled, this fails with: SSL certificate problem: unable to get local issuer certificate
    results_tree <- htmlParse(results)
    ## Get the links to the announcements
    all_links <- c(all_links, xpathSApply(results_tree, '//li[@id]//a', xmlGetAttr, 'href'))
    ## Get the link to the next results page
    new_results <- xpathSApply(results_tree, '//nav[@id="show-more-documents"]//li[@class="next"]//a', xmlGetAttr, "href")
}
all_links[1]
length(all_links)

With certificate verification enabled, I get this error:

Error in function (type, msg, asError = TRUE)  : 
SSL certificate problem: unable to get local issuer certificate

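Without certificate verification, the announcement links cannot be downloaded either, i.e. all_links is list(). How should I resolve the certificate verification error in the RCurl package?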

coderLMN commented 7 years ago

I ran this code successfully, except that I did not use the first line, .libPaths("D:/R/library"). The result was:

all_links[1]
[1] "/government/speeches/secretary-of-state-for-culture-media-and-sport-written-statement-on-exercise-of-functions-under-the-public-libraries-and-museums-act-1964"
length(all_links)
[1] 2097

dayushan commented 7 years ago

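The statement .libPaths("D:/R/library") sets the library location, and using it causes no problem. But after the code finishes running I get an error [screenshot]. This is RGUI version 3.2.5. What is the cause, and how can I fix it?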

coderLMN commented 7 years ago

I have not run into this problem, so I cannot debug it. But I looked up the error message and found this stackoverflow post: http://stackoverflow.com/questions/22537180/error-while-publishing-in-r-pubs , which says:

Add an .Rprofile file in the directory you are sending from and place this line:

options(rpubs.upload.method = "internal")

in the .Rprofile or RProfile.site files.

You could give that a try.

dayushan commented 7 years ago

[three screenshots]

dayushan commented 7 years ago

OK, the problem is solved. Thanks!

coderLMN commented 7 years ago

Just saw this. How did you solve it?

dayushan commented 7 years ago

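I stopped using the RCurl package for CA verification and switched to the rvest package. However, it cannot crawl all of the links; it reports Error in open.connection(x, "rb") : Timeout was reached. In addition, with RCurl the result of getURL(url) can be written to an .html file with write() (see p. 244, the paragraph just before section 10.2 on processing the text data), but I cannot find a function in rvest that does this. What should I do?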

coderLMN commented 7 years ago

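I have not used the rvest package. My feeling is that your earlier error is still caused by a wrong path to .Rprofile or RProfile.site, or else the "site" in RProfile.site should be replaced with the website's domain name.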

Whatever the cause, I suggest you stick with RCurl, because this package is used heavily throughout the book.

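If you are unsure about the path, then according to the comments on that stackoverflow post you can also run the following statement directly at the command line: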

options(rpubs.upload.method = "internal")

I tried it, and it can indeed be executed directly.

dayushan commented 7 years ago

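(1) Running it directly in RGUI still fails. Is this because of Windows 7? [screenshot]
(2) Parsing a web page with the rvest package gives list-type data, and writing it to an .html file with write() raises an error. How can I solve this?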
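For question (2), one possible approach is sketched below. It assumes the object comes from rvest::read_html(), which returns an xml2 document; that object is a list of pointers rather than text, which would explain why write() rejects it. The helper names save_page and page_to_text are hypothetical, chosen only for this illustration.

```r
# Hypothetical helper: serialize a parsed page to an .html file.
# Assumes `page` is an xml2 document, e.g. the result of rvest::read_html().
save_page <- function(page, path) {
  xml2::write_html(page, path)
  invisible(path)
}

# Alternative: convert the document to a character string first,
# after which base write()/writeLines() work as they do with getURL() output.
page_to_text <- function(page) {
  as.character(page)
}

# Usage (requires the xml2 package, which rvest depends on):
# page <- xml2::read_html("<html><body><p>hello</p></body></html>")
# save_page(page, "page.html")
# writeLines(page_to_text(page), "page_copy.html")
```

Either route ends with a plain .html file on disk, so the later text-processing steps should be able to read it in the same way as files saved from getURL() output.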

dayushan commented 7 years ago

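[screenshot] When I try to process text with tm, loading the tm package fails. How can I fix this? I also do not know who develops the slam package; is it on GitHub? And is there an alternative to the tm package?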

coderLMN commented 7 years ago

The package installation failure seems to be caused by a problem with the mirror site in China. The error message I saw was:

unable to access index for repository http://mirror.bjtu.edu.cn/cran/bin/macosx/mavericks/contrib/3.1

In the R Package Installer window, choose Other Repository from the drop-down menu at the top left. http://R.research.att.com/ will appear automatically on the right, and then the installation works.
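The same mirror switch can also be made from the R console, which may be more convenient in RGUI on Windows. This is a sketch that assumes the CRAN cloud mirror https://cloud.r-project.org is reachable from your network; any other working mirror URL can be substituted.

```r
# Point this session at a different CRAN mirror (assumed to be reachable).
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm which repository install.packages() will now use by default.
getOption("repos")[["CRAN"]]

# Then retry the installation (requires network access):
# install.packages(c("slam", "tm"))
```

The setting lasts only for the current session, so it has to be repeated after restarting R.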

coderLMN commented 7 years ago

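The RCurl certificate problem may be because the certificate file is not present on your environment's path. I suggest searching for the file; if it is missing, you can download cacert.pem from https://curl.haxx.se/ca/cacert.pem .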

I still suggest trying to solve this problem within the RCurl package. See the certificate material in section 9.1.7 (p. 201) of the book.
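A minimal sketch of that check, assuming a manually downloaded cacert.pem sits in the working directory and that RCurl ships its own copy under CurlSSL/cacert.pem, as the code in the question expects. The helper names locate_ca_bundle and fetch_verified and the fallback argument are hypothetical; they are not part of RCurl or the book.

```r
# Hypothetical helper: return the path to a CA bundle, or "" if none is found.
# `fallback` is an assumed location for a manually downloaded cacert.pem.
locate_ca_bundle <- function(fallback = "cacert.pem") {
  # Prefer a freshly downloaded file, since a bundled copy may be out of date.
  if (file.exists(fallback)) {
    return(normalizePath(fallback))
  }
  # system.file() returns "" when the package or the file is missing.
  bundled <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
  if (nzchar(bundled) && file.exists(bundled)) {
    return(bundled)
  }
  ""
}

# Fetch a page with peer verification, failing early if no bundle is available.
fetch_verified <- function(url, fallback = "cacert.pem") {
  ca <- locate_ca_bundle(fallback)
  if (!nzchar(ca)) {
    stop("No CA bundle found; download cacert.pem from https://curl.haxx.se/ca/cacert.pem")
  }
  RCurl::getURL(url, cainfo = ca)
}

# Usage (requires network access and the RCurl package):
# download.file("https://curl.haxx.se/ca/cacert.pem", "cacert.pem", mode = "wb")
# page <- fetch_verified("https://www.gov.uk/government/announcements")
```

Checking the path first distinguishes "the file is missing" from "the file exists but the server's certificate chain still cannot be verified", which call for different fixes.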

dayushan commented 7 years ago

[screenshot] I loaded the tm package successfully, but why does it still report that prescindMeta() does not exist? How can I fix this?

coderLMN commented 7 years ago

The errata for the original book, http://www.r-datacollection.com/errata/errata.pdf , explain that neither prescindMeta() nor sFilter() works with tm version 0.6 or later; meta() can be used instead:

The prescindMeta() function is defunct as of version 0.6 of the tm package. The meta data can now be gathered with the meta() function.

meta_organisation <- meta(release_corpus, type = "local", tag = "organisation")
meta_publication <- meta(release_corpus, type = "local", tag = "publication")
meta_data <- data.frame(
    organisation = unlist(meta_organisation),
    publication = unlist(meta_publication)
)

The sFilter() function is also defunct. You can filter the corpus using meta().

release_corpus <- release_corpus[
    meta(release_corpus, tag = "organisation") == "Department for Business, Innovation & Skills" |
    meta(release_corpus, tag = "organisation") == "Department for Communities and Local Government" |
    meta(release_corpus, tag = "organisation") == "Department for Environment, Food & Rural Affairs" |
    meta(release_corpus, tag = "organisation") == "Foreign & Commonwealth Office" |
    meta(release_corpus, tag = "organisation") == "Ministry of Defence" |
    meta(release_corpus, tag = "organisation") == "Wales Office"
]

zchunc commented 6 years ago

I ran into the certificate error too. Calling getURL("<URL>") directly, without the CA certificate argument, also runs fine.