Page 244: certificate verification error

dayushan commented 7 years ago

.libPaths("D:/R/library")
library(RCurl)
library(XML)
library(stringr)
library(plyr)
# The content is served from an HTTPS server, so the location of the CA certificate bundle must be specified
all_links <- character()
new_results <- 'government/announcements?keywords=&announcement_filter_option=all&topics%5B%5D=all&departments%5B%5D=all&world_locations%5B%5D=all&from_date=&to_date=1%2F7%2F2010'
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Loop until there is no "next" link left (note the comparison sits outside length())
while (length(new_results) > 0) {
    new_results <- str_c("https://www.gov.uk/", new_results)
    # With ssl.verifypeer = FALSE (no verification) the announcement links are not retrieved
    results <- getURL(new_results, cainfo = signatures)
    # With verification this fails: SSL certificate problem: unable to get local issuer certificate
    results_tree <- htmlParse(results)
    # Collect the announcement links
    all_links <- c(all_links, xpathSApply(results_tree, '//li[@id]//a', xmlGetAttr, 'href'))
    # Get the link to the next page of results
    new_results <- xpathSApply(results_tree, '//nav[@id="show-more-documents"]//li[@class="next"]//a', xmlGetAttr, "href")
}
all_links[1]
length(all_links)

With certificate verification enabled, I get this error:

Error in function (type, msg, asError = TRUE)  : 
SSL certificate problem: unable to get local issuer certificate

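Without certificate verification, the announcement links cannot be downloaded either, i.e. all_links is list(). How should I resolve the certificate verification error in the RCurl package?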

coderLMN commented 7 years ago

I ran this code successfully, except that I did not use the first line, .libPaths("D:/R/library"). Here is the result:

all_links[1]
[1] "/government/speeches/secretary-of-state-for-culture-media-and-sport-written-statement-on-exercise-of-functions-under-the-public-libraries-and-museums-act-1964"
length(all_links)
[1] 2097

dayushan commented 7 years ago

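The statement .libPaths("D:/R/library") just sets the library path, and using it causes no problem. But the code still throws the error when I run it, in RGui with R 3.2.5. What could be the cause, and how can I fix it?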

coderLMN commented 7 years ago

I haven't run into this problem myself, so I can't debug it. But I looked up the error message and found this Stack Overflow post: http://stackoverflow.com/questions/22537180/error-while-publishing-in-r-pubs , which says:

Add an .Rprofile file in the directory you are sending from and place this line:
options(rpubs.upload.method = "internal")
in the .Rprofile or RProfile.site files.

You could give that a try.

dayushan commented 7 years ago

OK, the problem is solved. Thanks!

coderLMN commented 7 years ago

Just saw this. How did you solve it?

dayushan commented 7 years ago

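I stopped using RCurl for CA verification and switched to the rvest package. However, it cannot scrape all the links; it fails with Error in open.connection(x, "rb") : Timeout was reached. Also, with RCurl the result of getURL(url) can be written to an .html file with write() (see p. 244, the paragraph just before Section 10.2 on processing the textual data), but I can't find a function in rvest that does this. What should I do?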

coderLMN commented 7 years ago

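I haven't used the rvest package. My feeling is that your earlier error still comes from a wrong path for .Rprofile or RProfile.site, or else the "site" in RProfile.site should be replaced with the website's domain name.

Whatever the cause, I suggest you stick with RCurl, because the book uses this package heavily.

If you are not sure about the path, then according to the comments on that Stack Overflow post you can also run this statement directly at the command line:

options(rpubs.upload.method = "internal")

I tried it, and it can indeed be run directly.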

dayushan commented 7 years ago

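(1) Running it directly in RGui still fails. Could Windows 7 be the cause? (2) Parsing a page with rvest returns data of type list, and writing it to an .html file with write() throws an error. How can I fix this?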
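On point (2), one possible approach, offered as a sketch rather than a tested fix for this exact page: rvest builds on the xml2 package, and a parsed page is an xml_document object rather than a character string, which is likely why write() rejects it. xml2 provides write_html() for saving such an object, and as.character() turns it back into text that base R can write. The inline HTML string and the file name page.html below are stand-ins chosen for illustration; in practice the document would come from read_html() on a URL.

```r
library(xml2)   # rvest's read_html() comes from xml2

# A tiny stand-in document (assumption: real use would call read_html() on a URL).
page <- read_html("<html><body><ul><li id='a'><a href='/x'>x</a></li></ul></body></html>")

out_file <- file.path(tempdir(), "page.html")

# Option 1: let xml2 serialise the parsed document to disk.
write_html(page, out_file)

# Option 2: convert the document to text first, then write it with base R.
writeLines(as.character(page), out_file)

file.exists(out_file)
```

Either option alone is enough; the second is closer to the write() call used with getURL() in the book.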

dayushan commented 7 years ago

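I'm processing text with tm, but loading the tm package fails. How can I fix that? I also don't know who develops the slam package. Is it on GitHub? And are there any alternatives to the tm package?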

coderLMN commented 7 years ago

The package installation failure seems to be caused by a problem with the mirror site in China. The error message I see is:

unable to access index for repository http://mirror.bjtu.edu.cn/cran/bin/macosx/mavericks/contrib/3.1

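In the R Package Installer window, choose Other Repository from the drop-down menu in the upper-left corner; http://R.research.att.com/ then appears automatically on the right, and the installation works.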
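The same switch can also be made from the R console, which may be handy in RGui on Windows where the installer window looks different. This is a minimal sketch: the choice of https://cloud.r-project.org as the repository is an assumption, and any working CRAN mirror could be substituted.

```r
# Point R at a different CRAN repository for this session
# (assumption: cloud.r-project.org is reachable from your network).
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm which repository install.packages() will now use.
getOption("repos")

# Then install as usual (needs network access):
# install.packages(c("tm", "slam"))
```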

coderLMN commented 7 years ago

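As for the RCurl certificate problem, it may be that the certificate file is missing from the path in your development environment. I suggest searching for the file; if it isn't there, you can download cacert.pem from https://curl.haxx.se/ca/cacert.pem .

I still suggest trying to solve this within the RCurl package. See the certificate-related material in Section 9.1.7 (p. 201) of the book.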
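The check described above could look roughly like this. It is only a sketch: the helper name locate_ca_bundle and the fallback location in tempdir() are made up for illustration, and downloading is off by default so that nothing touches the network unless asked. system.file() returns an empty string when the file (or the package) is not found, which is what the nzchar() test relies on.

```r
# Return a path to a CA bundle: the copy shipped with RCurl if there is one,
# otherwise a local fallback copy (downloaded only when download = TRUE).
# locate_ca_bundle is a hypothetical helper, not part of RCurl.
locate_ca_bundle <- function(fallback = file.path(tempdir(), "cacert.pem"),
                             download = FALSE) {
  bundled <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
  if (nzchar(bundled)) return(bundled)   # "" means the file was not found
  if (!file.exists(fallback) && download) {
    download.file("https://curl.haxx.se/ca/cacert.pem", fallback, mode = "wb")
  }
  fallback
}

signatures <- locate_ca_bundle()

# FALSE here means getURL() would have no bundle to verify against.
file.exists(signatures)

# With a bundle in place, pass it to getURL() as in the original code:
# results <- RCurl::getURL("https://www.gov.uk/", cainfo = signatures)
```

If file.exists(signatures) is FALSE, calling locate_ca_bundle(download = TRUE) fetches the file from the URL mentioned above before retrying.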

dayushan commented 7 years ago

The tm package now loads successfully, but why do I still get an error saying that prescindMeta() does not exist? How can I fix it?

coderLMN commented 7 years ago

The book's errata, http://www.r-datacollection.com/errata/errata.pdf , explain that neither prescindMeta() nor sFilter() works with tm v0.6 or later, and that meta() can be used instead:

The prescindMeta() function is defunct as of version 0.6 of the tm package. The meta data can now be gathered with the meta() function.

meta_organisation <- meta(release_corpus, type = "local", tag = "organisation")
meta_publication <- meta(release_corpus, type = "local", tag = "publication")
meta_data <- data.frame(
    organisation = unlist(meta_organisation),
    publication = unlist(meta_publication)
)

The sFilter() function is also defunct. You can filter the corpus using meta().

release_corpus <- release_corpus[
    meta(release_corpus, tag = "organisation") == "Department for Business, Innovation & Skills" |
    meta(release_corpus, tag = "organisation") == "Department for Communities and Local Government" |
    meta(release_corpus, tag = "organisation") == "Department for Environment, Food & Rural Affairs" |
    meta(release_corpus, tag = "organisation") == "Foreign & Commonwealth Office" |
    meta(release_corpus, tag = "organisation") == "Ministry of Defence" |
    meta(release_corpus, tag = "organisation") == "Wales Office"
]

zchunc commented 6 years ago

I ran into the certificate error too. Calling getURL("url") directly, without the CA certificate argument, also runs fine.

coderLMN / AutomatedDataCollectionWithR

Page 244: certificate verification error #9