P. 298: Parsing information from semi-structured documents

shangdawen commented 6 years ago

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
### Download the files

doc <- htmlParse(url,encoding='UTF-8')
Warning message:
XML content does not seem to be XML: 'https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/' 
hrefs<-xpathSApply(doc,"//a/@href") 
names<-xpathSApply(doc,"//tbody/tr/td",xmlValue)     # xmlValue returns the text content of each td node

How on earth am I supposed to parse this?

shangdawen commented 6 years ago

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"

Download the files

tmp<-getURL(url)
Error in function (type, msg, asError = TRUE) : 
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

shangdawen commented 6 years ago

### Alternative way to download the files
z<-c('19l03','19l05','19l06','19l07','19l08','19l13','19l17','19l19','19l24','19l38','19l39','19l40',
     '19l41','19l42','19l43','19l44','19l45','20h02','20h06','20h12','20h13','20k03','20k04','20k05',
     '20k13','20k25','20k27','20k30','20k31','20l02','20l06','20l10')
for(i in 1:length(z)){
dizhi<-paste('https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/',z[i],'s_tavg.txt',sep="")
x<-basename(dizhi)
download.file(dizhi,file.path("C:\\Users\\kongwen\\Downloads\\Wiley-ADCR-master\\ch-13-parsing-tables\\Data",x))   # x already ends in .txt
}

coderLMN commented 6 years ago

Page 298 downloads the files over the FTP protocol. Did you write this code yourself?

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
### Download the files

doc <- htmlParse(url,encoding='UTF-8')
Warning message:
XML content does not seem to be XML:

This treats the URL as an HTML page to be parsed.

The book's original code, on the other hand, is:

ftp <- "ftp://ftp.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
filelist <- getURL(ftp, dirlistonly = TRUE)

shangdawen commented 6 years ago

I wrote the code myself, but this site is no longer served over FTP; it now uses HTTPS.

coderLMN commented 6 years ago

If the site has moved to HTTPS, then your second block of code should work:

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
### Download the files
tmp<-getURL(url)

It works fine for me. If you get the error shown above, try adding the argument ssl.verifypeer = FALSE to the getURL(url) call:

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
### Download the files
tmp<-getURL(url,ssl.verifypeer = FALSE)

shangdawen commented 6 years ago

 url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
 getURL(url,ssl.verifypeer=FALSE)

Error in function (type, msg, asError = TRUE)  : 
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

Still the same error.

coderLMN commented 6 years ago

I loaded two libraries, XML and RCurl; I don't know which ones you are using. If nothing else works, try the method the book uses for accessing HTTPS pages.

shangdawen commented 6 years ago

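Hi, which statement in R can check whether an XPath exists? I use a loop to extract values by XPath, but on some pages the XPath does not exist, so every run throws an error and the loop never completes. I am looking for a function that tests whether an XPath exists. Is there one?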

xpathSApply(lianjie3,'/html/body/div[2]/div[2]/div[2]/div[1]/div[2]/div[2]/div/div[1]/div[1]/div/span[@class="tag tag1"]',xmlValue)

coderLMN commented 6 years ago

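If a static page has a regular structure, the XPath will generally be there, so the error is probably caused by an incorrect XPath expression. Build it up one level at a time: start with the outermost part, '/html/body/div[2]', check whether the div it returns is the one you want, then add the next div level, narrowing the scope step by step.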
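
On the narrower question of testing whether an XPath matches anything: in the XML package a query with no match returns an empty result, so checking its length before extracting values is one way to keep a loop running. This is a minimal sketch on a made-up inline HTML snippet; the helper name xpath_or_na is hypothetical and does not come from the book.

```r
library(XML)

# Made-up stand-in for a downloaded page
html <- '<html><body><div><span class="tag tag1">P2P</span></div></body></html>'
doc <- htmlParse(html, asText = TRUE)

# Return the text of all matching nodes, or NA when the XPath matches nothing
xpath_or_na <- function(doc, path) {
  nodes <- getNodeSet(doc, path)
  if (length(nodes) == 0) return(NA_character_)
  vapply(nodes, xmlValue, character(1))
}

found  <- xpath_or_na(doc, '//span[@class="tag tag1"]')
absent <- xpath_or_na(doc, '//span[@class="no-such-class"]')
```

Inside a loop over many pages, the NA marks pages where the element is missing, so the remaining pages can still be processed.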

shangdawen commented 6 years ago

Hello, I want to scrape the problem-platform data from Wangdaizhijia (网贷之家). The URL is https://shuju.wdzj.com/problem-1.html and my code is as follows:

library("XML", lib.loc="~/R/win-library/3.4")
library("stringr", lib.loc="~/R/win-library/3.4")
library("RCurl", lib.loc="~/R/win-library/3.4")

# Set the request headers
header<-c("User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
          "Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language"="en-us",
          "Connection"="keep-alive",
          "Accept-Charset"="GB2312,UTF-8;q=0.7,*;q=0.7")
wenti<-'https://shuju.wdzj.com/problem-1.html'
wenti_url<-getURL(wenti,.encoding = 'UTF-8')
wenti_parse<-htmlParse(wenti_url,encoding = 'UTF-8')
lianjie<-xpathSApply(wenti_parse,'//*[@id="sortTable"]//td[2]/div/a',xmlGetAttr,"href")  # extract the links

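But this only extracts the 20 platforms that are visible before logging in. How can I log in to Wangdaizhijia from R and then extract the links for all the problem platforms?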

coderLMN commented 6 years ago

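For a site that requires login, you can use Selenium; see section 9.1.9 of the book for details. The Rwebdriver package recommended there does not work well, though, so I recommend RSelenium instead. You can also refer to this discussion: https://github.com/coderLMN/AutomatedDataCollectionWithR/issues/8#issuecomment-260245460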

shangdawen commented 6 years ago

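Can you recommend any books on web scraping with R? So far I have only bought "Automated Data Collection with R" (《基于R语言的自动数据收集》), but I feel I need to learn more: I can't use the pipe operator or regular expressions, and I can't write more complex scrapers, so I would like to read further.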

coderLMN commented 6 years ago

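This book is already quite comprehensive: AJAX, XPath, regular expressions, plus an understanding of web page structure. Once you have mastered these, complex scrapers are no problem. I don't know much about other books, and even if there are some, they will cover much the same ground. For learning a technology, I think reading one book closely is more effective than skimming many.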

coderLMN commented 6 years ago

The findElement calls in the RSelenium documentation differ from yours, for example:

webElem <- remDr$findElement(using = 'css', "input[name='q']")    

webElem <- remDr$findElement(using = 'css', "[name='q']")

webElem <- remDr$findElement('css', "[class = 'gsfi lst-d-f']")      # CSS selector approach

webElem <- remDr$findElement('xpath', "//input[@id = 'lst-ib']")    # XPath approach

webElem$sendKeysToElement(list("R Cran", "\uE007"))

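The '//[@id="logusername"]' you gave is an XPath, but you declared CSS before it, and the argument value also seems inconsistent with the documentation. So I suspect that the 'css selector' argument is wrong, which makes findElement return null. Try changing 'css selector' to 'xpath'.

I also noticed that the leading slashes // of your two XPaths look different. Could one of them have been typed in Chinese input mode? That may be worth checking too.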

we0530 commented 6 years ago

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
### Download the files
filelist<-getURL(url,ssl.verifypeer = FALSE)
if(!file.exists("Data")) dir.create("Data")

# get list of files from ftp
filelist <- getURL(url, dirlistonly = TRUE )
filelist <- unlist(str_split(filelist,"\r\n"))
filelist <- filelist[!filelist==""]
filelist

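Can anyone help me find a replacement for the URL above, also with weather data that can be downloaded as txt files? I have to hand in the assignment for this chapter next week and prepare slides as well. Any help would be much appreciated; it's urgent. Thanks!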

coderLMN commented 6 years ago

The URL https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/ is accessible from here. Please try again.

we0530 commented 6 years ago

url<-"https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
filelist<-getURL(url,dirlistonly=TRUE)

Error in function (type, msg, asError = TRUE)  : 
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

Now there is a new error. It seems TLSv1 needs to be upgraded, but I don't know how to do that. Could you help me with this? Thanks.

coderLMN commented 6 years ago

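This URL is not an FTP resource; it is served over HTTPS, so the dirlistonly = TRUE option does not apply. See the list of options on p. 110 and read the page content over HTTP(S) instead of using the FTP download approach from the book.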
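
Over HTTPS the directory arrives as an HTML index page, so the file names can be taken from the link targets instead of from dirlistonly. This is a minimal sketch with the XML package, assuming the index is a plain list of a/href links; the inline HTML is a made-up stand-in for the real page, which would first have to be fetched.

```r
library(XML)

# Made-up stand-in for the directory index page
index_html <- paste0(
  '<html><body>',
  '<a href="19l03s_tavg.txt">19l03s_tavg.txt</a>',
  '<a href="19l03s_tmax.txt">19l03s_tmax.txt</a>',
  '<a href="20h02s_tavg.txt">20h02s_tavg.txt</a>',
  '</body></html>'
)
doc <- htmlParse(index_html, asText = TRUE)

# Collect every link target, then keep only the average-temperature files
hrefs    <- xpathSApply(doc, "//a", xmlGetAttr, "href")
filesavg <- hrefs[grepl("tavg\\.txt$", hrefs)]
```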

we0530 commented 6 years ago

I tried all of them and still get the error error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

I can't figure out what is wrong; it's so frustrating.

coderLMN commented 6 years ago

My environment doesn't have this problem. How about dropping RCurl and trying the httr package instead?
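
A minimal httr sketch of that suggestion follows. httr is built on the curl package rather than RCurl, so it may negotiate a newer TLS version on a setup where RCurl fails, although that depends on the local installation. The helper name fetch_page is hypothetical.

```r
library(httr)

# Fetch a page over HTTP(S) and return its body as text
fetch_page <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
  content(resp, as = "text", encoding = "UTF-8")
}

# Not run here, since it needs network access:
# page <- fetch_page("https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/")
```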

we0530 commented 6 years ago

Many thanks for your patient answers. My problem is solved; I made some changes and am sharing them here:

library(stringr)#provides str_split() and str_detect()
options(download.file.method="libcurl")#make sure the HTTPS connection to the server succeeds
setwd("D:/Documents")#working directory that contains the Data folder
if(!file.exists("Data")) dir.create("Data")#create the download folder
z<-c('19l03','19l05','19l06','19l07','19l08','19l13','19l17','19l19','19l24','19l38','19l39','19l40',
     '19l41','19l42','19l43','19l44','19l45','20h02','20h06','20h12','20h13','20k03','20k04','20k05',
     '20k13','20k25','20k27','20k30','20k31','20l02','20l06','20l10')
for(i in 1:length(z)){
  dizhi<-paste('https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/',z[i],'s_tavg.txt',sep="")
  x<-basename(dizhi)#already ends in .txt
  download.file(dizhi,file.path("Data",x))
}#download the data in a loop; s_tmax.txt and s_tmin.txt can be downloaded the same way

length(list.files("Data"))#number of downloaded files
list.files("Data")[1:3]

filelist <- unlist(str_split(list.files("Data"),"\r\n"))#get the file names, splitting on line breaks
filelist <- filelist[!filelist==""]
filelist

filesavg <- str_detect(filelist,"tavg")#find the files whose names contain tavg
filesavg<-filelist[filesavg]#keep only those
filesavg[1:3]#show the first three
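
The same URLs can also be built without a loop, because paste0() is vectorised. This is a small sketch under the assumption that the tmax and tmin files follow the same naming pattern as tavg, as the loop comment suggests; build_urls is a hypothetical helper name.

```r
base_url <- "https://www.wcc.nrcs.usda.gov/ftpref/data/climate/table/temperature/history/california/"
stations <- c("19l03", "19l05", "20h02")  # a subset of the station codes above

# Build one URL per station for a given file suffix ("tavg", "tmax" or "tmin")
build_urls <- function(stations, suffix) {
  paste0(base_url, stations, "s_", suffix, ".txt")
}

urls_avg <- build_urls(stations, "tavg")
urls_max <- build_urls(stations, "tmax")
dest_avg <- file.path("Data", basename(urls_avg))  # local target paths

# Not run here, since it needs network access:
# for (i in seq_along(urls_avg)) download.file(urls_avg[i], dest_avg[i])
```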

we0530 commented 6 years ago

The code on p. 302 gives the wrong result and I don't know why.

temperatures <- str_extract(txtparts, "day.*") # keep "day" and everything after it
tempData<- data.frame(avgtemp=NA, day=NA, month=NA, year=NA, id="", name="")
tf    <- tempfile() # first write the temperature table to a temporary file created with tempfile()
writeLines(temperatures, tf)
temptable<-read.fwf(tf,widths=c(3,7,rep(6,11)),stringsAsFactors=FALSE)
temptable[c(1:5,32:38),1:10]

The output is wrong:
V1      V2     V3     V4     V5     V6     V7     V8     V9    V10
1  day     oct    nov    dec    jan    feb    mar    apr    may    jun
2  day     oct    nov    dec    jan    feb    mar    apr    may    jun
3  day     oct    nov    dec    jan    feb    mar    apr    may    jun
4  day     oct    nov    dec    jan    feb    mar    apr    may    jun
5  day     oct    nov    dec    jan    feb    mar    apr    may    jun
32 day     oct    nov    dec    jan    feb    mar    apr    may    jun
33 day     oct    nov    dec    jan    feb    mar    apr    may    jun
34 day     oct    nov    dec    jan    feb    mar    apr    may    jun
35 day     oct    nov    dec    jan    feb    mar    apr    may    jun
36 day     oct    nov    dec    jan    feb    mar    apr    may    jun
37 day     oct    nov    dec    jan    feb    mar    apr    may    jun
38 day     oct    nov    dec    jan    feb    mar    apr    may    jun

coderLMN commented 6 years ago

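It looks like the data format has changed. Have a look at the book's code on GitHub: https://github.com/crubba/Wiley-ADCR/blob/master/ch-13-parsing-tables/ch-13-parsing-tables.r ; the regular expressions there all differ from the ones in the book.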

we0530 commented 6 years ago

start <- proc.time()
temperatures <- str_extract(txtparts, "day.*") 
tempData <- data.frame(avgtemp=NA, day=NA, month=NA, year=NA, id="", name="")
day      <- rep(1:31, 12)
month    <- rep( c(10:12,1:9), each=31 ) 

if(F==T){
for(i in seq_along(txtparts)){
    tf <- tempfile()
    writeLines(temperatures[i], tf)
    temptable <- read.fwf(tf, width=c(3,7,6,6,6,6,6,6,6,6,6,6,6), stringsAsFactors=F)
    temptable <- temptable[3:33, -1]
    temptable <- suppressWarnings(as.numeric(unlist(temptable)))
    temptable <- data.frame( avgtemp=temptable, day=day,      month=month, 
                             year=year[i],      name=name[i], id=id[i]     )
    tempData <- rbind(tempData, temptable)
}
                                                proc.time() - start
}

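Hello, the code is shown above and I don't quite understand it. What is if(F==T)? And why does it use proc.time() - start and proc.time()? This code doesn't seem coherent, and running it all together doesn't reproduce the book's result. Also, I couldn't find CA_sites.dat, the weather station data from section 13.3. Do you have a link? Thanks.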

coderLMN commented 6 years ago

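if(F==T) simply disables that block of code, as if it were commented out, because the condition is never true. proc.time() is used to measure processing time. Neither matters here. The point is to study how the text data is processed there. For example, temperatures <- str_extract(txtparts, "day[\\s\\S]*") uses a different regular expression from the book, and there are other lines such as temptable <- read.fwf(tf, width=c(3,7,6,6,6,6,6,6,6,6,6,6,6), stringsAsFactors=F). Check the result after running each statement; that makes it easier to locate where things go wrong and to work out why. I couldn't find CA_sites.dat either; it may take some more searching.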
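
The difference between the two patterns can be seen on a small example. In stringr, "." does not match a line break by default, so "day.*" stops at the end of the header line, which would be consistent with the repeated header rows in the output above. "[\\s\\S]" matches any character including line breaks. The sample text is made up and much shorter than a real station file.

```r
library(stringr)

# Made-up miniature of one station file: a title line, a header line, two data rows
txt <- "Station : CA19L03S, HAGAN'S MEADOW\nday   oct   nov\n  1    50    41\n  2    48    39"

header_only <- str_extract(txt, "day.*")         # stops at the first line break
whole_table <- str_extract(txt, "day[\\s\\S]*")  # runs to the end of the text
```

Printing both values side by side is a quick way to confirm which pattern fits the current file format before running read.fwf on the result.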

we0530 commented 6 years ago

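I have searched for CA_sites.dat for a long time and still can't find it. Could you help me look for it? It is the only thing I am still missing for this chapter. Thanks.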

coderLMN commented 6 years ago

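https://wrcc.dri.edu/Monitoring/Stations/station_inventory_show.php?snet=snotel&sstate=CA I had a look at this page and it appears to be the right data.

I am pasting the content here as well, in case that page also disappears in the future: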

Station Data Inventory Listings
Snotel Network: California

 WRCC Snotel Inventory.  Last updated 970307.  Kelly Redmond.

 Hbk5 NRCSID STNUM Sitename             Lat. Long. Elev. SDPXNV Start  End   
----- ------ ----- -------------------- ---- ----- ----- ------ ------ ------
ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101000 841001 890930
ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101111 891001 999999
BLAC1 19L05S 04002 BLUE LAKES           3836 11955  8000 101000 801001 830930
BLAC1 19L05S 04002 BLUE LAKES           3836 11955  8000 101111 831001 999999
CDRC1 20H06S 04003 CEDAR PASS           4135 12018  7100 101000 781001 900930
CDRC1 20H06S 04003 CEDAR PASS           4135 12018  7100 101111 901001 999999
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 100000 811001 830930
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101000 831001 860930
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101110 861001 870714
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101111 870715 999999
DMLC1 20H12S 04005 DISMAL SWAMP         4158 12010  7000 101000 801001 890930
DMLC1 20H12S 04005 DISMAL SWAMP         4158 12010  7000 101111 891001 999999
EFTC1 19L19S 04006 EBBETTS PASS         3833 11948  8700 101000 781001 870930
EFTC1 19L19S 04006 EBBETTS PASS         3833 11948  8700 101111 871001 999999
ECOC1 20L06S 04007 ECHO PEAK            3851 12004  7800 101000 801001 840930
ECOC1 20L06S 04007 ECHO PEAK            3851 12004  7800 101111 840930 999999
FLFC1 20L10S 04008 FALLEN LEAF          3856 12003  6300 101000 791001 900930
FLFC1 20L10S 04008 FALLEN LEAF          3856 12003  6300 101111 901001 999999
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101000 781001 861001
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101110 861002 870614
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101111 870615 999999
HVNC1 19L24S 04010 HEAVENLY VALLEY      3856 11954  8850 101000 781001 900930
HVNC1 19L24S 04010 HEAVENLY VALLEY      3856 11954  8850 101111 901001 999999
ICPC1 20K04S 04011 INDEPENDENCE CAMP    3927 12017  7000 101000 781001 830930
ICPC1 20K04S 04011 INDEPENDENCE CAMP    3927 12017  7000 101111 831001 999999
ICKC1 20K03S 04012 INDEPENDENCE CREEK   3929 12017  6500 101000 801001 900930
ICKC1 20K03S 04012 INDEPENDENCE CREEK   3929 12017  6500 101111 901001 999999
ILKC1 20K05S 04013 INDEPENDENCE LAKE    3925 12019  8450 101000 781001 940930
ILKC1 20K05S 04013 INDEPENDENCE LAKE    3925 12019  8450 101111 941001 999999
LELC1 19L38S 04014 LEAVITT LAKE         3816 11937  9400 101111 891001 999999
LVTC1 19L08S 04015 LEAVITT MEADOWS      3820 11933  7200 101000 801001 890930
LVTC1 19L08S 04015 LEAVITT MEADOWS      3820 11933  7200 101111 891001 999999
LOBC1 19L17S 04016 LOBDELL LAKE         3826 11922  9200 101000 781001 890930
LOBC1 19L17S 04016 LOBDELL LAKE         3826 11922  9200 101111 891001 999999
MNPC1 19L40S 04017 MONITOR PASS         3835 11936  8350 101111 901001 999999
XXXC1 19L06S 04018 POISON FLAT          3830 11938  7900 101000 801001 870930 ?
XXXC1 19L06S 04018 POISON FLAT          3830 11938  7900 101111 881001 999999
RUBC1 20L02S 04019 RUBICON #2           3900 12008  7500 101000 801001 900930
RUBC1 20L02S 04019 RUBICON #2           3900 12008  7500 101111 901001 999999
SRAC1 19L07S 04020 SONORA PASS          3819 11936  8800 101000 781001 820930
SRAC1 19L07S 04020 SONORA PASS          3819 11936  8800 101111 821001 999999
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101000 801001 880930
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101110 881001 890418
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101111 890419 999999
SQWC1 20K30S 04022 SQUAW VALLEY G.C.    3911 12015  8200 101000 801001 900930
SQWC1 20K30S 04022 SQUAW VALLEY G.C.    3911 12015  8200 101111 901001 999999
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 001000 801001 810930
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101000 811001 880930
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101110 881001 890417
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101111 890418 999999
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101000 801001 880930
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101110 881001 890417
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101111 890418 999999
VGAC1 19L13S 04025 VIRGINIA LAKES RIDGE 3805 11915  9200 101000 781001 820930
VGAC1 19L13S 04025 VIRGINIA LAKES RIDGE 3805 11915  9200 101111 821001 999999
WRDC1 20K25S 04026 WARD CREEK #3        3908 12014  6750 101000 781001 900930
WRDC1 20K25S 04026 WARD CREEK #3        3908 12014  6750 101111 901001 999999
????? 19L18S 04027 WET MEADOWS          3837 11952  8050 101000 801001 900731

  Hbk5 is National Weather Service Handbook 5 ID.  Question mark (?) follows 
   if guessed or inferred from circumstantial evidence
  ????? means NRCS ID exists, but Handbook 5 ID not found
  NRCSID is NRCS ID
  NWS Handbook 5 IDs may not have been assigned for now-deactivated stations;
   NWS Handbook 5 IDs were taken from NWS Location Identifier software
  Sitename is NRCS name; none appear to have changed during the lifetime 
   of the network
  Lat is Deg Min N
  Lon is Deg Min W
  Elevation is in feet, from NRCS files
  NWS Handbook 5 position/elevations often differ from NRCS values, NRCS used
  SDPXNV is indicator of elements reported. 1-present, 0-absent
  S-Snow Water Equivalent, D-Depth of snow, P-Precipitation
  X-Maximum Temp, N-Minimum Temp, V-Average Daily Temp
  Start is start date for this entry in format yymmdd
  End is end date for this entry in format yymmdd
  New entry for every change in IDs, names, positions, elements reported
  ? at very end indicates major uncertainty about station name or ID

we0530 commented 6 years ago

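@coderLMN stationData <- read.csv("Data_CA/CA_sites.dat", header=F, sep="|")[,-c(1:3,8:10)] The downloaded .dat file can't be read directly, so this read call needs to be changed. For the weather station part of this .dat file, do I need to parse it like a semi-structured document and extract the fields Hbk5 NRCSID STNUM Sitename Lat. Long. Elev. SDPXNV Start End? I tried but couldn't parse it.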

coderLMN commented 6 years ago

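This file's format differs from the original file. For instance, the argument sep="|" is wrong, because the fields are separated by the tab character \t rather than |. You will need to parse the rest yourself and check whether it comes out correctly.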

coderLMN commented 6 years ago

The fields you need should be just these, right?

NRCSID Sitename             Lat. Long. Elev.

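The rest can be omitted. If you want to use header=F, keep only the data below the row of dashes and don't save anything else in the file; that makes parsing easier. You can also save it as a .csv file first, open it in Excel, delete the columns you don't need, and then read the remaining fields into R.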

we0530 commented 6 years ago

Yes. How exactly do I save it as a csv with the method you describe? Thank you!

coderLMN commented 6 years ago

For example, open a text editor and paste in this content:

ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101000 841001 890930
ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101111 891001 999999
BLAC1 19L05S 04002 BLUE LAKES           3836 11955  8000 101000 801001 830930
BLAC1 19L05S 04002 BLUE LAKES           3836 11955  8000 101111 831001 999999
CDRC1 20H06S 04003 CEDAR PASS           4135 12018  7100 101000 781001 900930
CDRC1 20H06S 04003 CEDAR PASS           4135 12018  7100 101111 901001 999999
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 100000 811001 830930
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101000 831001 860930
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101110 861001 870714
CSSC1 20K31S 04004 CSS LAB              3920 12022  6900 101111 870715 999999
DMLC1 20H12S 04005 DISMAL SWAMP         4158 12010  7000 101000 801001 890930
DMLC1 20H12S 04005 DISMAL SWAMP         4158 12010  7000 101111 891001 999999
EFTC1 19L19S 04006 EBBETTS PASS         3833 11948  8700 101000 781001 870930
EFTC1 19L19S 04006 EBBETTS PASS         3833 11948  8700 101111 871001 999999
ECOC1 20L06S 04007 ECHO PEAK            3851 12004  7800 101000 801001 840930
ECOC1 20L06S 04007 ECHO PEAK            3851 12004  7800 101111 840930 999999
FLFC1 20L10S 04008 FALLEN LEAF          3856 12003  6300 101000 791001 900930
FLFC1 20L10S 04008 FALLEN LEAF          3856 12003  6300 101111 901001 999999
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101000 781001 861001
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101110 861002 870614
HGNC1 19L03S 04009 HAGAN'S MEADOW       3851 11956  8000 101111 870615 999999
HVNC1 19L24S 04010 HEAVENLY VALLEY      3856 11954  8850 101000 781001 900930
HVNC1 19L24S 04010 HEAVENLY VALLEY      3856 11954  8850 101111 901001 999999
ICPC1 20K04S 04011 INDEPENDENCE CAMP    3927 12017  7000 101000 781001 830930
ICPC1 20K04S 04011 INDEPENDENCE CAMP    3927 12017  7000 101111 831001 999999
ICKC1 20K03S 04012 INDEPENDENCE CREEK   3929 12017  6500 101000 801001 900930
ICKC1 20K03S 04012 INDEPENDENCE CREEK   3929 12017  6500 101111 901001 999999
ILKC1 20K05S 04013 INDEPENDENCE LAKE    3925 12019  8450 101000 781001 940930
ILKC1 20K05S 04013 INDEPENDENCE LAKE    3925 12019  8450 101111 941001 999999
LELC1 19L38S 04014 LEAVITT LAKE         3816 11937  9400 101111 891001 999999
LVTC1 19L08S 04015 LEAVITT MEADOWS      3820 11933  7200 101000 801001 890930
LVTC1 19L08S 04015 LEAVITT MEADOWS      3820 11933  7200 101111 891001 999999
LOBC1 19L17S 04016 LOBDELL LAKE         3826 11922  9200 101000 781001 890930
LOBC1 19L17S 04016 LOBDELL LAKE         3826 11922  9200 101111 891001 999999
MNPC1 19L40S 04017 MONITOR PASS         3835 11936  8350 101111 901001 999999
XXXC1 19L06S 04018 POISON FLAT          3830 11938  7900 101000 801001 870930 ?
XXXC1 19L06S 04018 POISON FLAT          3830 11938  7900 101111 881001 999999
RUBC1 20L02S 04019 RUBICON #2           3900 12008  7500 101000 801001 900930
RUBC1 20L02S 04019 RUBICON #2           3900 12008  7500 101111 901001 999999
SRAC1 19L07S 04020 SONORA PASS          3819 11936  8800 101000 781001 820930
SRAC1 19L07S 04020 SONORA PASS          3819 11936  8800 101111 821001 999999
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101000 801001 880930
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101110 881001 890418
SPCC1 19L39S 04021 SPRATT CREEK         3840 11949  6200 101111 890419 999999
SQWC1 20K30S 04022 SQUAW VALLEY G.C.    3911 12015  8200 101000 801001 900930
SQWC1 20K30S 04022 SQUAW VALLEY G.C.    3911 12015  8200 101111 901001 999999
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 001000 801001 810930
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101000 811001 880930
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101110 881001 890417
THOC1 20K27S 04023 TAHOE CITY CROSS     3910 12009  6750 101111 890418 999999
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101000 801001 880930
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101110 881001 890417
TRUC1 20K13S 04024 TRUCKEE #2           3918 12012  6400 101111 890418 999999
VGAC1 19L13S 04025 VIRGINIA LAKES RIDGE 3805 11915  9200 101000 781001 820930
VGAC1 19L13S 04025 VIRGINIA LAKES RIDGE 3805 11915  9200 101111 821001 999999
WRDC1 20K25S 04026 WARD CREEK #3        3908 12014  6750 101000 781001 900930
WRDC1 20K25S 04026 WARD CREEK #3        3908 12014  6750 101111 901001 999999
????? 19L18S 04027 WET MEADOWS          3837 11952  8050 101000 801001 900731

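Then save it as CA_sites.csv, and you can open it in Excel.

Also, every value in the second column needs the prefix 'CA'; for example, the first entry 20H13S should become CA20H13S, so that it matches the format of the id field in the book's code. I haven't had time to check the other fields closely; compare this file's content with the fields listed in the book to see whether anything else needs correcting.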
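
The same clean-up can be done in R without going through Excel. In the listing as pasted in this thread the columns are aligned at fixed character positions, so substr() can cut out each field. This is a sketch under that assumption; the positions were counted from the pasted rows, and the column names id, name, lat, lon and elev are illustrative guesses, not taken from the book. The conversion to decimal degrees follows the legend (degrees and minutes, north and west) and is an optional extra.

```r
# Three rows copied from the listing above (two entries for ADIN MTN, one for RUBICON #2)
lines <- c(
  "ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101000 841001 890930",
  "ADMC1 20H13S 04001 ADIN MTN             4115 12046  6200 101111 891001 999999",
  "RUBC1 20L02S 04019 RUBICON #2           3900 12008  7500 101000 801001 900930"
)

# Convert a degrees-minutes number such as 4115 (41 deg 15 min) to decimal degrees
degmin_to_decimal <- function(x) x %/% 100 + (x %% 100) / 60

stations <- data.frame(
  id   = paste0("CA", substr(lines, 7, 12)),              # NRCSID with the CA prefix
  name = trimws(substr(lines, 20, 39)),                   # Sitename
  lat  = degmin_to_decimal(as.numeric(substr(lines, 41, 44))),
  lon  = -degmin_to_decimal(as.numeric(substr(lines, 46, 50))),  # west longitude is negative
  elev = as.numeric(substr(lines, 52, 56)),               # feet
  stringsAsFactors = FALSE
)

# The inventory has one row per change in reported elements; keep one row per station
stations <- stations[!duplicated(stations$id), ]
```

With the full listing saved to a text file, readLines() would supply the lines vector instead of the literal strings used here.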

we0530 commented 6 years ago

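@coderLMN Earlier I copied and pasted it into a txt file and, surprisingly, that didn't work. Copying the data, pasting it into Excel, splitting it into columns, and then reading it in worked. Thank you!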

we0530 commented 6 years ago

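@coderLMN I can't reach Google Maps for RgoogleMaps from my network, and the Baidu Maps package RbaiduMaps is no longer available. Could you download the two images used in the program, map1.png and map2.png, for me?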

coderLMN commented 6 years ago

we0530 commented 6 years ago

@coderLMN The Google Maps package RgoogleMaps doesn't work now, and RbaiduMaps is defunct. Is there a similar package that works in China?

coderLMN commented 6 years ago

You can search for R map packages; for example, I found this one: http://blog.163.com/digoal@126/blog/static/16387704020153154589234/

weiwudi commented 6 years ago

u <- "http://www.elections.state.md.us/elections/2012/election_data/index.html"
a<-getURL(u)
page_parse <- htmlParse(u, encoding = "utf-8")
Error: failed to load external entity "http://www.elections.state.md.us/elections/2012/election_data/index.html"

Why can't htmlParse parse this page?

coderLMN commented 6 years ago

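This page has moved to HTTPS and automatically redirects to https://elections.maryland.gov/elections/2012/election_data/index.html

So when RCurl accesses it, the SSL certificate cannot be matched to the domain http://www.elections.state.md.us, which causes the error.

The following code worked for me: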

u <- "https://elections.maryland.gov/elections/2012/election_data/index.html"
a <- getURL(u, ssl.verifypeer = FALSE, encoding = 'UTF-8')

Sarahbiu commented 5 years ago

Google is reachable, but the map data never downloads:

library(RgoogleMaps)
library(png)
map <- GetOsmMap(latR = c(37.5,42),lonR = c(-125,-115), scale = 5000000, destfile = "map.png", GRAYSCALE = TRUE, NEWMAP = TRUE)
[1] "http://tile.openstreetmap.org/cgi-bin/export?bbox=-125,37.5,-115,42&scale=5000000&format=png"
trying URL 'http://tile.openstreetmap.org/cgi-bin/export?bbox=-125,37.5,-115,42&scale=5000000&format=png'
Error in download.file(url, destfile, mode = "wb", quiet = FALSE) : 
  cannot open URL 'http://tile.openstreetmap.org/cgi-bin/export?bbox=-125,37.5,-115,42&scale=5000000&format=png'
In addition: Warning message:
In download.file(url, destfile, mode = "wb", quiet = FALSE) :
  cannot open URL 'http://tile.openstreetmap.org/cgi-bin/export?bbox=-125,37.5,-115,42&scale=5000000&format=png': HTTP status was '400 Bad Request'

coderLMN commented 5 years ago

This package's interface has changed; the call is now:

map <- GetMap(center = c(37.5,42), zoom = 5, destfile = "map.png", GRAYSCALE = TRUE, NEWMAP = TRUE)

Also, Google Maps now requires you to register an API key before it can be used (Google returns the error: The Google Maps Platform server rejected your request. You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account).

For detailed documentation, see: http://rgooglemaps.r-forge.r-project.org/

Sarahbiu commented 5 years ago

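Hello, I am still confused about using Google Maps in R. I applied for the API key on Google Cloud Platform; how do I hook it up in R? Does the API key require identity verification with an international credit card? Could you be more specific? Thanks.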

coderLMN commented 5 years ago

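The interface seems to have changed again in the new version. I am busy for the next couple of days; I will try to get some code working for you by the weekend.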

Sarahbiu commented 5 years ago

OK, thank you!

coderLMN commented 5 years ago

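Applying for an API key requires setting up a payment method first, because the Google Maps API is a paid service; at low volumes the charge is 2 US dollars per 1,000 calls. See the Google documentation: https://developers.google.com/maps/documentation/maps-static/get-api-key (applying for an API key) and https://developers.google.com/maps/documentation/maps-static/usage-and-billing (usage and billing).

In R, you supply the API key in the plotting code:

map <- GetMap(center = c(37.5,42), zoom = 5, destfile = "map.png", GRAYSCALE = TRUE, NEWMAP = TRUE,             # previously these were the only arguments; now the one below is also required
              API_console_key = "your API key")    # pass your API key through this argument

You can find the relevant information in the RgoogleMaps documentation at https://www.rdocumentation.org/packages/RgoogleMaps/versions/1.4.3/topics/GetMap .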

Sarahbiu commented 5 years ago

Thank you very much for your kind help!

coderLMN / AutomatedDataCollectionWithR

P. 298: Parsing information from semi-structured documents #19
