coderLMN / AutomatedDataCollectionWithR

Reader discussion forum for the Chinese edition of *Automated Data Collection with R*

Chapter 16: Gathering Data on Mobile Phones #20

Open we0530 opened 6 years ago

we0530 commented 6 years ago

@coderLMN Hello!

library(stringr); library(RCurl); library(XML)
# baseURL and keyword as defined earlier in the chapter's code;
# note: getURL()'s encoding argument is spelled .encoding
url <- str_c(baseURL, keyword)
firstSearchPage <- getURL(url, .encoding = "UTF-8")
parsedFirstSearchPage <- htmlParse(firstSearchPage, encoding = "UTF-8")

The HTML this produces doesn't match what's on the actual web page; parts are missing and there are garbled characters. What could be causing this?

Here is the output I get:

   <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>

        <div class="a-text-center a-spacing-small a-size-mini">
            <a href="https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&amp;nodeId=508088">Conditions of Use</a>
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>
            <a href="https://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&amp;nodeId=468496">Privacy Policy</a>
        </div>
        <div class="a-text-center a-size-mini a-color-secondary">
          © 1996-2014, Amazon.com, Inc. or its affiliates
          <script>
           if (true === true) {
             document.write('<img src="https://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=WHMJ70XYD6ZR2TSDJGT9&js=1" />');
           };
          </script><noscript>
            <img src="https://fls-na.amazon.com/1/oc-csi/1/OP/requestId=WHMJ70XYD6ZR2TSDJGT9&amp;js=0">
</noscript>
        </div>
    </div>
    <script>
    if (true === true) {
        var elem = document.createElement("script");
        elem.src = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";
        document.getElementsByTagName('head')[0].appendChild(elem);
    }
    </script>
</body>
</html>
coderLMN commented 6 years ago

I already mentioned this problem in the translator's note at the bottom of p. 331: as the business and the technology evolve, web pages are frequently redesigned and their structure changes, so you cannot copy the book's code verbatim. Analyze the page yourself first, then write the XPath expressions to extract the data.
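For instance, a minimal sketch of that kind of inspection, reusing the parsedFirstSearchPage object from your snippet (it only lists what is there; the XPath you end up writing depends on the current page):

# list the class attributes actually carried by <a> tags in the fetched page,
# then write the XPath against what is really there, not against the book
head(unique(unlist(xpathApply(parsedFirstSearchPage, "//a[@href]", xmlGetAttr, "class"))))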

nuomijii commented 6 years ago

@coderLMN I've updated the parts of the code affected by the page changes, but when I run it, the first restrictedSearchPageLink comes back NULL.

library(stringr)
library(XML)
library(RCurl)
baseURL <- "https://www.amazon.com/s/rh=n%3A2407749011%2Ck%3A&keywords="
keyword <- "Apple"
url <- str_c(baseURL, keyword)
firstSearchPage <- getURL(url)
parsedFirstSearchPage <- htmlParse(firstSearchPage)
# note: XPath attribute names are case-sensitive, so @class, not @Class
xpath <- str_c('//a[@class="a-link-normal a-text-normal" and text()="', keyword, '"]/../@href')
xpath
[1] "//a[@class=\"a-link-normal a-text-normal\" and text()=\"Apple\"]/../@href"
restrictedSearchPageLink <- xpathApply(parsedFirstSearchPage, xpath)
restrictedSearchPageLink
NULL
coderLMN commented 6 years ago

Take a look at parsedFirstSearchPage: right under the <body> tag there is a comment:

<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at 
https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at 
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac 
for advertising use cases.
-->

In other words, Amazon no longer allows automated scraping of its pages; to read its data through an API you have to apply for access. If you read further down, the page contains very little: it is just a CAPTCHA for fending off bots.
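A quick way to confirm in R that you were served this robot-check page instead of real results (a sketch; the string is taken from the comment above):

# TRUE means Amazon returned the anti-bot notice rather than search results
grepl("api-services-support@amazon.com", firstSearchPage, fixed = TRUE)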

In this situation you can try Selenium to fetch these pages, or skip ahead to p. 338 and download the amazonProductInfo.db file that the original book provides on GitHub: https://github.com/crubba/Wiley-ADCR/blob/master/ch-16-amazon/amazonProductInfo.db, then continue with the analysis from there.
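If you take the database route, a minimal sketch for opening the downloaded file (assuming it sits in your working directory; list the tables first, since their names are whatever the book's scripts created):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "amazonProductInfo.db")
dbListTables(con)                             # see which tables the file contains
head(dbReadTable(con, dbListTables(con)[1]))  # peek at the first one
dbDisconnect(con)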

nuomijii commented 6 years ago

@coderLMN Could you explain in more detail how to use Selenium for this? Or is there a site similar to Amazon that I could use instead? Thanks.

coderLMN commented 6 years ago

For the specifics of using Selenium, see Section 9.1.9. As for a similar site, check whether other e-commerce sites work for your purpose.
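For a flavor of the Selenium route, here is a minimal RSelenium sketch, not the book's exact code; it assumes a locally installed browser and driver (Section 9.1.9 covers the setup), and the search URL is a guess at the current format:

library(RSelenium)
rD <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server
remDr <- rD$client
remDr$navigate("https://www.amazon.com/s?k=Apple")
pageSource <- remDr$getPageSource()[[1]]              # HTML after JavaScript has run
parsed <- XML::htmlParse(pageSource, encoding = "UTF-8")
remDr$close()
rD$server$stop()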

nuomijii commented 6 years ago

@coderLMN I found another e-commerce site, but after downloading and parsing the 250 product pages in chunks, extracting the prices gives me only 145 values, 105 short. What could be the reason?

url<-"https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=Apple"
firstSearchPage <- getURL(url)
SearchPages           <- list()
SearchPages[[1]]      <- firstSearchPage
#下一页链接的Xpath表达式
xpath           <- "//a[@class='x-pagination__control' and @rel='next']/@href"
nextPageLink <- xpathApply( htmlParse(SearchPages[[1]]), xpath)
nextPageLink
#提取下一页的链接
for( i in 2:5 ){
    nextPageLink <- xpathApply( htmlParse(SearchPages[[i-1]]), xpath)
    SearchPages[[ i ]] <- getURL(nextPageLink)
  }
#提取产品信息
#提取标题
xpathApply( htmlParse(SearchPages[[1]]), "//h3[@class='s-item__title']", xmlValue)[1:2]
extractTitle <- function(x){
  unlist(xpathApply( htmlParse(x), "//h3[@class='s-item__title']", xmlValue))
}
titles <- unlist(lapply(SearchPages, extractTitle))
#提取链接
xpathApply( htmlParse(SearchPages[[1]]), "//a[@class='s-item__link']/@href")[1:2]
extractLink <- function(x){
  unlist(xpathApply( htmlParse(x), "//a[@class='s-item__link']/@href"))
}
names <- unlist(lapply(SearchPages, extractLink))
#去除空格和href
links <- unlist(str_extract_all(links,"http.+"))
links[1:5]
#解析所有页面(采集产品页面)
chunk <- function(x,n) split(x, ceiling(seq_along(x)/n))
Links <- chunk(links,10)
curl  <- getCurlHandle()
ProductPages  <- list()
counter <- 1 
for(i in 1:length(Links)){
    ProductPages <- c(ProductPages, getURL(Links[[i]],curl = curl))
    Sys.sleep(0.5)
}
ParsedProductPages <- lapply(ProductPages, htmlParse)
length(ProductPages)
#产品价格
xpathApply( ParsedProductPages[[50]], '//span[@id="convbidPrice"]', xmlValue )

extractPrice <- function(x){
  x <- xpathApply( x, '//span[@id="convbidPrice"]', xmlValue )
  x <- unlist(x)
  return(x)
}
prices <- unlist(lapply(ParsedProductPages, extractPrice))
length(prices)  ##为什么出来的只有145个数据,缺少105个
names(prices) <- NULL
prices

> length(ProductPages)
[1] 250
> length(prices)
[1] 145
coderLMN commented 6 years ago

Take a careful look at the contents of ProductPages and work out why the prices aren't coming through; the XPath may be the problem.
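As a concrete starting point, a diagnostic sketch building on the variables above (the alternative id "prcIsum" is a guess at another price element eBay uses, not something from the book):

# which parsed pages yield no match for the price XPath?
noPrice <- which(sapply(ParsedProductPages, function(p)
    length(xpathApply(p, '//span[@id="convbidPrice"]', xmlValue)) == 0))
length(noPrice)  # should account for the 105 missing values
# inspect one failing page and try another candidate id:
xpathApply(ParsedProductPages[[noPrice[1]]], '//span[@id="prcIsum"]', xmlValue)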