dmi3kno / polite

Be nice on the web
https://dmi3kno.github.io/polite/

polite isn't polite enough? #22

Closed DataStrategist closed 5 years ago

DataStrategist commented 5 years ago

It seems my IP got banned for running the following code, even though I didn't see any contraindications in the bow() output.

library(polite)
library(rvest)
library(tidyverse)

session <- bow("https://www.azlyrics.com/b/beatles.html", force = TRUE)
session

result <- scrape(session) 

mainPage <- result %>%
  html_nodes(".album , #listAlbum a")

df <- tibble(text = mainPage %>% html_text(),
       link = mainPage %>% html_attr("href")) %>% 
  ## album names don't have links, so use that to identify them:
  mutate(album = ifelse(is.na(link),text, NA)) %>% 
  ## drag down from above:
  fill(album) %>% 
  ## and finally remove entries w/out link since we already have the album
  filter(!is.na(link)) %>% 
  ## repair the link
  mutate(link = gsub("\\.\\.", "https://www.azlyrics.com", link))

lyricsGetter <- function(x){
  print(x)
  Sys.sleep(5)
  x %>% bow %>% scrape %>% html_nodes("br+ div") %>% 
  ## only need first row
  head(1) %>% html_text
}

sample_n(df, 200) %>% pull(link) %>% map_chr(lyricsGetter)

Did I mess something up? I'm even waiting 5 seconds as per the bow() output...
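
As an aside, one pattern that avoids re-fetching robots.txt on every request is to bow() to the host once and nod() to each page inside that same session, letting polite enforce the delay instead of Sys.sleep(). A minimal sketch, assuming nod() is available in the installed polite version (the helper name and example path are made up):

library(polite)
library(rvest)
library(purrr)

## bow once to the host; polite remembers the robots.txt rules and crawl delay
host <- bow("https://www.azlyrics.com/")

lyricsGetter2 <- function(path) {
  host %>%
    nod(path) %>%              ## switch paths within the same session
    scrape() %>%
    html_nodes("br+ div") %>%
    head(1) %>%
    html_text()
}

## df$link would need to stay as paths relative to the host for nod(),
## e.g. "/lyrics/beatles/yesterday.html" (hypothetical)
## sample_n(df, 10) %>% pull(link) %>% map_chr(lyricsGetter2)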

here's my sessionInfo():

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_0.3.2     xml2_1.2.0      polite_0.1.0    forcats_0.3.0   stringr_1.3.1  
 [6] dplyr_0.8.0.1   purrr_0.3.1     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
[11] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        lubridate_1.7.4   here_0.1          lattice_0.20-38  
 [5] textshape_1.6.0   assertthat_0.2.0  rprojroot_1.3-2   digest_0.6.18    
 [9] utf8_1.1.4        R6_2.3.0          cellranger_1.1.0  plyr_1.8.4       
[13] backports_1.1.2   evaluate_0.10.1   httr_1.3.1        blogdown_0.8     
[17] pillar_1.3.1      rlang_0.4.0       curl_3.2          lazyeval_0.2.1   
[21] readxl_1.1.0      rstudioapi_0.10   data.table_1.11.8 Matrix_1.2-15    
[25] rmarkdown_1.11    tidytext_0.2.0    munsell_0.5.0     broom_0.5.2      
[29] compiler_3.5.2    janeaustenr_0.1.5 modelr_0.1.1      xfun_0.3         
[33] pkgconfig_2.0.2   htmltools_0.3.6   tidyselect_0.2.5  bookdown_0.7     
[37] fansi_0.4.0       crayon_1.3.4      withr_2.1.2       SnowballC_0.6.0  
[41] grid_3.5.2        nlme_3.1-137      jsonlite_1.6      gtable_0.2.0     
[45] magrittr_1.5      scales_1.0.0      tokenizers_0.2.1  cli_1.0.1        
[49] stringi_1.3.1     fs_1.2.6          robotstxt_0.6.2   syuzhet_1.0.4    
[53] ratelimitr_0.4.1  generics_0.0.2    tools_3.5.2       glue_1.3.0       
[57] hms_0.4.2         yaml_2.1.19       colorspace_1.3-2  memoise_1.1.0    
[61] knitr_1.20        haven_1.1.1       usethis_1.4.0    
dmi3kno commented 5 years ago

I got banned as well. I think their robots.txt is outdated. They might have changed the website structure but forgot to update robots.txt. The current file bans scraping of /lyricsdb without exceptions, but that folder does not seem to be in use anymore.
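
One way to inspect what the current rules actually say is the robotstxt package (which polite relies on). A minimal sketch, where the second path is just a hypothetical lyrics URL:

library(robotstxt)

## fetch and parse the site's current robots.txt
rt <- robotstxt(domain = "www.azlyrics.com")
rt$permissions   ## lists the Allow/Disallow entries, e.g. /lyricsdb/

## check whether specific paths are allowed for any user agent
paths_allowed(
  paths  = c("/b/beatles.html", "/lyrics/beatles/yesterday.html"),
  domain = "www.azlyrics.com",
  bot    = "*"
)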

DataStrategist commented 5 years ago

OK, I contacted the admins... let's see what they say. Closing.