JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, and Understat
https://jaseziv.github.io/worldfootballR/

Error using fb_player_scouting_report() function #140

Closed benjaminrholmes closed 1 year ago

benjaminrholmes commented 2 years ago

Hello,

First time using the worldfootballR package and have come across an error using the example code in the docs:

CODE:

install.packages("worldfootballR")
library(worldfootballR)
library(dplyr)

scout <- fb_player_scouting_report(player_url = "https://fbref.com/en/players/d70ce98e/Lionel-Messi",
                                   pos_versus = "primary") %>%
               dplyr::filter(scouting_period == "Last 365 Days")

OUTPUT: Error in open.connection(x, "rb") (worldfootball.R#9): HTTP error 403.

Also sessionInfo() OUTPUT:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.2.0          dplyr_1.0.9          GGally_2.1.2         ggplot2_3.3.6        worldfootballR_0.5.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8         plyr_1.8.7         RColorBrewer_1.1-2 pillar_1.7.0       compiler_4.1.2    
 [6] tools_4.1.2        bit_4.0.4          gtable_0.3.0       lubridate_1.8.0    jsonlite_1.8.0    
[11] lifecycle_1.0.1    tibble_3.1.6       pkgconfig_2.0.3    rlang_1.0.2        DBI_1.1.2         
[16] cli_3.3.0          curl_4.3.2         parallel_4.1.2     withr_2.4.3        httr_1.4.2        
[21] stringr_1.4.0      janitor_2.1.0      xml2_1.3.3         generics_0.1.2     vctrs_0.4.1       
[26] hms_1.1.1          grid_4.1.2         bit64_4.0.5        tidyselect_1.1.2   reshape_0.8.9     
[31] snakecase_0.11.0   glue_1.6.1         R6_2.5.1           fansi_1.0.2        vroom_1.5.7       
[36] purrr_0.3.4        readr_2.1.2        tzdb_0.2.0         magrittr_2.0.2     scales_1.1.1      
[41] ellipsis_0.3.2     assertthat_0.2.1   rvest_1.0.2        colorspace_2.0-2   utf8_1.2.2        
[46] stringi_1.7.6      munsell_0.5.0      crayon_1.5.0      

Any help you can provide is much appreciated

Cheers

JaseZiv commented 2 years ago

Hi, as defined by a Google search:

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

It would appear you have been blocked from accessing their servers for being in violation of their terms (see here: https://www.sports-reference.com/bot-traffic.html).

I think if you give it some time, you'll be allowed to scrape again. Remember to be mindful and ensure your time_pause number is set sufficiently high in all FBref functions.
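As a rough illustration of the pacing advice above, the sketch below spaces out a series of calls. The helper name fetch_politely and its arguments are hypothetical and not part of worldfootballR; the FBref functions take their own time_pause argument, and raising that is the practical fix.

```r
# Hypothetical helper (not part of worldfootballR): call `fetch` once per
# input, sleeping `pause` seconds before each call to space requests out.
fetch_politely <- function(inputs, fetch, pause = 6) {
  stopifnot(is.function(fetch), is.numeric(pause), pause >= 0)
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    Sys.sleep(pause)
    results[[i]] <- fetch(inputs[[i]])
  }
  results
}

# Offline demonstration with a stand-in fetcher and no delay.
demo <- fetch_politely(c("a", "b"), fetch = toupper, pause = 0)
```

With worldfootballR itself, the equivalent is to pass a larger value directly, for example time_pause = 6 in the fb_player_scouting_report() call, assuming the installed version exposes that argument.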

I will close this issue now as it's not related to the functioning of the library. Reach out if there's anything else though.

benjaminrholmes commented 2 years ago

Hi,

I did think that was the case. However, my own Python scripts that scrape FBref seem to be fine, which made me doubt it was an over-request issue. I also waited 24 hours after I last ran the fb_player_scouting_report function and still had no luck. I will wait a few more days and retry.

Thank you for your help.

artiebits commented 1 year ago

Hi @JaseZiv, I have the same issue with functions that extract data from FBref. I've tried different networks and different laptops, but they throw HTTP error 403 anyway. I haven't used worldfootballR since the end of the 2021-2022 season, so it's hard to believe I violated their data scraping terms. It worked fine for the entire season, including the very last day of it, but now it doesn't.

@benjaminrholmes does it work for you now?

JaseZiv commented 1 year ago

@artiebits Can you please send through the code that you used to get the 403?

artiebits commented 1 year ago

Thanks for reopening the issue and investigating it.

library(worldfootballR)
library(lubridate)
library(dplyr)

countries <- c("ENG", "ESP")

for (country in countries) {
  print(paste("Getting data for", country))

  data <- get_match_results(country = country, gender = "M", season_end_year = 2010:2022)

  fixture <- data %>%
    filter(Date >= lubridate::today()) %>%
    select(Date, Time, Home, Away)

  history <- data %>%
    filter(Date < lubridate::today()) %>%
    select(Date, Home, Away, HomeGoals, AwayGoals)

  write.csv(fixture, paste0("data/", country, "-fixture.csv"))
  write.csv(history, paste0("data/", country, ".csv"))
}

print("All data downloaded")

JaseZiv commented 1 year ago

Two more things...

What version of the library are you using?

Additionally, can you paste in the output you get from running this line of code: httr::GET("http://httpbin.org/user-agent")

artiebits commented 1 year ago

The version is 0.5.7.

The output:

Response [http://httpbin.org/user-agent]
  Date: 2022-08-03 06:09
  Status: 200
  Content-Type: application/json
  Size: 61 B
{
  "user-agent": "libcurl/7.79.1 r-curl/4.3.2 httr/1.4.3"
}

JaseZiv commented 1 year ago

Before you run any of the FBref functions, add this to the start of your script (or substitute a user agent that matches your own system):

httr::set_config(httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"))

I don't know that this will help, but it's worth a shot. Otherwise, I suspect your IP has been blocked more permanently than they say on their site?

artiebits commented 1 year ago

Unfortunately, it doesn't help. If I place httr::GET("http://httpbin.org/user-agent") after the code you proposed, then I see that my user agent has changed. However, I still get the same error :/
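For anyone repeating this check, here is a minimal sketch of the same verification as reusable code. The helper names is_default_httr_ua and reported_user_agent are hypothetical. Note that httr::set_config() only applies to requests made through httr; a later comment in this thread indicates the scraping function reads pages with xml2::read_html, which may be why the change shows up in the httpbin check but does not affect the 403. That explanation is an assumption, not something confirmed here.

```r
# Pure helper: TRUE when a user-agent string looks like httr's default,
# e.g. "libcurl/7.79.1 r-curl/4.3.2 httr/1.4.3" from the output above.
is_default_httr_ua <- function(ua) {
  grepl("^libcurl/.* httr/", ua)
}

# Network helper: ask httpbin which user agent it received.
# Needs the httr package and internet access, so it is defined here
# but not called.
reported_user_agent <- function() {
  resp <- httr::GET("http://httpbin.org/user-agent")
  httr::stop_for_status(resp)
  httr::content(resp, as = "parsed", type = "application/json")[["user-agent"]]
}

# Offline checks against the two strings that appear in this thread.
default_check <- is_default_httr_ua("libcurl/7.79.1 r-curl/4.3.2 httr/1.4.3")
browser_check <- is_default_httr_ua("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)")
```

Running is_default_httr_ua(reported_user_agent()) before and after httr::set_config() shows whether the override reached httr.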

JaseZiv commented 1 year ago

Yeah this then looks like it could be a flat ban on your IP... you might have to reach out to them to see if you can get it lifted?

matheussrod commented 1 year ago

Hi guys. I had the same issue and tried the suggestions above without success. However, the HTTP 403 only appears when I'm using VSCode (running the same code in RStudio works fine). I guess the problem is in the VSCode extension and not with the IP address.

JaseZiv commented 1 year ago

> Hi guys. I had the same issue and tried the suggestions above without success. However, the HTTP 403 only appears when I'm using VSCode (running the same code in RStudio works fine). I guess the problem is in the VSCode extension and not with the IP address.

You could be right... I find that when I run some functions locally in RStudio, they run fine, but when I run the same functions in GitHub Actions, I get 403s...

oliverp6 commented 1 year ago

Hey everyone, same 403 issue with get_team_match_results() for FBref. It works perfectly in RStudio running on a Linux server, but once the task is scheduled using cron it returns 403, despite editing the user agent. The same happens in GitHub Actions, and I also tried running it from a Databricks cluster: all 403. I imagine any deployed Shiny app using these functions would also fail.
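The thread has not established why RStudio behaves differently at this point. One way to compare environments is to print a few settings in each of them (RStudio, VSCode, cron, GitHub Actions) and look for differences. The helper name describe_r_environment is hypothetical, and the idea that the default user agent is the relevant difference is an assumption.

```r
# Collect a few facts that can differ between RStudio and other runners.
describe_r_environment <- function() {
  list(
    # R's global default user agent; some HTTP clients fall back to it.
    http_user_agent = getOption("HTTPUserAgent"),
    # RStudio sets the RSTUDIO environment variable to "1".
    in_rstudio = identical(Sys.getenv("RSTUDIO"), "1"),
    # FALSE under Rscript, cron jobs, and most CI runners.
    interactive = interactive()
  )
}

env_info <- describe_r_environment()
str(env_info)
```

Comparing the output from a working and a failing environment narrows down which setting changes between them.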

oliverp6 commented 1 year ago

FYI, I found a hacky workaround for this error (for me at least). It does seem to be a user-agent issue, but I can only get one user agent to work: "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)"

In the get_team_match_results function, I edited the section that uses xml2::read_html and replaced it with an rvest::html_session that uses the RStudio Desktop user agent, then piped that to read_html(). That seems to have solved my issue. Hopefully it works for you guys too!

# Patched copy of get_team_match_results(); the pipe comes from magrittr.
library(magrittr)

get_team_match_results <- function(team_url, time_pause = 3) {
    time_wait <- time_pause
    get_each_team_log <- function(team_url, time_pause = time_wait) {
        pb$tick()
        Sys.sleep(time_pause)
        # Send the RStudio Desktop user agent instead of the default one.
        ua <- httr::user_agent("RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
        # html_session() is named session() in rvest >= 1.0.0.
        team_page <- rvest::html_session(team_url, ua) %>% xml2::read_html()
        # Team name is the last URL segment, minus "-Stats" and hyphens.
        team_name <- sub(".*\\/", "", team_url) %>% gsub("-Stats", 
            "", .) %>% gsub("-", " ", .)
        opponent_names <- team_page %>% rvest::html_nodes(".left:nth-child(10) a") %>% 
            rvest::html_text()
        team_log <- team_page %>% rvest::html_nodes("#all_matchlogs") %>% 
            rvest::html_nodes("table") %>% rvest::html_table() %>% 
            data.frame()
        team_log$Opponent <- opponent_names
        team_log <- team_log %>% dplyr::mutate(Team_Url = team_url, 
            Team = team_name) %>% dplyr::select(.data$Team_Url, 
            .data$Team, dplyr::everything(), -.data$Match.Report)
        team_log <- team_log %>% dplyr::mutate(Attendance = gsub(",", 
            "", .data$Attendance) %>% as.numeric(), GF = as.character(.data$GF), 
            GA = as.character(.data$GA))
        return(team_log)
    }
    pb <- progress::progress_bar$new(total = length(team_url))
    all_team_logs <- team_url %>% purrr::map_df(get_each_team_log)
    # Return the combined logs explicitly.
    return(all_team_logs)
}
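The core of the workaround can also be isolated into a small helper, sketched here under the assumption that rvest >= 1.0.0 (where session() replaces html_session()) and httr are installed. The name read_page_with_ua is hypothetical and not part of worldfootballR, and nothing in this thread guarantees that any particular user agent will be accepted.

```r
# Hypothetical helper: read a page through an rvest session that sends
# an explicit user agent. Assumes rvest >= 1.0.0 and httr are installed.
read_page_with_ua <- function(page_url, ua, pause = 3) {
  stopifnot(is.character(page_url), length(page_url) == 1,
            is.character(ua), length(ua) == 1)
  Sys.sleep(pause)  # wait before the request to keep traffic light
  page_session <- rvest::session(page_url, httr::user_agent(ua))
  rvest::read_html(page_session)
}
```

The function is only defined here, not called, since calling it needs network access. It would take a team URL and the user-agent string quoted above, and return a parsed page ready for rvest::html_nodes().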

JaseZiv commented 1 year ago

Hi all,

Hoping this issue has been resolved for a lot of the fbref functions as of version 0.5.12.3000. The fix hasn't been implemented for functions.

Thanks to @oliverp6 for the inspiration and @tonyelhabr for the help implementing this!

Will keep this issue open for a little while to confirm things