Closed benjaminrholmes closed 1 year ago
Hi, as defined in a Google search:
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.
It would appear you have been blocked from accessing their servers for being in violation of their terms (see here: https://www.sports-reference.com/bot-traffic.html).
I think if you give it some time, you'll be allowed to scrape again. Remember to be mindful and ensure your time_pause number is set sufficiently high in all FBref functions.
I will close this issue now as it's not related to the functioning of the library. Reach out if there's anything else though.
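To illustrate the time_pause advice, here is a minimal sketch of what a pause between requests amounts to. The helper name throttled_map and its arguments are hypothetical and not part of worldfootballR; the real functions take care of this internally through their time_pause argument.

```r
# Hypothetical helper: call `fetch` on each URL, sleeping `pause` seconds
# before every request so the server is not hit in rapid succession.
throttled_map <- function(urls, fetch, pause = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(pause)                    # wait before each request
    results[[i]] <- fetch(urls[[i]])    # then fetch one page
  }
  results
}

# With the library itself, the equivalent is to raise time_pause, e.g.
# (team_urls is a placeholder for your own vector of FBref team URLs):
# logs <- worldfootballR::get_team_match_results(team_url = team_urls, time_pause = 10)
```

A larger pause makes a long scrape slower, but it lowers the chance of tripping the site's bot-traffic limits.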
Hi,
I did think that was the case; however, my own Python scripts scraping FBref seem to be fine, which made me doubt it was an over-request issue. I also waited 24 hours after I last executed the fb_player_scouting_report function and still had no luck. I will just wait a few more days and retry.
Thank you for your help.
Hi @JaseZiv, I have the same issue with functions that extract data from FBref. I've tried different networks and different laptops, but it throws HTTP error 403 anyway. I haven't used worldfootballR since the end of the 2021-2022 season, so it's hard to believe I violated their data-scraping terms. It worked fine the entire season, including its very last day, but now it doesn't.
@benjaminrholmes does it work for you now?
@artiebits Can you please send through the code that you used to get the 403?
Thanks for reopening the issue and investigating it.
library(worldfootballR)
library(lubridate)
library(dplyr)

countries <- c("ENG", "ESP")

for (country in countries) {
  print(paste("Getting data for", country))
  data <- get_match_results(country = country, gender = "M", season_end_year = 2010:2022)

  fixture <- data %>%
    filter(Date >= lubridate::today()) %>%
    select(Date, Time, Home, Away)

  history <- data %>%
    filter(Date < lubridate::today()) %>%
    select(Date, Home, Away, HomeGoals, AwayGoals)

  write.csv(fixture, paste0("data/", country, "-fixture.csv"))
  write.csv(history, paste0("data/", country, ".csv"))
}

print("All data downloaded")
Two more things...
What version of the library are you using?
Additionally, can you paste in the output you get from running this line of code:
httr::GET("http://httpbin.org/user-agent")
The version is 0.5.7.
The output:
Response [http://httpbin.org/user-agent]
Date: 2022-08-03 06:09
Status: 200
Content-Type: application/json
Size: 61 B
{
"user-agent": "libcurl/7.79.1 r-curl/4.3.2 httr/1.4.3"
}
Before you run any of the FBref functions, add this to the start of your script (or substitute the user agent if you're on another software):
httr::set_config(httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"))
I don't know that this will help, but it is worth a shot. Otherwise, I suspect your IP has been blocked more permanently than they say on their site.
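To confirm that the new user agent is actually being sent, something like the sketch below could be run after the set_config() call. The helper name echo_user_agent is hypothetical; it relies only on httr and on the httpbin endpoint already used above.

```r
# Hypothetical helper: ask an echo service which user agent httr sent.
# Returns the HTTP status code and the user-agent string the server saw.
echo_user_agent <- function(echo_url = "http://httpbin.org/user-agent") {
  resp <- httr::GET(echo_url)
  list(
    status = httr::status_code(resp),
    user_agent = httr::content(resp, as = "parsed")[["user-agent"]]
  )
}

# Usage (needs network access):
# httr::set_config(httr::user_agent("my-agent-string"))
# echo_user_agent()
```

Note that httr::set_config() only affects requests made through httr. Requests made another way, for example by opening a URL connection directly, would not pick it up.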
Unfortunately, it doesn't help. If I place httr::GET("http://httpbin.org/user-agent") after the code you proposed, then I see that my user agent has changed. However, I still get the same error :/
Yeah this then looks like it could be a flat ban on your IP... you might have to reach out to them to see if you can get it lifted?
Hi guys. I had the same issue; I tried the suggestions above without success. However, the HTTP 403 error only appears when I'm using VSCode (running the same code in RStudio works fine). I guess the problem is in the VSCode extension and not with the IP address.
You could be right... I find that when I run some functions in RStudio locally, they run fine, but when I run the same functions in GitHub Actions, I get 403s...
Hey everyone, same 403 issue with get_team_match_results() for FBref. It works perfectly in RStudio running on a Linux server, but once the task is scheduled using cron it returns 403, despite editing user_agent. It's exactly the same in GitHub Actions, and I also tried running from a Databricks cluster: all 403. I imagine any deployed Shiny app using these functions would also fail.
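One way to compare environments such as RStudio, VSCode, and cron is to look at the HTTPUserAgent option, which base R connections (url(), download.file()) use as their user agent. The sketch below is a diagnostic under the assumption that the failing functions open the page through a base connection rather than through httr; the helper names current_ua and with_user_agent are hypothetical.

```r
# Report the user agent that base R connections will send.
current_ua <- function() getOption("HTTPUserAgent")

# Hypothetical helper: evaluate `expr` with a temporary HTTPUserAgent,
# restoring the previous value afterwards (even if `expr` fails).
with_user_agent <- function(ua, expr) {
  old <- options(HTTPUserAgent = ua)
  on.exit(options(old), add = TRUE)
  force(expr)
}

# Usage: print current_ua() in each environment and compare the strings, or
# wrap a failing call, e.g. (team_urls is a placeholder):
# with_user_agent("some agent string",
#                 worldfootballR::get_team_match_results(team_urls))
```

If the strings differ between an environment that works and one that returns 403, that points at the user agent rather than the IP address.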
FYI, I found a hacky workaround for this error (for me at least). It does seem like a user-agent issue, but I can only get one user agent to work: "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)"
In the get_team_match_results function, I edited the section that uses xml2::read_html, replacing it with an rvest::html_session call that passes the RStudio Desktop user agent and piping the result to read_html(). That seems to have solved my issue. Hopefully it works for you guys too!
function (team_url, time_pause = 3)
{
  time_wait <- time_pause
  get_each_team_log <- function(team_url, time_pause = time_wait) {
    pb$tick()
    Sys.sleep(time_pause)
    # Workaround: request the page with the RStudio Desktop user agent
    ua <- httr::user_agent("RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
    team_page <- rvest::html_session(team_url, ua) %>% xml2::read_html()
    # Team name is the last URL segment, minus "-Stats", with spaces for dashes
    team_name <- sub(".*\\/", "", team_url) %>%
      gsub("-Stats", "", .) %>%
      gsub("-", " ", .)
    opponent_names <- team_page %>%
      rvest::html_nodes(".left:nth-child(10) a") %>%
      rvest::html_text()
    team_log <- team_page %>%
      rvest::html_nodes("#all_matchlogs") %>%
      rvest::html_nodes("table") %>%
      rvest::html_table() %>%
      data.frame()
    team_log$Opponent <- opponent_names
    team_log <- team_log %>%
      dplyr::mutate(Team_Url = team_url, Team = team_name) %>%
      dplyr::select(.data$Team_Url, .data$Team, dplyr::everything(), -.data$Match.Report)
    team_log <- team_log %>%
      dplyr::mutate(Attendance = gsub(",", "", .data$Attendance) %>% as.numeric(),
                    GF = as.character(.data$GF),
                    GA = as.character(.data$GA))
    return(team_log)
  }
  pb <- progress::progress_bar$new(total = length(team_url))
  all_team_logs <- team_url %>% purrr::map_df(get_each_team_log)
}
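The same idea can be pulled out into a small standalone helper. This is only a sketch and not what the library does: the name read_html_with_ua is hypothetical, and it uses httr::GET in place of rvest::html_session, which rvest 1.0.0 deprecated in favour of rvest::session.

```r
# Hypothetical helper: download a page with an explicit user agent and parse it.
read_html_with_ua <- function(url, ua) {
  resp <- httr::GET(url, httr::user_agent(ua))
  httr::stop_for_status(resp)  # turn 4xx/5xx responses (e.g. 403) into an R error
  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access; both arguments are placeholders):
# page <- read_html_with_ua(team_url, "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
```

The returned document can be passed to rvest::html_nodes() in the same way as team_page above.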
Hi all,
Hoping this issue has been resolved for a lot of the FBref functions as of version 0.5.12.3000. The fix hasn't been implemented for functions.
Thanks to @oliverp6 for the inspiration and @tonyelhabr for the help implementing this!
Will keep this issue open for a little while to confirm things
Hello,
First time using the worldfootballR package and have come across an error using the example code in the docs:
CODE:
OUTPUT:
Error in open.connection(x, "rb") (worldfootball.R#9): HTTP error 403.
Also sessionInfo() OUTPUT:
Any help you can provide is much appreciated
Cheers