JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, Understat
https://jaseziv.github.io/worldfootballR/

Transfermarkt functions returning errors or blank output #278

cg86x closed 1 year ago

cg86x commented 1 year ago

Transfermarkt functions are returning no data or causing errors. Scraping transfermarkt with my own scripts encounters the same issues.

library(worldfootballR)
packageVersion("worldfootballR")

# team urls
league_one_teams <- tm_league_team_urls(start_year = 2020, league_url = "https://www.transfermarkt.com/league-one/startseite/wettbewerb/GB3")

Error in tm_league_team_urls(start_year = 2020, league_url = "https://www.transfermarkt.com/league-one/startseite/wettbewerb/GB3") : 
  object 'league_page' not found

## player bios

hazard_bio <- tm_player_bio(player_url = "https://www.transfermarkt.com/eden-hazard/profil/spieler/50202")

hazard_bio
data frame with 0 columns and 0 rows

## with my own code outside the package

player_url <- "https://www.transfermarkt.com/joao-pedro/marktwertverlauf/spieler/626724"
response <- httr::GET(player_url)
content <- httr::content(response, as = "text")

content
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<HTML><HEAD><META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=iso-8859-1\">\n<TITLE>ERROR: The request could not be satisfied</TITLE>\n</HEAD><BODY>\n<H1>403 ERROR</H1>\n<H2>The request could not be satisfied.</H2>\n<HR noshade size=\"1px\">\nRequest blocked.\nWe can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.\n<BR clear=\"all\">\nIf you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.\n<BR clear=\"all\">\n<HR noshade size=\"1px\">\n<PRE>\nGenerated by cloudfront (CloudFront)\nRequest ID: gZJT76spCNJO8PAvmOfwKOSBsY-BJxmozHBZeHYwkf1Q9u9wJJ2niA==\n</PRE>\n<ADDRESS>\n</ADDRESS>\n</BODY></HTML>"
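
Note that httr::GET() does not raise an error on a 403; the block page comes back as a normal response body. A minimal sketch, assuming httr is installed, that checks the status before reading the content. The helper name fetch_or_stop is hypothetical and not part of worldfootballR.

```r
# Hypothetical helper: fail loudly on 4xx/5xx instead of returning the error page.
fetch_or_stop <- function(url) {
  resp <- httr::GET(url)
  # http_error() is TRUE for any status code of 400 or above
  if (httr::http_error(resp)) {
    stop(
      "Request failed with HTTP status ", httr::status_code(resp), ": ", url,
      call. = FALSE
    )
  }
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

With the blocked URL above, this would stop with an error mentioning status 403 rather than handing the CloudFront page to downstream parsing code.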

cg86x commented 1 year ago

FWIW I confirmed via https://web.archive.org/ that the robots.txt file has not changed recently.

cg86x commented 1 year ago

Solution that seemed to work for me was adding a header so httr is not treated as a bot. Assume the same would work for rvest and worldfootballR functions but haven't confirmed.

https://stackoverflow.com/questions/72568624/r-programming-download-file-returning-403-forbidden-error
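
A minimal sketch of that workaround, assuming httr is installed. The helper name fetch_with_ua and the browser-style user-agent string are illustrative assumptions, not anything defined by worldfootballR, and the site may still block requests that send it.

```r
# Illustrative browser-style user agent (an assumption, not a package default).
browser_ua <- paste(
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "AppleWebKit/537.36 (KHTML, like Gecko)",
  "Chrome/102.0.5005.61 Safari/537.36"
)

# Hypothetical helper: GET a page while sending an explicit User-Agent header.
fetch_with_ua <- function(url, ua = browser_ua) {
  resp <- httr::GET(url, httr::user_agent(ua))
  # Turn 4xx/5xx responses into R errors instead of returning the error page
  httr::stop_for_status(resp)
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Only hit the network when run interactively.
if (interactive()) {
  html <- fetch_with_ua(
    "https://www.transfermarkt.com/joao-pedro/marktwertverlauf/spieler/626724"
  )
  substr(html, 1, 200)
}
```

Passing the user agent as an argument rather than fixing it inside the function keeps it easy to swap out if a particular string starts being blocked.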

tonyelhabr commented 1 year ago

this seems like it's probably the right solution. we've had some issues in the past with the default httr header being blocked, although I think it was for fbref before.

JaseZiv commented 1 year ago

Yeah spot on, it was for fbref... we also struggled with what to set the user agent to. The SO example provided uses 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36', but would we want to hard-code that under the hood of the tm_ functions @tonyelhabr?

tonyelhabr commented 1 year ago

I think we can use .load_page() to solve our problems. See #283.

JaseZiv commented 1 year ago

@cg86x can you confirm whether you're now able to use the tm_ functions? No changes were pushed to master, so there's no need to update your worldfootballR version.

JaseZiv commented 1 year ago

Closing this now as I'm no longer experiencing these issues using the most recent version (0.6.3.0010).