JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, and Understat
https://jaseziv.github.io/worldfootballR/

fb_advanced_match_stats: bug with older games #174

Status: Closed (endp01 closed this issue 1 year ago)

endp01 commented 2 years ago

When trying to scrape certain games I get this error message:

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 2, 27

You can try to reproduce it with this function I used:

fb_advanced_match_stats(match_url="https://fbref.com/en/matches/b5a4e26d/Tottenham-Hotspur-Sheffield-United-January-21-2015-League-Cup", stat_type="summary", team_or_player="player", time_pause = 4)

From what I can see, the table with the stats should be matched by the regex "summary$", so I'm not sure where the function fails. Interestingly, fb_match_summary() works with the same match URL.

Anyways, thanks for your great work on the package!

endp01 commented 2 years ago

Reason

After some testing I found out why the error happens.

  1. For matches with two legs (e.g. Champions League quarter-finals), the internal function .get_match_report_page in R/get_match_report.R returns two rows for a single game. The League value is correct in the first row, but the second row holds the result of the other leg instead. This is due to FBref's formatting, where the second leg is linked in the same <div> as the league (get_match_report.R, line 25).
  2. This is only a minor issue (just a wrong League entry) until you scrape a game in which an odd number of players entered the pitch (due to substitutions). When this happens, cbind fails at line 150 of get_advanced_match_stats.R.

Example with an odd number of players (31): https://fbref.com/en/matches/e6066ef0/Sporting-CP-Manchester-City-February-15-2022-Champions-League
This one throws an error when fb_advanced_match_stats() is called on it.

match_page <- .load_page("https://fbref.com/en/matches/e6066ef0/Sporting-CP-Manchester-City-February-15-2022-Champions-League")
match_report <- .get_match_report_page(match_page = match_page)

Example with an even number of players (28): https://fbref.com/en/matches/8a8090f1/Manchester-United-Roma-April-29-2021-Europa-League
This one works with fb_advanced_match_stats() but produces wrong entries in the League column.

match_page <- .load_page("https://fbref.com/en/matches/8a8090f1/Manchester-United-Roma-April-29-2021-Europa-League")
match_report <- .get_match_report_page(match_page = match_page)
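
The odd/even difference described above looks consistent with how base R's data.frame() recycles rows. The sketch below uses made-up stand-in tables (the column names and values are assumptions, not the package's real output) to show a two-row report being recycled silently against 28 rows and rejected against 31:

```r
# Stand-in for the two-row match report (the second row is the spurious one).
# Column names and values are illustrative, not the package's real output.
match_report <- data.frame(
  League = c("league name", "result of the other leg"),
  stringsAsFactors = FALSE
)

# Stand-ins for the player stats tables: 28 rows (even) and 31 rows (odd).
stats_even <- data.frame(Player = paste0("player_", 1:28))
stats_odd  <- data.frame(Player = paste0("player_", 1:31))

# 28 is a multiple of 2, so the two report rows are recycled silently,
# which puts the wrong League value on every second row.
even_result <- cbind(match_report, stats_even)

# 31 is not a multiple of 2, so recycling is impossible and cbind() stops.
odd_result <- tryCatch(
  cbind(match_report, stats_odd),
  error = function(e) conditionMessage(e)
)

nrow(even_result)            # number of rows after recycling
unique(even_result$League)   # both League values are present
odd_result                   # the captured error message
```

If this is indeed the mechanism, reducing the match report to a single row would address both the error and the wrong League entries.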

Fix

My suggestion would be to change the .get_match_report_page function in get_match_report.R (line 25) to scrape only the first occurrence of the <a> tag within the <div>. I'm not sure that's the cleanest way to go about it, but it worked on my example above.

# Assign the value of tryCatch(); an assignment inside the handler stays local to it.
League <- tryCatch(
  each_game_page %>%
    rvest::html_nodes("h1+ div a:nth-child(1)") %>%
    rvest::html_text(),
  error = function(e) NA
)
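
The effect of the narrower selector can be checked on a small inline HTML fragment. The markup below is a hypothetical approximation of the header block (an <h1> followed by a <div> holding two links), not FBref's actual page source, so treat this as a sketch of the selector behaviour only:

```r
library(rvest)

# Hypothetical markup: the league link and the link to the other leg
# share the same <div> directly after the <h1>.
html <- '<html><body>
<h1>match title</h1>
<div><a href="/comp">league name</a> <a href="/other-leg">other leg result</a></div>
</body></html>'

page <- read_html(html)

# Broad selector: matches every <a> inside the <div>, so two values come back.
all_links <- page %>% html_nodes("h1 + div a") %>% html_text()

# Narrow selector from the suggested fix: only an <a> that is the first
# child element of its parent, so a single value comes back.
first_only <- page %>% html_nodes("h1+ div a:nth-child(1)") %>% html_text()

all_links
first_only
```

An alternative with the same intent would be html_node() (singular), which returns only the first match for a selector.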

JaseZiv commented 2 years ago

Hi @endp01, sorry for the late reply!

I will investigate this and look to have a fix in the next few days.

Thanks for raising and offering a possible solution.