JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, and Understat
https://jaseziv.github.io/worldfootballR/

get_match_summary() unable to pull valid game summaries #138

Closed. zecellomaster closed this issue 2 years ago

zecellomaster commented 2 years ago

Hi! I'm trying to pull the match summary of specific games to determine whether goals were scored in regular or extra time. The match in question is clearly a valid URL and has a summary, but when I try to run it, I see this response. Perhaps I am missing something, but just in case, here is what I did:

test <- get_match_report(match_url = "https://fbref.com/en/matches/e4b36b84/Portland-Timbers-New-York-City-FC-December-11-2021-Major-League-Soccer")
https://fbref.com/en/matches/e4b36b84/Portland-Timbers-New-York-City-FC-December-11-2021-Major-League-Soccer is not available
Error in `dplyr::left_join()`:
! Join columns must be present in data.
✖ Problem with `League_URL`.
Run `rlang::last_error()` to see where the error occurred.

Full disclosure, the above was run in the RStudio console and not sourced. I updated the package to 0.5.6 after the first time this error popped up, and since that version has the 3 second delay built in, I am hoping I have not been blocked by the website (though I doubt this as I am still able to access the page via my web browser). Below is my session info.

─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Mojave 10.14.6
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2022-06-19
 rstudio  2021.09.2+382 Ghost Orchid (desktop)
 pandoc   NA

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ! package        * version date (UTC) lib source
   assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
   brio             1.1.3   2021-11-30 [1] CRAN (R 4.1.0)
   cachem           1.0.6   2021-08-19 [1] CRAN (R 4.1.0)
   callr            3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
   cli              3.3.0   2022-04-25 [1] CRAN (R 4.1.2)
   crayon           1.5.1   2022-03-26 [1] CRAN (R 4.1.2)
   curl             4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
   DBI              1.1.2   2021-12-20 [1] CRAN (R 4.1.0)
   desc             1.4.0   2021-09-28 [1] CRAN (R 4.1.0)
   devtools       * 2.4.3   2021-11-30 [1] CRAN (R 4.1.0)
   dplyr            1.0.9   2022-04-28 [1] CRAN (R 4.1.2)
   ellipsis         0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
   fansi            1.0.3   2022-03-24 [1] CRAN (R 4.1.2)
   fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
   fs               1.5.2   2021-12-08 [1] CRAN (R 4.1.0)
   generics         0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
   glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
   hms              1.1.1   2021-09-26 [1] CRAN (R 4.1.0)
   httr             1.4.3   2022-05-04 [1] CRAN (R 4.1.2)
   janitor          2.1.0   2021-01-05 [1] CRAN (R 4.1.0)
   jsonlite         1.8.0   2022-02-22 [1] CRAN (R 4.1.2)
   lifecycle        1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
   lubridate        1.8.0   2021-10-07 [1] CRAN (R 4.1.0)
   magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.2)
   memoise          2.0.1   2021-11-26 [1] CRAN (R 4.1.0)
   pillar           1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
   pkgbuild         1.3.1   2021-12-20 [1] CRAN (R 4.1.0)
   pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
   pkgload          1.2.4   2021-11-30 [1] CRAN (R 4.1.0)
   prettyunits      1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
   processx         3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
   progress         1.2.2   2019-05-16 [1] CRAN (R 4.1.0)
   ps               1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
   purrr            0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
   R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
   readr            2.1.2   2022-01-30 [1] CRAN (R 4.1.2)
   remotes          2.4.2   2021-11-30 [1] CRAN (R 4.1.0)
   rlang            1.0.2   2022-03-04 [1] CRAN (R 4.1.2)
   rprojroot        2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
   rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.0)
   rvest            1.0.2   2021-10-16 [1] CRAN (R 4.1.0)
   sessioninfo      1.2.2   2021-12-06 [1] CRAN (R 4.1.0)
   snakecase        0.11.0  2019-05-25 [1] CRAN (R 4.1.0)
   stringi          1.7.6   2021-11-29 [1] CRAN (R 4.1.0)
   stringr          1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
   testthat         3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
   tibble           3.1.7   2022-05-03 [1] CRAN (R 4.1.2)
   tidyr            1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
   tidyselect       1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
   tzdb             0.3.0   2022-03-28 [1] CRAN (R 4.1.2)
   usethis        * 2.1.5   2021-12-09 [1] CRAN (R 4.1.0)
   utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
   vctrs            0.4.1   2022-04-13 [1] CRAN (R 4.1.2)
   withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
 V worldfootballR * 0.5.6   2022-06-20 [1] Github (JaseZiv/worldfootballR@b2d49ee) (on disk 0.5.6.1000)
   xml2             1.3.3   2021-11-30 [1] CRAN (R 4.1.0)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

 V ── Loaded and on-disk version mismatch.

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

This is also my first time writing up an issue on GitHub. I hope I did well.
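On the underlying goal of telling regular-time goals from extra-time goals: once a summary is available, the split can be made from the minute label of each event. The sketch below is only an illustration. The function name `classify_goal_period` and the label format (`"90+4"`, `"105"`) are assumptions, not something taken from the package's output.

```r
# Hypothetical helper: classify a goal's minute label as regular or extra time.
# Assumes labels such as "41", "90+4", "105" or "120+1", where the part before
# "+" is the base minute and the part after it is stoppage time.
classify_goal_period <- function(minute_label) {
  # Keep only the base minute (the text before an optional "+").
  base_minute <- as.integer(sub("\\+.*$", "", trimws(minute_label)))
  # Anything with a base minute above 90 falls in extra time.
  ifelse(base_minute > 90, "extra time", "regular time")
}

classify_goal_period(c("41", "90+4", "105", "120+1"))
```

Stoppage time at the end of the second half (`"90+4"`) is deliberately counted as regular time, because only the base minute is compared against 90.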

JaseZiv commented 2 years ago

Hi,

This is a perfectly written and formatted issue! Thanks so much!

As for the issue itself - I can't seem to recreate this. I wonder if you have been blocked.

Can you try again in a few hours and let me know if the problems persist? My environment is very similar to yours, so I'm wondering if you were rate limited for a short period.

You could also try another FBref function; if that works, you'll know you haven't been blocked.

Thanks!
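A more direct way to test for a block is to look at the HTTP status code of a single request. The sketch below assumes the httr package is installed; `describe_status` and `check_access` are hypothetical helper names, and treating 403 and 429 as signs of a block is a rule of thumb rather than anything the site documents.

```r
# Hypothetical helper: turn an HTTP status code into a rough diagnosis.
describe_status <- function(code) {
  if (code == 200) {
    "ok"
  } else if (code %in% c(403, 429)) {
    # 403 Forbidden and 429 Too Many Requests are common refusal codes.
    "likely blocked or rate limited"
  } else {
    "other problem"
  }
}

# Hypothetical helper: request a page once and describe the outcome.
# Connection-level failures (for example SSL errors) are reported separately,
# because they occur before the server returns any status code.
check_access <- function(url) {
  tryCatch(
    describe_status(httr::status_code(httr::GET(url))),
    error = function(e) paste("connection error:", conditionMessage(e))
  )
}

# check_access("https://fbref.com/en/")  # makes one real request; run sparingly
```

If the result is a connection error rather than a status code, the problem is more likely on the local machine than a block by the site.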

zecellomaster commented 2 years ago

Okay, so I have removed and reinstalled the package, restarted R, waited over a day, and tried a variety of functions. None of them seem to work 😱. The errors returned by the match data functions remain the same, while the team/season pages throw a different one:

pls_help_me <- fb_player_urls("https://fbref.com/en/squads/fd962109/Fulham-Stats")
Error in open.connection(x, "rb") : 
  SSL certificate problem: certificate has expired

I also keep getting 'echoes' of past warning messages, even when I am doing something completely different.

Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 3 (https://fbref.com/en/squads/fd962109/Fulham-Stats)

I really suspect that it has something to do with the package (or at least my use of it), because I recall that when I scraped data from FBref's pages too quickly, I was locked out of the site even in my browser (or the internet at my old place was just that bad).

I'm at my wits' end here. I used remove.packages() to uninstall worldfootballR, reinstalled it using devtools, then did the same again from CRAN when that didn't work. Could that be the issue? Here's the session info:

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.9          purrr_0.3.4          readr_2.1.2         
 [6] tidyr_1.2.0          tibble_3.1.7         ggplot2_3.3.6        tidyverse_1.3.1      worldfootballR_0.5.6

loaded via a namespace (and not attached):
 [1] cellranger_1.1.0 pillar_1.7.0     compiler_4.1.2   dbplyr_2.1.1     tools_4.1.2      gtable_0.3.0    
 [7] lubridate_1.8.0  jsonlite_1.8.0   lifecycle_1.0.1  pkgconfig_2.0.3  rlang_1.0.2      reprex_2.0.1    
[13] rstudioapi_0.13  DBI_1.1.2        cli_3.3.0        curl_4.3.2       haven_2.5.0      withr_2.5.0     
[19] httr_1.4.3       janitor_2.1.0    xml2_1.3.3       fs_1.5.2         generics_0.1.2   vctrs_0.4.1     
[25] hms_1.1.1        grid_4.1.2       tidyselect_1.1.2 snakecase_0.11.0 glue_1.6.2       R6_2.5.1        
[31] fansi_1.0.3      readxl_1.4.0     modelr_0.1.8     tzdb_0.3.0       magrittr_2.0.3   scales_1.2.0    
[37] backports_1.4.1  ellipsis_0.3.2   assertthat_0.2.1 rvest_1.0.2      colorspace_2.0-3 utf8_1.2.2      
[43] stringi_1.7.6    munsell_0.5.0    broom_0.7.12     crayon_1.5.1    

Apologies for continuing to use up your time here, but this whole thing is odd. I've used this very package before with no issues whatsoever.

Platon-7 commented 2 years ago

Hello, I am trying to use the data, and in my .Rmd file it stops at the first line, which is:

match_urls<- worldfootballR::get_match_urls(country = "ENG", gender = "M", season_end_year = c(2021:2022))

When I ran it in my local R console, though, it worked fine.

The reason I am writing this here is because I noticed we had a similar error message, mine is in the screenshot below. If you have any idea what the problem is it would be very helpful.

[Screenshot of the error message]

JaseZiv commented 2 years ago

Hi, as a quick Google search puts it:

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

It would appear you have been blocked from accessing their servers for being in violation of their terms (see here: https://www.sports-reference.com/bot-traffic.html).

I think if you give it some time, you'll be allowed to scrape again. Remember to be mindful and ensure your time_pause number is set sufficiently high in all FBref functions.
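As a rough illustration of spacing requests out, the sketch below wraps any fetching function in a loop that sleeps between calls. `fetch_with_pause` and its arguments are hypothetical names, and the 5-second default is an arbitrary choice, not a documented limit.

```r
# Hypothetical wrapper: apply `fetch_fun` to each URL with a pause in between.
fetch_with_pause <- function(urls, fetch_fun, pause_seconds = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch_fun(urls[[i]])
    # Wait before the next request, but not after the last one.
    if (i < length(urls)) Sys.sleep(pause_seconds)
  }
  results
}

# The FBref functions take the pause directly through time_pause, so a single
# call may look like this (assuming the installed version has the argument):
# match_urls <- worldfootballR::get_match_urls(
#   country = "ENG", gender = "M", season_end_year = 2022, time_pause = 5
# )
```

A wrapper like this is only needed when looping over many URLs by hand; for a single call, raising `time_pause` is the simpler option.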

JaseZiv commented 2 years ago

Interesting. I think this closed issue might give you some clues; the SSL certificate error is typically caused by an old OS: https://github.com/JaseZiv/worldfootballR/issues/83
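The certificate error is raised on the client while the connection is being set up, so it is a different kind of failure from a server-side refusal. A minimal sketch for telling the two apart from the error text is shown below. `diagnose_error` is a hypothetical helper, and matching on the message strings seen in this thread is an assumption that may not hold for other versions or platforms.

```r
# Hypothetical helper: sort an error message into a likely cause.
diagnose_error <- function(msg) {
  if (grepl("SSL certificate problem", msg, fixed = TRUE)) {
    # Raised locally during certificate verification.
    "certificate issue on this machine"
  } else if (grepl("403", msg, fixed = TRUE)) {
    # Returned by the server after the connection succeeded.
    "request refused by the server"
  } else {
    "unrecognised"
  }
}

diagnose_error("SSL certificate problem: certificate has expired")

# Versions of libcurl and its SSL backend used by R (needs the curl package):
# curl::curl_version()[c("version", "ssl_version")]
```

If the diagnosis points at the local machine, the OS and curl versions reported by `sessionInfo()` and `curl::curl_version()` are the first things to compare against a working setup.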

zecellomaster commented 2 years ago

Hello again,

~~ Do you know if the blocking period is extended if you try to get data whilst it is still active (since it's been almost 3 days for me)? Is it possible to be blocked on R, but still able to access pages on the browser and through Python (as I am currently)?

If the answer to the first question is false, then "a half day and then have access returned after a modest period of time" may mean quite a while. ~~

[Edit] Thanks also for the reply to my other comment, just saw it.

zecellomaster commented 2 years ago

Just saw your comment. I am also running Mojave. It will be a task and a half to update this thing due to limited space, but I'll see what happens and let y'all know if things change.

JaseZiv commented 2 years ago

Great questions... I'm not sure about the answer to Q1. As for Q2, that's absolutely possible: the request headers will be different, so the server will think it's a different requester.

zecellomaster commented 1 year ago

Quick update: I was able to update my OS to Monterey and the problem was solved. It indeed appears that the issue was that I was running it on Mojave. I was having so much fun I forgot to update this! 😅 Thanks for the advice and support!