JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, and Understat
https://jaseziv.github.io/worldfootballR/

Mismatch between available comps in Readme and in rds files #374

Open BenoitLondon opened 3 months ago

BenoitLondon commented 3 months ago

The competition list in the README for load_match_comp_results is outdated.

Some competitions have changed names in the rds files, e.g. English Football League Cup is now EFL Cup. The same applies to Copa America and the UEFA Euro competitions.

More generally, it would be nice to have a function listing all the competitions, countries, and league IDs that are available for each load function, so we could load the data programmatically.

Thanks for this great package!

tonyelhabr commented 2 months ago

RE: load functions. I like the idea of having a function to list possible cups, comps, and years. We'll have to think of the right way to automate it. Right now, it wouldn't be hard to write a function to list available country+gender+tier for a given data set.

DATA_REPO <- 'JaseZiv/worldfootballR_data'
get_possible_stashed_data <- function(tag, include_years = FALSE) {
  raw <- piggyback::pb_list(DATA_REPO, tag = tag)

  grid <- raw |> 
    tibble::as_tibble() |> 
    dplyr::filter(
      tools::file_ext(file_name) == 'rds'
    ) |> 
    dplyr::select(file_name) |> 
    tidyr::separate_wider_regex(
      file_name,
      c(country = '^[A-Z]+', '_', gender = '[MF]', '_', tier = '1st|2nd', '_', extra = '.*$'),
      cols_remove = FALSE
    ) |> 
    dplyr::select(
      file_name,
      country,
      gender,
      tier
    )

  if (isFALSE(include_years)) {
    return(grid |> dplyr::select(-file_name))
  }

  ## would have to read in files to identify years;
  ## until then, return the grid with file names so the caller can do that
  grid
}

possible_data <- get_possible_stashed_data(
  tag = 'fb_match_summary'
)
possible_data
#> # A tibble: 13 × 3
#>    country gender tier 
#>    <chr>   <chr>  <chr>
#>  1 BRA     M      1st  
#>  2 ENG     F      1st  
#>  3 ENG     M      1st  
#>  4 ENG     M      2nd  
#>  5 ESP     M      1st  
#>  6 FRA     M      1st  
#>  7 GER     M      1st  
#>  8 ITA     M      1st  
#>  9 MEX     M      1st  
#> 10 NED     M      1st  
#> 11 POR     M      1st  
#> 12 USA     F      1st  
#> 13 USA     M      1st 

It becomes more involved if you want to list seasons as well. As of now, we don't store seasons in a CSV anywhere, nor in the names of the stashed data files (the names do encode country, gender, and tier, which is why those are easy to extract). As things stand, you'd have to read in each data file and then extract the unique seasons. The data files can be slow to load, so this is not ideal.
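
To make the "read the file, then extract the unique seasons" step concrete, here is a minimal sketch. The helper name `list_seasons` and the column name `Season_End_Year` are assumptions for illustration, so check the actual column in the stashed file. The toy data frame stands in for whatever reading the rds file returns.

```r
# Sketch: list the seasons present in one stashed data set.
# `season_col` is an assumed column name, not confirmed by this thread.
list_seasons <- function(data, season_col = "Season_End_Year") {
  stopifnot(is.data.frame(data), season_col %in% names(data))
  sort(unique(data[[season_col]]))
}

# Toy stand-in for a data frame loaded from a stashed rds file.
toy <- data.frame(
  Season_End_Year = c(2022, 2022, 2023, 2021),
  Home = c("A", "B", "A", "C")
)

seasons <- list_seasons(toy)
print(seasons)
```

Running this once per file is where the slow load time comes from, which is why storing seasons somewhere cheaper to read would be the more robust fix.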

I'd have to think of a robust solution to this.

tonyelhabr commented 2 months ago

RE: mismatched names. Yes, I've seen this kind of thing with MLS team names, where they changed the name of a team at some point (e.g. 'Sporting Kansas City' -> 'Sporting KC'), either in the middle of a season or between seasons.

I'm not sure what the best general solution is for ensuring name consistency over time. Perhaps we could re-scrape data about a year after it occurred, assuming that names are no longer being changed at that point. Obviously this would take a lot of time. Perhaps there are shortcuts for checking self-consistency.
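
One cheap stopgap, short of re-scraping, would be a hand-maintained alias table that maps older names to the current one. This is only a sketch of that idea, not something the package does today. `name_aliases` and `normalize_names` are hypothetical names, and the two entries are the renames mentioned in this thread.

```r
# Hypothetical alias table: name seen in older data -> current name.
name_aliases <- c(
  "Sporting Kansas City" = "Sporting KC",
  "English Football League Cup" = "EFL Cup"
)

# Replace any known old name with its current name; leave the rest untouched.
normalize_names <- function(x, aliases = name_aliases) {
  hit <- x %in% names(aliases)
  x[hit] <- unname(aliases[x[hit]])
  x
}

cleaned <- normalize_names(c("Sporting Kansas City", "Sporting KC", "LA Galaxy"))
print(cleaned)
```

The obvious downside is that someone has to notice each rename and add it to the table by hand.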

DDE1989 commented 3 weeks ago

Maybe tie the names to a team ID? In FBref URLs, teams appear to have an ID.

For example, Grenoble Foot appears to have the ID 40aa7280:

https://fbref.com/en/squads/40aa7280/Grenoble-Foot-Stats

Maybe that can be scraped and used to track changes to names?
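
A rough sketch of how that could work, assuming the ID is always the path segment right after `/squads/` (this is inferred from the one URL above, not from any FBref documentation). The helper `extract_squad_id` is hypothetical, and every row of the toy table except the first is made up for illustration.

```r
# Pull the squad ID out of an FBref squad URL (assumed to follow "/squads/").
extract_squad_id <- function(url) {
  m <- regmatches(url, regexec("/squads/([^/]+)/", url))
  vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_, character(1))
}

# Toy data: the first row is from this thread, the others are hypothetical.
teams <- data.frame(
  url = c(
    "https://fbref.com/en/squads/40aa7280/Grenoble-Foot-Stats",
    "https://fbref.com/en/squads/40aa7280/Grenoble-Stats",
    "https://fbref.com/en/squads/00000000/Example-FC-Stats"
  ),
  name = c("Grenoble Foot", "Grenoble", "Example FC")
)

teams$squad_id <- extract_squad_id(teams$url)

# IDs that appear under more than one name are candidates for a rename.
names_per_id <- lapply(split(teams$name, teams$squad_id), unique)
renamed <- names_per_id[lengths(names_per_id) > 1]
print(renamed)
```

With the ID as the key, the display name can change over time without breaking joins across seasons.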