Open BenoitLondon opened 3 months ago
RE: load functions. I like the idea of having a function to list possible cups, comps, and years. We'll have to think of the right way to automate it. Right now, it wouldn't be hard to write a function to list available country+gender+tier for a given data set.
DATA_REPO <- 'JaseZiv/worldfootballR_data'

get_possible_stashed_data <- function(tag, include_years = FALSE) {
  ## list every asset attached to the release `tag`
  raw <- piggyback::pb_list(DATA_REPO, tag = tag)
  grid <- raw |>
    tibble::as_tibble() |>
    dplyr::filter(
      tools::file_ext(file_name) == 'rds'
    ) |>
    dplyr::select(file_name) |>
    ## file names start with <COUNTRY>_<GENDER>_<TIER>_
    tidyr::separate_wider_regex(
      file_name,
      c(country = '^[A-Z]+', '_', gender = '[MF]', '_', tier = '1st|2nd', '_', extra = '.*$'),
      cols_remove = FALSE
    ) |>
    dplyr::select(
      file_name,
      country,
      gender,
      tier
    )

  if (isFALSE(include_years)) {
    return(grid |> dplyr::select(-file_name))
  }
  ## would have to read in files to identify years;
  ## until then, return the grid with the file names kept
  grid
}
possible_data <- get_possible_stashed_data(
  tag = 'fb_match_summary'
)
possible_data
#> # A tibble: 13 × 3
#>    country gender tier
#>    <chr>   <chr>  <chr>
#>  1 BRA     M      1st
#>  2 ENG     F      1st
#>  3 ENG     M      1st
#>  4 ENG     M      2nd
#>  5 ESP     M      1st
#>  6 FRA     M      1st
#>  7 GER     M      1st
#>  8 ITA     M      1st
#>  9 MEX     M      1st
#> 10 NED     M      1st
#> 11 POR     M      1st
#> 12 USA     F      1st
#> 13 USA     M      1st
It becomes more involved if you want to list seasons as well. Country, gender, and tier are easy to extract because they are encoded in the names of the stashed data files, but as of now the seasons are not stored in the file names, nor in a CSV anywhere. As things stand, you'd have to read in each data file and then extract the unique seasons. The data files can be slow to load, so this is not ideal.
I'd have to think of a robust solution to this.
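One possible direction, purely as a sketch: read each file once, record its unique seasons in a small index, and stash that index so that listing seasons later doesn't require loading the full data. The helper names below are hypothetical, `Season_End_Year` is an assumed column name that would need checking against the real files, and the toy file only stands in for a stashed data file.

```r
# Hypothetical helper: unique seasons found in one local .rds file.
# `season_col` is an assumed column name, not confirmed for every data set.
list_seasons_in_file <- function(path, season_col = "Season_End_Year") {
  df <- readRDS(path)
  if (!season_col %in% names(df)) {
    return(integer(0))
  }
  sort(unique(df[[season_col]]))
}

# Hypothetical helper: one row per (file_name, season) for a set of local files.
build_season_index <- function(paths, season_col = "Season_End_Year") {
  rows <- lapply(paths, function(path) {
    seasons <- list_seasons_in_file(path, season_col)
    data.frame(
      file_name = rep(basename(path), length(seasons)),
      season = seasons,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy file standing in for a stashed data file (made-up name and contents).
toy_path <- file.path(tempdir(), "ENG_M_1st_toy.rds")
saveRDS(
  data.frame(
    Season_End_Year = c(2021L, 2021L, 2022L),
    Home_Team = c("A", "B", "A")
  ),
  toy_path
)

season_index <- build_season_index(toy_path)
season_index
```

The slow read would then happen only when the index is rebuilt, e.g. whenever the data files themselves are updated, rather than every time someone asks which seasons exist.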
RE: mismatched names. Yes, I've seen this kind of thing with MLS team names, where they changed the name of a team at some point (e.g. 'Sporting Kansas City' -> 'Sporting KC'), either in the middle of a season or between seasons.
I'm not sure what the best general solution is for ensuring name consistency over time. Perhaps we could re-scrape data about a year after the matches occurred, assuming that names are no longer being changed at that point. Obviously this would take a lot of time. Perhaps there are shortcuts for checking self-consistency.
Maybe tying the names to a team ID? In the URLs on FBref, teams appear to have an ID.
For example, Grenoble Foot appears to have the ID 40aa7280:
https://fbref.com/en/squads/40aa7280/Grenoble-Foot-Stats
Maybe that can be scraped and used to track changes to names?
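As a rough sketch of that idea, the ID could be pulled out of a squad URL with a regular expression. The function name is hypothetical, and the pattern assumes the ID is always 8 lowercase hex characters directly after `/squads/`, which is inferred only from the single example above.

```r
# Hypothetical helper: extract the team ID from FBref squad URLs.
# Assumes IDs are 8 lowercase hex characters following "/squads/".
extract_fbref_team_id <- function(urls) {
  matches <- regmatches(urls, regexec("/squads/([0-9a-f]{8})/", urls))
  vapply(
    matches,
    function(m) if (length(m) == 2L) m[2] else NA_character_,
    character(1)
  )
}

team_ids <- extract_fbref_team_id(c(
  "https://fbref.com/en/squads/40aa7280/Grenoble-Foot-Stats",
  "not a squad url"
))
team_ids
```

URLs that don't match the assumed pattern come back as `NA`, so they can be flagged rather than silently dropped. Names could then be keyed on the ID, so a renamed team still groups with its earlier records.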
Competition list is outdated in the README for load_match_comp_results
Some competitions have changed name in the rds files, e.g.
English Football League Cup
is now EFL Cup
Copa America too, and the UEFA Euro comps.
More generally, it would be nice to have a function listing all the competitions/countries/league IDs that are available for each load function, so we could load the data programmatically.
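For the renamed competitions, one stopgap on the user side might be a small alias lookup applied before filtering. This is only a sketch: `comp_aliases` and `normalise_comp_name` are made-up names, and the table holds just the one rename mentioned here.

```r
# Hypothetical alias table: old competition name -> current name.
# Only the rename mentioned in this thread is included; extend as needed.
comp_aliases <- c("English Football League Cup" = "EFL Cup")

# Replace any known old name with its current name; leave other names untouched.
normalise_comp_name <- function(x, aliases = comp_aliases) {
  hit <- x %in% names(aliases)
  x[hit] <- unname(aliases[x[hit]])
  x
}

normalised <- normalise_comp_name(
  c("English Football League Cup", "EFL Cup", "FA Cup")
)
normalised
```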
Thanks for this great package!