JaseZiv / worldfootballR_data

Project holding various data for the worldfootballR R package

Transfermarkt league seasons appearing to be inconsistent for some leagues #14

Open JaseZiv opened 1 year ago

JaseZiv commented 1 year ago

The season values we extract from Transfermarkt appear inconsistent with the season labels displayed in the HTML for those same options.

The get_transfermarkt_metadata.R file uses the code below to get a list of available seasons for a particular league:

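library(magrittr)  # provides the %>% pipe used below

comp_url <- "https://www.transfermarkt.com/campeonato-brasileiro-serie-a/startseite/wettbewerb/BRA1"
league_page <- xml2::read_html(comp_url)

# Each <option> inside the season dropdown represents one season
seasons <- league_page %>% rvest::html_nodes(".chzn-select") %>% rvest::html_nodes("option")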

Which returns the following values:

# {xml_nodeset (27)}
# [1] <option selected value="2022">2023</option>\n
# [2] <option value="2021">2022</option>\n
# [3] <option value="2020">2021</option>\n
# [4] <option value="2019">2020</option>\n
# [5] <option value="2018">2019</option>\n
# [6] <option value="2017">2018</option>\n
# [7] <option value="2016">2017</option>\n
# [8] <option value="2015">2016</option>\n
# [9] <option value="2014">2015</option>\n
# [10] <option value="2013">2014</option>\n
# [11] <option value="2012">2013</option>\n
# [12] <option value="2011">2012</option>\n
# [13] <option value="2010">2011</option>\n
# [14] <option value="2009">2010</option>\n
# [15] <option value="2008">2009</option>\n
# [16] <option value="2007">2008</option>\n
# [17] <option value="2006">2007</option>\n
# [18] <option value="2005">2006</option>\n
# [19] <option value="2004">2005</option>\n
# [20] <option value="2003">2004</option>\n
# ...

To get the values we need, we use the below:

season_start_year <- c()
for (each_season in seasons) {
  # Keep the "value" attribute of each <option>, not its displayed text
  season_start_year <- c(season_start_year, xml2::xml_attrs(each_season)[["value"]])
}

Which gives us:

[1] "2022" "2021" "2020" "2019" "2018" "2017"
 [7] "2016" "2015" "2014" "2013" "2012" "2011"
[13] "2010" "2009" "2008" "2007" "2006" "2005"
[19] "2004" "2003" "2002" "2001" "2000" "1998"
[25] "1997" "1996" "1995"
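
To make the mismatch easier to spot, one option is to pull the `value` attribute and the displayed label side by side. The following is only a rough sketch, not what get_transfermarkt_metadata.R currently does: it runs on a small inline fragment copied from the output above (so it works offline), the names `season_lookup`, `url_value`, `label` and `offset` are made up for illustration, and it assumes the labels are single calendar years as they are for this league.

```r
# Inline fragment copied from the first rows of the output above
html <- '<select class="chzn-select">
  <option selected value="2022">2023</option>
  <option value="2021">2022</option>
  <option value="2020">2021</option>
</select>'

season_options <- rvest::html_nodes(xml2::read_html(html), ".chzn-select option")

# url_value is the attribute used in season URLs, label is the text shown on the page
season_lookup <- data.frame(
  url_value = rvest::html_attr(season_options, "value"),
  label = rvest::html_text(season_options, trim = TRUE),
  stringsAsFactors = FALSE
)

# A non-zero offset flags seasons where the label and the URL value disagree
# (assumes single-year labels; other label formats would need their own parsing)
season_lookup$offset <- as.integer(season_lookup$label) - as.integer(season_lookup$url_value)
season_lookup
```

For the fragment above every row has an offset of 1, which matches the 2023 label sitting on top of the 2022 value.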

This is fine; however, for the current Brasileiro Série A season (2023), the season URL uses 2022.

Users will need to be aware of this until we find a workaround that works for both 'correct' and 'incorrect' seasons...
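
In the meantime, one possible stopgap on the user side is to translate the season label into the value the URL expects before building any season URL. This is a sketch only: `resolve_season_value()` is a hypothetical helper, not a worldfootballR function, and the `bra1` table is typed in by hand from the values shown in this issue.

```r
# Hypothetical helper: map a displayed season label to the value used in the season URL
resolve_season_value <- function(season_lookup, label) {
  hit <- season_lookup$url_value[season_lookup$label == as.character(label)]
  if (length(hit) == 0) {
    stop("Season label not found: ", label)
  }
  hit[[1]]
}

# Label/value pairs copied from the Brasileiro Série A output in this issue
bra1 <- data.frame(
  label = c("2023", "2022", "2021"),
  url_value = c("2022", "2021", "2020"),
  stringsAsFactors = FALSE
)

# Look up the URL value for the season displayed as 2023
resolve_season_value(bra1, 2023)
```

Because the lookup goes through the label, the same call would also work for a league where label and value already agree, at the cost of needing the label/value table for each league.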