JaseZiv / worldfootballR

A wrapper for extracting world football (soccer) data from FBref, Transfermarkt, Understat
https://jaseziv.github.io/worldfootballR/

`tm_player_transfer_history()` failing because transfer histories are no longer available in the Transfermarkt HTML #342

Closed JaseZiv closed 7 months ago

JaseZiv commented 8 months ago

Without using some form of browser automation, player transfer histories are no longer able to be scraped by tm_player_transfer_history() in its current form.

Will open this issue and try to incorporate the work @tonyelhabr did using chromote to obtain certain FBREF data points.

tonyelhabr commented 7 months ago

> Without using some form of browser automation, player transfer histories are no longer able to be scraped by tm_player_transfer_history() in its current form.
>
> Will open this issue and try to incorporate the work @tonyelhabr did using chromote to obtain certain FBREF data points.

I tried out the chromote approach and found that I'm getting blocked upon loading a player URL.

session <- worldfootballR:::worldfootballr_chromote_session("https://www.transfermarkt.com/cristiano-ronaldo/profil/spieler/8198")
session$session$view()


I did find that there is an API call that we can make to get some of the transfer history elements, although I'm not sure how we'll get some things like from_country and to_country.

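library(worldfootballR)
library(httr)
#> Warning: package 'httr' was built under R version 4.2.3
# Reuse worldfootballR's user-agent option for the request header
headers <- c(
  `User-Agent` = getOption("worldfootballR.agent")
)

# Query the transfer-history endpoint for player ID 8198 (Cristiano Ronaldo)
res <- httr::GET(
  url = "https://www.transfermarkt.com/ceapi/transferHistory/list/8198",
  httr::add_headers(.headers = headers)
)
# Fail early if the request was blocked or otherwise unsuccessful
httr::stop_for_status(res)

# Parse the response body into a nested list and inspect the first two transfers
cont <- httr::content(res)
transfers <- cont$transfers
str(transfers[1:2], max.level = 2)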
#> List of 2
#>  $ :List of 12
#>   ..$ url               : chr "/cristiano-ronaldo/transfers/spieler/8198/transfer_id/4197140"
#>   ..$ from              :List of 7
#>   ..$ to                :List of 7
#>   ..$ futureTransfer    : int 0
#>   ..$ date              : chr "Jan 1, 2023"
#>   ..$ dateUnformatted   : chr "2023-01-01"
#>   ..$ upcoming          : logi FALSE
#>   ..$ season            : chr "22/23"
#>   ..$ marketValue       : chr "€20.00m"
#>   ..$ fee               : chr "-"
#>   ..$ showUpcomingHeader: logi FALSE
#>   ..$ showResetHeader   : logi FALSE
#>  $ :List of 12
#>   ..$ url               : chr "/cristiano-ronaldo/transfers/spieler/8198/transfer_id/4152208"
#>   ..$ from              :List of 7
#>   ..$ to                :List of 7
#>   ..$ futureTransfer    : int 0
#>   ..$ date              : chr "Nov 22, 2022"
#>   ..$ dateUnformatted   : chr "2022-11-22"
#>   ..$ upcoming          : logi FALSE
#>   ..$ season            : chr "22/23"
#>   ..$ marketValue       : chr "€20.00m"
#>   ..$ fee               : chr "-"
#>   ..$ showUpcomingHeader: logi FALSE
#>   ..$ showResetHeader   : logi FALSE
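
For illustration, the top-level fields shown in that output could be flattened into a data frame along these lines. This is only a sketch, not the package's implementation: `flatten_transfers()` and the output column names are hypothetical, the input is a hand-built mock of the two entries above, and the nested `from`/`to` lists are left out because their field names are not shown in the output.

```r
# Mock of two entries shaped like `cont$transfers` above (top-level fields only).
transfers <- list(
  list(
    url = "/cristiano-ronaldo/transfers/spieler/8198/transfer_id/4197140",
    dateUnformatted = "2023-01-01",
    upcoming = FALSE,
    season = "22/23",
    marketValue = "€20.00m",
    fee = "-"
  ),
  list(
    url = "/cristiano-ronaldo/transfers/spieler/8198/transfer_id/4152208",
    dateUnformatted = "2022-11-22",
    upcoming = FALSE,
    season = "22/23",
    marketValue = "€20.00m",
    fee = "-"
  )
)

# Hypothetical helper: one row per transfer, assuming every field is present.
flatten_transfers <- function(transfers) {
  rows <- lapply(transfers, function(x) {
    data.frame(
      # The transfer ID is the last path segment of the transfer URL
      transfer_id = sub(".*/transfer_id/", "", x$url),
      # `dateUnformatted` is already in ISO format, so as.Date() parses it directly
      transfer_date = as.Date(x$dateUnformatted),
      season = x$season,
      market_value = x$marketValue,
      fee = x$fee,
      upcoming = x$upcoming,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

transfer_df <- flatten_transfers(transfers)
print(transfer_df)
```

The value and fee columns are kept as the raw strings returned by the endpoint; converting them to numbers would need assumptions about formats beyond the two values visible here.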

tonyelhabr commented 7 months ago

After searching GitHub, I found that a Python package made a similar fix in the past 2 weeks. Here is their code for scraping history.

tonyelhabr commented 7 months ago

Oh, so I think we can still get the "extra info" from server-side loaded data. So we may actually be capable of returning the same data from the function as before.