beanumber / openWARData

An R package consisting of data associated with openWAR
7 stars 1 forks source link

updated fWAR to include non-qualifiers also #4

Closed davidbmitchell closed 9 years ago

davidbmitchell commented 9 years ago

Okay so the pull request I submitted yesterday had one problem. The FanGraphs WAR only contained batters that had enough atbats to qualify for the batting title and pitchers that had pitched enough innings to qualify for the ERA title. So obviously that means a lot players were left out. The changes contains all players.

beanumber commented 9 years ago

Thanks!

beanumber commented 9 years ago

BTW, how did you do this? Is there a way to download this data automatically? Or does it require going to the website and submitting a search form?

davidbmitchell commented 9 years ago

I had to manually submit the search form on FanGraphs. I spent a lot of time trying to automatically download it but I couldn't figure it out because the data is on a paginated table. If it wasn't you could write a function using a query string to scrape the data like you can with baseball-reference's data.

davidbmitchell commented 9 years ago

I'm gonna continue to try and figure it out.

davidbmitchell commented 9 years ago

After a lot of playing around with different ideas, I have written a function that can grab FanGraphs WAR data by year. I am going to submit a pull request so it can be added but here is the code.

getfWAR = function(yyyy) {
    # Get the batters fWAR first
    url = paste("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=6&season=",
                            yyyy,"&month=0&season1=",yyyy,
                            "&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_10000",
                            sep="")
    # The scrape function from the scrapeR package makes this so much easier
    doc = scrapeR::scrape(url)
    # Find the FanGraphs leaderboard table which holds the data we want
    data = XML::xpathSApply(doc[[1]], '//table[@id="LeaderBoard1_dg1_ctl00"]')
    # This stores the batters' fWAR table. We have specify the classes of each
    # column because R wants to turn them all into factors
    bat = XML::readHTMLTable(data[[1]], colClasses = c("integer", "character", "factor", "numeric", "numeric", "numeric", "numeric", "numeric",  "numeric", "numeric", "numeric", "numeric", "numeric", "character"))
    # We need to change the Base Running column to Base.Running
    colnames(bat)[5] <- "Base.Running"

    # Get the pitchers fWAR
    url = paste("http://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=6&season=",
                            yyyy,"&month=0&season1=",yyyy,
                            "&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_10000",
                            sep="")
    doc = scrapeR::scrape(url)
    data = XML::xpathSApply(doc[[1]], '//table[@id="LeaderBoard1_dg1_ctl00"]')
    pitch = XML::readHTMLTable(data[[1]], colClasses = c("integer", "character", "factor", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character"))

    # merge the bat and pitch fWAR tables
    all = merge(x=bat, y=pitch, by = c("Name"), all=TRUE)
    # append a yearId column to data
    all$yearId = yyyy

    # Download the crunchtimebaseball player id map because it is more current
    id.map = read.csv("http://crunchtimebaseball.com/master.csv")
    id = data.frame(Name = id.map$fg_name,  playerId = id.map$fg_id)
    # Merge all and id to add the FanGraphs player IDs to the data
    all = merge(all, id, by="Name")

    out = all %>%
        mutate(fRAA_bat = ifelse(is.na(Batting), 0, Batting) + ifelse(is.na(Positional), 0, Positional)) %>%
        mutate(fRAA_br = ifelse(is.na(Base.Running), 0, Base.Running)) %>%
        mutate(fRAA_field = ifelse(is.na(Fielding), 0, Fielding)) %>%
        mutate(fWAR_pitch = ifelse(is.na(WAR.y), 0, WAR.y)) %>%
        mutate(fRepl = ifelse(is.na(Replacement), 0, -Replacement)) %>%
        mutate(fRAR = ifelse(is.na(RAR.x), 0, RAR.x) + ifelse(is.na(RAR.y), 0, RAR.y)) %>%
        mutate(fRAA = fRAR + fRepl) %>%
        mutate(fWAR = ifelse(is.na(WAR.x), 0, WAR.x) + fWAR_pitch) %>%
        select(playerId, yearId, Name, fWAR, fRAA_bat, fRAA_br, fRAA_field,
                     fWAR_pitch, fRAR, fRAA, fRepl)
    return(out)
}

Notice the "&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_10000" in the url. The 10000 is what I manipulated from the FanGraphs html query string. This overrides the FanGraphs default size of 30 in the data tables.

I tried making the function possible for multiple years but I think FanGraphs servers won't allow it. You can do multiple years and cumulative WAR per player, and the FanGraphs server will let it through. However, I haven't added that functionality yet, but I will at a later date.

Because FanGraphs data is paginated and using AJAX requests, I think this is the only way to download data directly from them.