cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/

home_team_runs and away_team_runs returned NA #17

Closed vegas31 closed 10 years ago

vegas31 commented 10 years ago

I am using pitchRx and scrape to look at which pitches a pitcher uses, given the score of the game. To do this, I am looking at the home_team_runs and away_team_runs columns in the GameDay data that pitchRx/scrape provides. However, I am seeing a lot of NAs in my data even though the values are actually there when I search for 'home_team_runs' in the relevant XML file on gd2.mlb.com.

Here are my commands:

library(dplyr)
library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")

I was mostly interested in the WAS/SDN game, which returned all NA for Jordan Zimmermann. Other games on June 8, and games on other days, give mostly the same result, with only sporadic non-NA entries (see the attached screenshot of View(june8$atbat)).

I am on OS X 10.9.3, using pitchRx version 1.5 with RStudio version 0.98.501.

Happy to pass along any other info you need if I have forgotten anything -- many thanks!

Stuart

[Screenshot: output of View(june8$atbat), taken 2014-06-10]

cpsievert commented 10 years ago

That is to be expected -- these values are missing (in the source files) unless runs are scored during the atbat.

I admit this is not the best data format. You probably want the running totals (without NAs).
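As a rough illustration of what a running total could look like, here is a minimal sketch on made-up values (not scraped data). It assumes that a recorded value is the team's total after that at-bat and that the score is 0 before the first recorded value; `fill_forward` and its `start` argument are hypothetical names, not part of pitchRx.

```r
# Toy at-bat scores for one game, already ordered by at-bat number.
# NA means no value was recorded for that at-bat.
runs <- c(NA, NA, "1", NA, "3", NA)

# Carry the last recorded value forward; at-bats before the first
# recorded value get `start` (assumed to be 0 here).
fill_forward <- function(runs, start = 0) {
  runs <- as.numeric(runs)
  seen <- !is.na(runs)
  # cumsum(seen) counts the recorded values seen so far (0 if none yet),
  # so adding 1 indexes into c(start, <recorded values>).
  c(start, runs[seen])[cumsum(seen) + 1]
}

fill_forward(runs)
# 0 0 1 1 3 3
```

Whether forward filling is the right reading depends on how the source files record the score, so it is worth checking a few at-bats against the box score before relying on it.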

vegas31 commented 10 years ago

Ahh, thanks for the clarification -- I was making a bad assumption about what those fields meant.

The commands you provided work for the most part. Looking at a subset of the data (WAS/SDN), it appears to get the home values correct (here, 0 for the entire game), but then it starts adding 1's after a certain point. I am looking to see if there's a particular reason why it changes over, but haven't found a pattern yet.

Thanks again for your help!

cpsievert commented 10 years ago

Here is a method to convert home_team_runs/away_team_runs to the equivalent numeric representation.

library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")
atbats <- june8$atbat

library(dplyr)
# make sure records are ordered by num (within game)
atbats <- split(atbats, atbats$gameday_link) %>%
            lapply(., function(x) x[order(x$num), ]) %>%
            bind_rows()
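# replace missing values with the next non-missing value
# (at-bats after the last recorded value have no "next" value and stay NA)
f <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  filled <- rep(runs[idx], diff(c(0, idx)))
  # pad so the result has the same length as the input
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}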
atbats$home_team_runs <- unlist(with(atbats, tapply(home_team_runs, INDEX = gameday_link, f)))
atbats$away_team_runs <- unlist(with(atbats, tapply(away_team_runs, INDEX = gameday_link, f)))
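
To check the per-game fill without scraping anything, the same steps can be run on a small made-up data frame. This is only a sketch: the `gameday_link` values and scores are invented, and `fill_next` is a hypothetical stand-in that follows the same back-fill idea as `f` above while always returning a vector of the input's length.

```r
# Toy stand-in for june8$atbat; all values are made up for illustration.
atbats <- data.frame(
  gameday_link   = c("gid_b", "gid_a", "gid_a", "gid_b", "gid_a", "gid_b"),
  num            = c(2, 3, 1, 1, 2, 3),
  home_team_runs = c(NA, "2", NA, NA, NA, "1"),
  stringsAsFactors = FALSE
)

# Order by game, then by at-bat number, so the rows line up with the
# per-game output of tapply(), which returns groups in sorted order.
atbats <- atbats[order(atbats$gameday_link, atbats$num), ]

# Fill each NA with the next recorded value; pad with NA if none follows.
fill_next <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  filled <- rep(runs[idx], diff(c(0, idx)))
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}

atbats$home_team_runs <- unlist(
  tapply(atbats$home_team_runs, atbats$gameday_link, fill_next),
  use.names = FALSE
)
atbats$home_team_runs
# 2 2 2 1 1 1
```

The assignment back into the data frame only lines up because the rows were sorted by `gameday_link` first; if the rows are in a different order, the filled values land on the wrong at-bats.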