cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/

scrape Error Messages #20

Closed: aaronbaggett closed this issue 10 years ago

aaronbaggett commented 10 years ago

Hey Carson, I'm getting a couple of strange error messages when trying to scrape some data. Here's the code I'm using. My session info is also below.

R Code:

setwd("~/Dropbox/pfx_data")
pfx_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
scrape(start = "2009-01-01", end = "2014-01-01", suffix = files, connect = pfx_db$con)

Error Message:

Failed connect to http://gd2.mlb.com:80 ; Invalid argument

After a fresh session, I got the following message:

Successfully copied coach table to database connection.
Successfully copied player table to database connection.
Successfully copied umpire table to database connection.
Collecting garbage
Error in function (type, msg, asError = TRUE)  : 
  Could not resolve host: gd2.mlb.com

As you can see, it does appear to successfully copy some of the tables to my pfx_db, but action, atbat, pitch, po, and runner appear to be missing for some reason. When I check the tbls, I get the following:

pfx_db
src:  sqlite 3.7.17 [pitchRx.sqlite3]
tbls: coach, game, media, player, umpire

Session Info:

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pitchRx_1.4           ggplot2_1.0.0         RSQLite.extfuns_0.0.1 RSQLite_0.11.4       
[5] DBI_0.2-7             dplyr_0.2

cpsievert commented 10 years ago

Before I suggest anything, would you mind showing me what output you get from the following?

library(DBI)
gidz <- unique(dbGetQuery(pfx_db$con, "SELECT DISTINCT gameday_link FROM player")[,1])
head(gidz)
tail(gidz)
gidz2 <- unique(dbGetQuery(pfx_db$con, "SELECT DISTINCT gameday_link FROM game")[,1])
head(gidz2)
tail(gidz2)

Also, the action, atbat, pitch, po, and runner tables are missing because 'inning/inning_all.xml' is not included in files. You can easily add them to what you have so far by calling scrape without a suffix argument (it defaults to 'inning/inning_all.xml'):

scrape(start = "2009-01-01", end = "2014-01-01", connect = pfx_db$con)
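
One way to confirm which tables are still missing is to compare the tables in the database against the ones you expect. This is only a minimal sketch: `missing_tables` is a hypothetical helper (not part of pitchRx), and the expected table names are simply the ones mentioned in this thread.

```r
# Hypothetical helper (not part of pitchRx): which expected tables are absent?
missing_tables <- function(have, expected) {
  setdiff(expected, have)
}

# Tables named in this thread as coming from inning/inning_all.xml
inning_all_tables <- c("action", "atbat", "pitch", "po", "runner")

# Tables listed in the database from the initial report
have <- c("coach", "game", "media", "player", "umpire")

missing_tables(have, inning_all_tables)

# Against a live connection, `have` could instead come from:
# have <- DBI::dbListTables(pfx_db$con)
```

An empty result would mean every expected table is already in the database.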

aaronbaggett commented 10 years ago

Thanks. Yeah, I realized I was missing inning_all.xml after I submitted earlier. I should say that I updated my pfx_db and started over with all games played on Sunday just to try and troubleshoot things. Here is what I ran in order to build the test db. There did not appear to be any errors from what I ran below.

pfx_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
pfx_db
files <- c("inning/inning_all.xml", "players.xml")
scrape(start = "2014-07-20", end = "2014-07-20", suffix = files, connect = pfx_db$con)

Here's my output from your recommendation:

gidz <- unique(dbGetQuery(pfx_db$con, "SELECT DISTINCT gameday_link FROM player")[,1])
head(gidz)
[1] "gid_2014_07_20_cinmlb_nyamlb_1" "gid_2014_07_20_texmlb_tormlb_1"
[3] "gid_2014_07_20_clemlb_detmlb_1" "gid_2014_07_20_sfnmlb_miamlb_1"
[5] "gid_2014_07_20_colmlb_pitmlb_1" "gid_2014_07_20_kcamlb_bosmlb_1"
tail(gidz)
[1] "gid_2014_07_20_tbamlb_minmlb_1" "gid_2014_07_20_seamlb_anamlb_1"
[3] "gid_2014_07_20_balmlb_oakmlb_1" "gid_2014_07_20_chnmlb_arimlb_1"
[5] "gid_2014_07_20_nynmlb_sdnmlb_1" "gid_2014_07_20_lanmlb_slnmlb_1"
gidz2 <- unique(dbGetQuery(pfx_db$con, "SELECT DISTINCT gameday_link FROM game")[,1])
Error in sqliteExecStatement(con, statement, bind.data) : 
  RS-DBI driver: (error in statement: no such table: game)

Looks like gidz2 resulted in an error.

Thanks for your help!

cpsievert commented 10 years ago

I was hoping you'd run that code on the database that you had in your initial report (you shouldn't have to create more than one database). In your initial report, your database had coach, game, media, player, umpire tables, which indicates that the miniscoreboard.xml and players.xml files were correctly parsed and added to your database before the error occurred.

cpsievert commented 10 years ago

I was able to run your original code snippet successfully using pitchRx 1.5:

library(dplyr)
library(pitchRx)
pfx_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
scrape(start = "2009-01-01", end = "2014-01-01", suffix = files, connect = pfx_db$con)

It could be that your internet connection became unstable at some point. For this reason (among others), I usually suggest not scraping more than one year's worth of data at a time. In other words:

scrape(start = "2009-01-01", end = "2010-01-01", suffix = files, connect = pfx_db$con)
scrape(start = "2010-01-01", end = "2011-01-01", suffix = files, connect = pfx_db$con)
# and so on
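
The year-at-a-time pattern above could also be written as a loop that retries a window when the connection drops. This is a rough sketch under stated assumptions, not something pitchRx provides: `year_windows` and `scrape_by_year` are hypothetical helpers, and the scraping function is passed in as an argument so the loop can be tried without a network connection. Retrying a window may append rows that were already written before the failure, so it is worth checking for duplicates afterwards.

```r
# Hypothetical helper: split a span of years into one-year (start, end) windows.
year_windows <- function(first_year, last_year) {
  years <- as.integer(seq(first_year, last_year))
  data.frame(
    start = sprintf("%d-01-01", years),
    end   = sprintf("%d-01-01", years + 1L),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: call `scrape_fun` once per window, retrying on error.
# Extra arguments (e.g. suffix, connect) are passed through to `scrape_fun`.
scrape_by_year <- function(windows, scrape_fun, max_tries = 3, ...) {
  ok <- logical(nrow(windows))
  for (i in seq_len(nrow(windows))) {
    for (attempt in seq_len(max_tries)) {
      succeeded <- tryCatch({
        scrape_fun(start = windows$start[i], end = windows$end[i], ...)
        TRUE
      }, error = function(e) {
        message("Window starting ", windows$start[i], " failed (attempt ",
                attempt, "): ", conditionMessage(e))
        FALSE
      })
      if (succeeded) {
        ok[i] <- TRUE
        break
      }
    }
  }
  # Return the windows with a flag showing which ones completed
  cbind(windows, ok = ok)
}

# Usage (not run here, since it needs a network connection and a database):
# windows <- year_windows(2009, 2013)
# scrape_by_year(windows, pitchRx::scrape, suffix = files, connect = pfx_db$con)
```

Any window still flagged `ok = FALSE` in the result can be rerun on its own once the connection is stable.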

aaronbaggett commented 10 years ago

Thanks, Carson. See you soon at JSM.