cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/
Other
123 stars 33 forks source link

Errors with Database Connection and Partitioning #26

Closed gs5 closed 9 years ago

gs5 commented 9 years ago

I'm using pitchRx to scrape all Gameday data on a yearly basis and put it into an empty MySQL database (entitled "pitchrx"). I've done 2014 and then 2013 individually, and I've gotten a few errors. After loading the RMySQL, DBI,dplyr,pitchRx, and ggplot2 packages in R, I've run the following code:

mydb <- dbConnect(MySQL(), user='root', password='[my password]', dbname='pitchrx', host='localhost')

files <- c("inning/inning_all.xml","inning/inning_hit.xml","miniscoreboard.xml", "players.xml")

scrape(start = "2014-01-01", end = "2014-12-31", connect=mydb, suffix = files)

Upon a quick look, the "coach," "game," "hip," "media", "pitch," "player," "po," "runner," and "umpire" tables came out fine. As pitchRx scrapes, I frequently get messages saying that those tables were successfully copied to the database connection. But I get the following error message when it comes to "atbats":

Failed to copy atbat_export table to database connection. Writing atbat_export-2015-05-12-16-56-59.csv instead.

The result is 12 timestamped csv files sent to my working directory, containing the vast majority of 2014 plate appearance data (and then 13 total csv files for 2013 PA data). In sum, the "atbats" table contains data from 2/21/13 - 3/21/13, and then 2/25/14 - 3/10/14. An "atbats_export" table has also appeared in my pitchRx database, containing data from 3/21/13 - 4/5/13, 3/10/14 - 3/24/14, and 7/5/14 - 7/23/14.

When each year completes scraping, I get 28 duplicated warnings that read like this:

1: In export(connect, name = i, value = tables[[i]], template = fields[[i]]) : The value data.frame has variables that are not in the corresponding table. Writing data.frame to a new table instead.

In the same vein, "action" is separated into "action" and "action_export" tables. The only difference between the two is that "action" has chiefly spring training data (like "atbats," it only contains data from 2/21/13 - 3/21/13 and 2/25/14 - 3/10/14). The "action_export" table contains all the regular season data and some spring training games.

Should I be doing something differently?

cpsievert commented 9 years ago

Hmm, that's odd, and without a reproducible example, all I can do is guess; but rather than going down that rabbit hole, you'd probably be better off installing the newest version of pitchRx and starting from scratch with a new database.

gs5 commented 9 years ago

I took a deeper look at the issues. There are column differences between the 3 table groups ("atbat", "atbat_export," and the CSV files):

--"atbat_export" has all the columns that "atbat" has, and there's one extra column at the end, titled "event2_es." --The CSV files have all the columns that "atbat_export" and "atbat" have, and there's one extra column at the end, titled "event3_es."

It's the same issue when it comes to the "action" tables. The "action_export" table has all the columns found in the "action" table, but that extra column is once again added in at the end: "event2_es."

Does that help? It seems like the scraper is interpreting specific game data as though Spanish description columns are missing. That would explain the warning messages. The result is wrongfully separated tables.

cpsievert commented 9 years ago

Yea, you'll have to add those columns to the database tables so that the columns match up. Since you're using MySQL, I think you can do that with:

DBI::dbSendQuery(mydb, "ALTER TABLE atbat ADD event2_es VARCHAR(60);")
DBI::dbSendQuery(mydb, "ALTER TABLE atbat ADD event3_es VARCHAR(60);")

http://www.tech-recipes.com/rx/378/add-a-column-to-an-existing-mysql-table/

You'll also probably also want to append the atbat_export table to atbat, then get rid of atbat_export.

ryan112358 commented 9 years ago

I'm having a similar problem but I'm using an SQLite database. Scraper works flawlessly for 2012 data and before, but for 2013 and 2014, i'm getting a bunch of

Failed to copy atbat_export table to database connection. Writing atbat_export-2015-05-28-13-34-45.csv instead.

errors. Here is the relevant code:

library(pitchRx)
library(RSQLite)
library(dplyr)

db2013 <- src_sqlite("Gameday2013", create=TRUE)
scrape(start="2013-01-01", end="2013-12-31", suffix="inning/inning_all.xml", connect = db2013$con)

It appears that the format of the xml data changes mid-season which I think is causing the errors. Any chance there is a more permanent fix than manually updating and merging the tables?

Btw, thanks for creating this awesome package!