cpsievert / pitchRx

Tools for scraping MLB Gameday data and visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/

Low memory error/crash #22

Closed johnchoiniere closed 9 years ago

johnchoiniere commented 10 years ago

I was building a PITCHf/x database from scratch using pitchRx and repeatedly had system crashes from low memory. Crashes were more frequent in RStudio, but they happened both there and when running the script from the command line. In RStudio the scrape got through roughly one year of data before crashing; from the command line, roughly two years.

Code I was running:

library(pitchRx)
library(dplyr)
files <- c("inning/inning_all.xml","inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
db <- src_mysql("pitchrx", host = NULL, port = [redacted], user = "root", password = "[redacted]")
scrape(start = "2008-01-01", end = "2008-12-31", suffix = files, connect = db$con)
scrape(start = "2009-01-01", end = "2009-12-31", suffix = files, connect = db$con)
scrape(start = "2010-01-01", end = "2010-12-31", suffix = files, connect = db$con)
scrape(start = "2011-01-01", end = "2011-12-31", suffix = files, connect = db$con)
scrape(start = "2012-01-01", end = "2012-12-31", suffix = files, connect = db$con)
scrape(start = "2013-01-01", end = "2013-12-31", suffix = files, connect = db$con)
scrape(start = "2014-01-01", end = Sys.Date()-1, suffix = files, connect = db$con)

I was able to work around the issue by querying for gameday_link and sorting to find the most recent date, deleting rows from all tables where that date was part of the link, and then modifying the code to start again at that date.
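For illustration only, here is a minimal sketch of that workaround; it is not code from the thread. It assumes gameday_link values embed the game date as YYYY_MM_DD, that the DBI package is available, and that `con` is an open DBI connection such as `db$con`. The helper names `latest_gameday_date()` and `purge_date()` and the choice of the `atbat` table in the usage note are hypothetical.

```r
# Hypothetical helper: pull the most recent date out of gameday_link strings.
# Assumes each link embeds the date as YYYY_MM_DD.
latest_gameday_date <- function(links) {
  stamps <- regmatches(links, regexpr("[0-9]{4}_[0-9]{2}_[0-9]{2}", links))
  max(as.Date(stamps, format = "%Y_%m_%d"))
}

# Hypothetical helper: delete the possibly incomplete rows for one date from
# every table that has a gameday_link column.
purge_date <- function(con, date) {
  # "_" is a single-character wildcard in SQL LIKE, so this pattern is
  # slightly broader than a literal match on the date.
  pattern <- paste0("%", format(date, "%Y_%m_%d"), "%")
  for (tbl in DBI::dbListTables(con)) {
    if ("gameday_link" %in% DBI::dbListFields(con, tbl)) {
      sql <- paste0(
        "DELETE FROM ", DBI::dbQuoteIdentifier(con, tbl),
        " WHERE gameday_link LIKE ", DBI::dbQuoteString(con, pattern)
      )
      DBI::dbExecute(con, sql)
    }
  }
  invisible(date)
}
```

One possible use: read the links with `DBI::dbGetQuery(con, "SELECT DISTINCT gameday_link FROM atbat")$gameday_link`, pass them to `latest_gameday_date()`, call `purge_date()` with the result, and then restart `scrape()` with `start` set to that date.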

cpsievert commented 10 years ago

Thanks @johnchoiniere. I actually haven't used a MySQL connection with scrape yet, but I have a feeling your issues were a consequence of your machine having insufficient memory to pull an entire year at once. I'm hoping to have a more elegant solution for memory management in future versions.

johnchoiniere commented 10 years ago

Is there a way to clear any memory the script is using between years? Or is the solution just to run independent scripts for each year?

cpsievert commented 10 years ago

You could try calling gc() after each scrape() call is done if you don't want to restart the session.
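As a rough sketch of how that might look (not code from the thread, and not tested against the crash reported here), the date range can be split into month-sized windows with a garbage collection after each one. The helper names `month_windows()` and `scrape_in_chunks()` and the `scrape_fun` argument are hypothetical. Smaller windows are an assumption based on the guess above that a full year at once is too much for the machine.

```r
# Hypothetical helper: split [start, end] into month-sized windows.
month_windows <- function(start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  # First day of every month touched by the range
  firsts <- seq(as.Date(format(start, "%Y-%m-01")), end, by = "month")
  # Last day of each month is the day before the next month's first day
  lasts <- seq(firsts[1], by = "month", length.out = length(firsts) + 1)[-1] - 1
  # Clip the first and last windows to the requested range
  firsts[firsts < start] <- start
  lasts[lasts > end] <- end
  data.frame(start = firsts, end = lasts)
}

# Hypothetical wrapper: scrape one window at a time and call gc() in between.
# scrape_fun defaults to pitchRx::scrape and is only looked up when called.
scrape_in_chunks <- function(start, end, suffix, connect,
                             scrape_fun = pitchRx::scrape) {
  windows <- month_windows(start, end)
  for (i in seq_len(nrow(windows))) {
    scrape_fun(
      start = as.character(windows$start[i]),
      end = as.character(windows$end[i]),
      suffix = suffix,
      connect = connect
    )
    gc()  # release memory that is no longer referenced before the next window
  }
  invisible(windows)
}
```

With the script from the original report, each yearly call would become something like `scrape_in_chunks("2008-01-01", "2008-12-31", suffix = files, connect = db$con)`.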

johnchoiniere commented 10 years ago

Thanks!

On Aug 7, 2014 2:13 PM, "Carson" wrote:

You could try using gc after scrape is done if you don't want to restart the session.


cpsievert commented 9 years ago

Closing since this is a duplicate of #27 (which has a more complete report of the issue).