cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x

Memory Issues #27

Open kferris10 opened 9 years ago

kferris10 commented 9 years ago

I am running into some errors trying to scrape large amounts of PITCHf/x data on my Windows 7 computer. Here are some screenshots to illustrate.




P.S. Sorry if those numbers are impossible to see. Let me know if it would help to improve the quality of any of the screenshots.

colemanconley commented 8 years ago

I'm having what appears to be the exact same issue, and I have read through both this issue (#27) and the referenced issue #22. I tried gc() as suggested in #22, but it doesn't work on my machine, just as it doesn't work above. What is the solution? I can restart R, but I usually have to restart my machine for everything to run in a reasonable amount of time.

Related: I was trying to scrape data for all games from 03/01/2010 through the present by grabbing only one month at a time. R crashed midway through the games on 5/16/2012, so I restarted, loaded my packages, defined my connection, and then ran:

update_db(mysqlconnection, end="2012-05-20")

This starts getting the games from 5/17/2012 through 5/20, which obviously misses the remaining 5/16 games I didn't get to. How can I get the rest of the 5/16 games now without duplicating what I already have for that day?
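
One possible way to fill in a partially scraped day, sketched under assumptions: compare the game IDs for that date against the ones already stored, then pass only the missing ones to scrape() through its game.ids argument. The helper name missing_gids and the vector have are hypothetical, the IDs below are made up for illustration, and how the stored IDs are pulled from the database depends on the schema.

```r
# Hypothetical helper: which game IDs for `date` are not stored yet?
# all_gids: character vector of known IDs, e.g. from data(gids, package = "pitchRx")
# have:     character vector of IDs already in the database (obtaining it is up to you)
missing_gids <- function(all_gids, have, date) {
  # Gameday IDs embed the date as gid_YYYY_MM_DD_...
  stamp <- format(as.Date(date), "gid_%Y_%m_%d_")
  todays <- all_gids[startsWith(all_gids, stamp)]
  setdiff(todays, have)
}

# Made-up IDs, for illustration only
all_gids <- c("gid_2012_05_16_aaamlb_bbbmlb_1",
              "gid_2012_05_16_cccmlb_dddmlb_1",
              "gid_2012_05_17_aaamlb_bbbmlb_1")
have <- c("gid_2012_05_16_aaamlb_bbbmlb_1")

todo <- missing_gids(all_gids, have, "2012-05-16")

# Not run here: needs pitchRx and an open connection
# scrape(game.ids = todo, connect = mysqlconnection)
```

If the crash happened partway through a single game, that game may be partly stored, so it is worth checking its rows before scraping it again.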

kferris10 commented 8 years ago

@colemanconley I've never had an issue with duplicating games when using update_db. My strategy is to first scrape one year of data. Then I can just run update_db one year at a time. Is that not working for you?
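
A minimal sketch of that year-at-a-time strategy, assuming the first year has already been scraped into the database. The helper name yearly_ends and the connection name con are hypothetical; only the end argument of update_db, which appears earlier in this thread, is used.

```r
# Hypothetical helper: end dates for successive update_db() calls, one year apart
yearly_ends <- function(first_end, last_end) {
  last_end <- as.Date(last_end)
  ends <- seq(as.Date(first_end), last_end, by = "1 year")
  # Make sure the final call reaches last_end even if it is not a whole year away
  if (tail(ends, 1) < last_end) ends <- c(ends, last_end)
  ends
}

ends <- yearly_ends("2011-03-01", "2013-06-15")

# Not run here: needs pitchRx and an open database connection `con` (assumed)
# for (e in as.character(ends)) pitchRx::update_db(con, end = e)
```

Looping over as.character(ends) keeps the dates as strings, since a for loop over a Date vector drops the Date class.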

myellen commented 7 years ago

I have these memory issues on Windows but not on Mac. The only way I've found to free up the memory is to restart the R session. What I do is create a new SNOW cluster with one node for each call to scrape, which is effectively the same as starting a new R session each time.

Some code I use:

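library(snow)  # provides makeCluster(), clusterEvalQ(), clusterExport(), clusterCall()

# One-year chunk boundaries; chunk i runs from ll[i] to ll[i+1]
ll <- seq(as.Date(start_date), as.Date(end_date), by = "1 year")
ntasks <- length(ll) - 1
for (i in seq_len(ntasks)) {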
      cl<-makeCluster(1, type="SOCK", outfile = "")
      clusterEvalQ(cl, library(pitchRx))
      clusterEvalQ(cl, library(DBI))
      clusterEvalQ(cl, library(RSQLite))
      clusterEvalQ(cl, library(dplyr))

      clusterExport(cl, list = c("ll", "files", "dbpath"), envir=environment())
      clusterCall(cl, function(i) {
        db <- src_sqlite(dbpath, create = TRUE)
        scrape(start = ll[i], end = ll[i+1], suffix = files, connect = db$con)