cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/

Memory Issues #27

Open kferris10 opened 9 years ago

kferris10 commented 9 years ago

I am running into errors when trying to scrape large amounts of PITCHf/x data on my Windows 7 computer. Here are some screenshots to illustrate:

1-pre-scrape

2-post-scrape

3-post-gc

P.S. Sorry if those numbers are impossible to see. Let me know if it would help to improve the quality of any of the screenshots.

colemanconley commented 8 years ago

I'm having what appears to be the exact same issue. I read through both this issue (#27) and the referenced issue #22. I tried gc() as you suggested in #22, but it doesn't work on my machine, just as it doesn't work above. What is the solution? I can restart R, but I usually have to restart my machine for everything to run in a reasonable amount of time.

Related: I was trying to scrape data for all games from 03/01/2010 through the present by grabbing only one month at a time. R crashed midway through the games on 5/16/2012, so I restarted, loaded my packages, defined my connection, and then ran:

update_db(mysqlconnection, end="2012-05-20")

This starts getting the games from 5/17/2012 through 5/20, which obviously misses the remaining 5/16 games I didn't get to. How can I get the rest of the 5/16 games now without duplicating what I already have for that day?

kferris10 commented 8 years ago

@colemanconley I've never had an issue with duplicating games when using update_db. My strategy is to first scrape one year of data. Then I can just run update_db one year at a time. Is that not working for you?
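
For the partially scraped day, one possible check is to compare the game IDs expected for that date against the IDs already stored, and scrape only the difference. The sketch below is base R only and uses made-up IDs; `missing_game_ids` is a hypothetical helper, not part of pitchRx. It assumes game IDs follow the `gid_YYYY_MM_DD_...` pattern, and that the stored IDs could be pulled from a column such as `gameday_link` in your database (an assumption about your schema). A game that was only partly written before the crash would still count as "scraped" here, so it would need a separate check.

```r
# Hypothetical helper: which game IDs for a given date are not yet stored?
missing_game_ids <- function(all_ids, scraped_ids, date) {
  # Build the date prefix, e.g. "gid_2012_05_16"
  prefix <- paste0("gid_", format(as.Date(date), "%Y_%m_%d"))
  on_date <- all_ids[startsWith(all_ids, prefix)]
  # Keep only the IDs that are not already in the database
  setdiff(on_date, scraped_ids)
}

# Made-up IDs for illustration only
all_ids <- c(
  "gid_2012_05_16_aaamlb_bbbmlb_1",
  "gid_2012_05_16_cccmlb_dddmlb_1",
  "gid_2012_05_17_aaamlb_bbbmlb_1"
)
scraped_ids <- c("gid_2012_05_16_aaamlb_bbbmlb_1")

todo <- missing_game_ids(all_ids, scraped_ids, "2012-05-16")
print(todo)

# Assumption: the remaining IDs could then be passed to scrape(), e.g.
# scrape(game.ids = todo, connect = mysqlconnection)
```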

myellen commented 7 years ago

I have these memory issues on Windows but not on Mac. The only way I've found to free up the memory is to restart the R session. What I do is make a new SNOW cluster with one node to run the scrape method each time, which is the same as having a new R session each time.

Some code I use:

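# Assumes start_date, end_date, files (the suffix passed to scrape) and
# dbpath are already defined.
library(snow)  # provides makeCluster(type = "SOCK") and clusterExport(list = )

ll <- seq(as.Date(start_date), as.Date(end_date), "1 year")
ntasks <- length(ll) - 1
for (i in seq_len(ntasks)) {
  print(ll[i])
  print(ll[i + 1])
  # Start a fresh single-node cluster so each chunk runs in a new R process
  cl <- makeCluster(1, type = "SOCK", outfile = "")
  clusterEvalQ(cl, library(pitchRx))
  clusterEvalQ(cl, library(DBI))
  clusterEvalQ(cl, library(RSQLite))
  clusterEvalQ(cl, library(dplyr))

  # Copy the variables the worker needs into its global environment
  clusterExport(cl, list = c("ll", "files", "dbpath"), envir = environment())
  clusterCall(cl, function(i) {
    db <- src_sqlite(dbpath, create = TRUE)
    scrape(start = ll[i], end = ll[i + 1], suffix = files, connect = db$con)
    dbDisconnect(db$con)
  }, i)
  # Stopping the cluster ends the worker process and releases its memory
  stopCluster(cl)
}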
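
One thing to watch in the loop above: `ll[i + 1]` is used both as the end of one chunk and the start of the next, and `seq(..., "1 year")` stops at the last whole-year step before `end_date`, so the final partial year is not covered. A minimal sketch of building non-overlapping chunks that run all the way to `end_date` might look like this; `date_chunks` is a hypothetical helper written for illustration, not part of pitchRx or snow.

```r
# Hypothetical helper: split [start_date, end_date] into consecutive chunks
# that do not share a boundary day and that include the final partial chunk.
date_chunks <- function(start_date, end_date, by = "1 year") {
  starts <- seq(as.Date(start_date), as.Date(end_date), by = by)
  # Each chunk ends the day before the next one starts; the last ends at end_date
  ends <- c(starts[-1] - 1, as.Date(end_date))
  data.frame(start = starts, end = ends)
}

chunks <- date_chunks("2010-03-01", "2012-05-20")
print(chunks)
```

Each row of `chunks` could then supply the `start` and `end` arguments for one worker call, in place of `ll[i]` and `ll[i + 1]`.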