Memory Issues - Githubissues

cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x

http://cpsievert.github.io/pitchRx/


Memory Issues #27

Open kferris10 opened 9 years ago

kferris10 commented 9 years ago

I am running into memory problems when scraping large amounts of PITCHf/x data on my Windows 7 computer. Here are some screenshots to illustrate.

When I start a new R session, I'm only using about 20 MB of memory.

[Screenshot: 1-pre-scrape]

I run this code to scrape several months of PITCHf/x data:

library(pitchRx)
library(dplyr)
library(DBI)

# setwd("~/pitchfx")
db <- src_sqlite("pitchfx14.sqlite3", create = TRUE)

update_db(db$con, "2014-12-01")

scrape(start = "2014-01-01",
       end = "2014-04-01",
       suffix = c("inning/inning_all.xml",
                  "inning/inning_hit.xml",
                  "miniscoreboard.xml",
                  "players.xml"),
       connect = db$con)

When this is finished running, the R session is now using almost 1 GB of memory.

[Screenshot: 2-post-scrape]

Running gc() appears to have no effect.

[Screenshot: 3-post-gc]

The only solution I have found is to restart R completely.

P.S. Sorry if those numbers are impossible to see. Let me know if it would help to improve the quality of any of the screenshots.
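For anyone who wants numbers rather than screenshots, here is a minimal base-R sketch for logging memory around a step. The helper name `r_heap_mb` is made up for illustration. Note that gc() only reports memory managed by R's own allocator, so the figure can drop even when the process size shown by the operating system does not, which may be part of what the screenshots show.

```r
# Report the memory currently used by R's own heap, in megabytes.
# gc() forces a collection and returns a matrix; column 2 holds the
# "(Mb)" used by Ncells and Vcells.
r_heap_mb <- function() {
  g <- gc()
  sum(g[, 2])
}

before <- r_heap_mb()
x <- numeric(5e6)        # roughly 40 MB of doubles, stands in for scraped data
during <- r_heap_mb()
rm(x)
after <- r_heap_mb()     # gc() runs inside the helper

cat(sprintf("before: %.1f MB, with x: %.1f MB, after rm + gc: %.1f MB\n",
            before, during, after))
```

Calling the helper before and after scrape() would show whether the extra memory is still held by R objects or has already been returned to R's allocator.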

colemanconley commented 8 years ago

I'm having what appears to be the exact same issue. I read through both #27 above and the referenced issue #22 and tried gc() as suggested in #22, but it has no effect on my machine, just as it has no effect above. What is the solution? I can restart R, but I usually have to restart my machine for everything to run in a reasonable amount of time.

Related: I was trying to scrape data for all games from 03/01/2010 through the present, grabbing only one month at a time. R crashed midway through the games on 5/16/2012, so I restarted, loaded my packages, defined my connection, and then ran:

update_db(mysqlconnection, end="2012-05-20")

This starts getting the games from 5/17/2012 through 5/20, which misses the remaining 5/16 games I didn't get to. How can I get the rest of the 5/16 games now without duplicating what I already have for that day?
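One possible way to approach this, sketched under assumptions: work out which game IDs for that day are not yet stored, then scrape only those. The helper `missing_gids` and the example IDs below are made up for illustration. The commented usage assumes that stored games can be read from a `gameday_link` column in the `atbat` table and that those values use the same `gid_...` format as the `gids` vector shipped with pitchRx; check both against your own database first.

```r
# Return the game IDs for one day that are not yet stored.
# all_ids:    every known game ID for the period
# stored_ids: IDs already in the database (assumed to come from a query)
# day:        a date string such as "2012-05-16"
missing_gids <- function(all_ids, stored_ids, day) {
  prefix <- paste0("gid_", gsub("-", "_", day, fixed = TRUE))
  todays <- all_ids[startsWith(all_ids, prefix)]
  setdiff(todays, stored_ids)
}

# Made-up IDs, for illustration only
all_ids <- c("gid_2012_05_16_aaamlb_bbbmlb_1",
             "gid_2012_05_16_cccmlb_dddmlb_1",
             "gid_2012_05_17_aaamlb_cccmlb_1")
stored_ids <- "gid_2012_05_16_aaamlb_bbbmlb_1"

todo <- missing_gids(all_ids, stored_ids, "2012-05-16")
print(todo)

# Hypothetical usage against a real database (not run here):
# data(gids, package = "pitchRx")
# stored <- DBI::dbGetQuery(con,
#   "SELECT DISTINCT gameday_link FROM atbat")$gameday_link
# scrape(game.ids = missing_gids(gids, stored, "2012-05-16"), connect = con)
```

A game that was only partly written when R crashed would still count as stored here, so it may need to be checked by hand.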

kferris10 commented 8 years ago

@colemanconley I've never had an issue with duplicating games when using update_db. My strategy is to first scrape one year of data. Then I can just run update_db one year at a time. Is that not working for you?

myellen commented 7 years ago

I have these memory issues on Windows but not on Mac. The only way I've found to free up the memory is to restart the R session. What I do is create a new SNOW cluster with one node for each call to scrape, which is the same as having a new R session each time.

Some code I use:

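library(snow)  # makeCluster(type = "SOCK") and clusterExport(list = ) come from snow

# start_date, end_date, files (the suffix vector) and dbpath are defined elsewhere
ll <- seq(as.Date(start_date), as.Date(end_date), "1 year")
ntasks <- length(ll) - 1
for (i in seq_len(ntasks)) {
  print(ll[i])
  print(ll[i + 1])
  # a one-node cluster gives a fresh R session for each year
  cl <- makeCluster(1, type = "SOCK", outfile = "")
  clusterEvalQ(cl, library(pitchRx))
  clusterEvalQ(cl, library(DBI))
  clusterEvalQ(cl, library(RSQLite))
  clusterEvalQ(cl, library(dplyr))

  clusterExport(cl, list = c("ll", "files", "dbpath"), envir = environment())
  clusterCall(cl, function(i) {
    db <- src_sqlite(dbpath, create = TRUE)
    scrape(start = ll[i], end = ll[i + 1], suffix = files, connect = db$con)
    dbDisconnect(db$con)
  }, i)
  stopCluster(cl)  # shutting the worker down releases its memory
}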
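The same idea can be sketched with the `parallel` package that ships with R, in case installing snow is not an option. This is only an illustration of the pattern, not a tested replacement: the helper name `run_in_fresh_session` is made up, and the worker function returns its process ID where a real run would call scrape(). It uses monthly rather than yearly chunks, as colemanconley described.

```r
library(parallel)

# Run fun(...) in a brand-new R process and return its result.
# The worker is shut down afterwards, so whatever memory it used goes with it.
run_in_fresh_session <- function(fun, ...) {
  cl <- makeCluster(1)                  # one PSOCK worker = one new R session
  on.exit(stopCluster(cl), add = TRUE)  # always stop the worker, even on error
  clusterCall(cl, fun, ...)[[1]]
}

month_starts <- seq(as.Date("2014-01-01"), as.Date("2014-04-01"), by = "1 month")

pids <- integer(0)
for (i in seq_len(length(month_starts) - 1)) {
  pid <- run_in_fresh_session(
    function(start, end) {
      # A real run would open the database and call
      # scrape(start = start, end = end, ...) here.
      Sys.getpid()
    },
    as.character(month_starts[i]),
    as.character(month_starts[i + 1])
  )
  pids <- c(pids, pid)
  cat("chunk starting", as.character(month_starts[i]), "ran in process", pid, "\n")
}
```

Passing the dates as arguments avoids the clusterExport step, since the worker function no longer needs to look up variables from the calling session.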