cpsievert / pitchRx

Tools for scraping MLB Gameday data and Visualizing PITCHf/x
http://cpsievert.github.io/pitchRx/

URL malformed #21

Closed konrad1234 closed 10 years ago

konrad1234 commented 10 years ago

Hey Carson,

When I run the following command:

scrape(start = "2014-01-01", end = Sys.Date(), connect=con)

I get this error:

Error in function (type, msg, asError = TRUE) : <url> malformed

I ran it through your dev build and got this result:

Error in strsplit(gids[length(gids)], split = "_") : 
internal error -3 in R_decompress1 
3 strsplit(gids[length(gids)], split = "_") at scrape.R#333
2 makeUrls(start = start, end = end) at scrape.R#101
1 scrape(start = "2014-01-01", end = "2014-08-01") 

When I run the debug I get this:

function(start, end, gids="infer") {
  root <- "http://gd2.mlb.com/components/game/mlb/"
  if (all(gids %in% "infer")) {
    if (missing(start) || missing(end)) {
      warning("Can't 'infer' game urls without start/end date.")
      return(root)
    } else {
      start <- as.POSIXct(start)
      end <- as.POSIXct(end)
      env <- environment()
      data(gids, package="pitchRx", envir=env)
      last.game <- strsplit(gids[length(gids)], split="_")[[1]]
      last.date <- as.POSIXct(paste(last.game[2], last.game[3], last.game[4], sep="-"))
      #need to rework this guy
      #if (last.date < end) gids <- c(gids, updateGids(max(start, last.date), end))
      return(gids2urls(subsetGids(gids, first=start, last=end)))
    }
  } else {
    gidz <- gids[grep("gid", gids)]
    if (length(gidz) != length(gids)) {
      #message("The option gids was ignored since some values did not contain 'gid'")
      return(paste0(root, dates2urls(as.POSIXct(start), as.POSIXct(end))))
    } else {
      return(gids2urls(gidz))
    }
  }
}

With this line highlighted:

last.game <- strsplit(gids[length(gids)], split="_")[[1]]

Is the URL being built incorrectly? Any insight you can provide would be great.

Thanks!
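For reference, here is a minimal sketch of what the highlighted line does when it is given a well-formed game ID. The ID below is a hypothetical placeholder in the `gid_YYYY_MM_DD_away_home_n` shape, not a value from this issue. Since the parsing is plain string splitting, a decompression error raised at this line more plausibly comes from loading the packaged `gids` data than from how the URL is built, though that is an inference rather than something confirmed in the thread.

```r
# Hypothetical game ID, used only to illustrate the parsing step.
gid <- "gid_2014_07_31_aaamlb_bbbmlb_1"

# Split on underscores: element 1 is "gid", elements 2-4 are year, month, day.
parts <- strsplit(gid, split = "_")[[1]]

# Rebuild the date the same way the function in this post does.
last.date <- as.POSIXct(paste(parts[2], parts[3], parts[4], sep = "-"))
format(last.date, "%Y-%m-%d")
```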

cpsievert commented 10 years ago

I try to discourage people from using end = Sys.Date(), as it can lead to unintended consequences (files are updated in real time -- so there is a chance of HTTP errors and/or obtaining duplicate records). I'm guessing that if you use end = Sys.Date() - 1, you won't receive an error. Please give that a try and let me know how it goes. Thanks!
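One way to apply this suggestion is to cap the end date at yesterday before calling scrape(). The helper name `safe_end` is made up for this sketch and is not part of pitchRx. The scrape() call is left commented out because it assumes pitchRx is installed and that `con` is an existing database connection, as in the original command.

```r
# Sketch: never request a date later than yesterday, so that files still
# being updated for today's games are not scraped.
# `safe_end` is a hypothetical helper, not a pitchRx function.
safe_end <- function(end = Sys.Date(), today = Sys.Date()) {
  end <- as.Date(end)
  min(end, today - 1)
}

start_date <- "2014-01-01"
end_date <- format(safe_end())  # "YYYY-MM-DD" string, at most yesterday

# Assumes pitchRx is loaded and `con` is an open database connection:
# scrape(start = start_date, end = end_date, connect = con)
```

Passing an explicit `today` makes the helper easy to check, for example `safe_end("2014-08-01", today = as.Date("2014-08-01"))` is capped at July 31, 2014.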

konrad1234 commented 10 years ago

I upgraded to the latest version of R and the problem has disappeared. I'll make a note to use your recommendation. Thanks for the help!

cpsievert commented 10 years ago

Ah, good to know, thanks for reporting back!

cpsievert commented 10 years ago

Just curious -- what version were you using?

konrad1234 commented 10 years ago

I believe it was 2.15.3
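For anyone hitting the same error, a quick check of the running R version before scraping might look like the sketch below. The only grounded fact is that 2.15.3 was reported as failing in this thread. Treating it as a cutoff is an assumption, and `r_is_newer` is a hypothetical helper name.

```r
# 2.15.3 is the version reported as failing in this thread; using it as a
# cutoff is an assumption made for illustration only.
reported_failing <- "2.15.3"

# Hypothetical helper: TRUE when `current` is strictly newer than `failing`.
r_is_newer <- function(current = getRversion(), failing = reported_failing) {
  current > numeric_version(failing)
}

if (!r_is_newer()) {
  warning("R ", as.character(getRversion()), " is at or below ",
          reported_failing, "; consider upgrading before calling scrape().")
}
```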