CollinWoo / daynight-Q10

MIT License

Code Efficiency #26

Open CollinWoo opened 3 years ago

CollinWoo commented 3 years ago

The code I wrote to extract the moisture from multiple files (SMOStif) takes about three minutes to run with the large dataset of 100,000 timestamps. Finding a way to reduce this runtime would be desirable. The code is pasted below:


SMOStif <- function(datevec, lon, lat){
  tifiles <- list.files("LO3_tif")
  dateFiles <- list()
  fileDates <- unique(as.Date(datevec))

  for(date in fileDates){
    probDate <- c(date-1, date, date+1)
    probDate %>% lapply(as.Date) %>% lapply(format, "%Y%m%d") -> probDate

    probDate[1] <- paste0(probDate[1], "_")
    probDate[3] <- paste0("_",probDate[3])
    patterns <- paste(probDate, collapse="|")
    matchfiles <- tifiles[grep(patterns, tifiles)]

    if(length(matchfiles) >0){
      matchfiles <- paste0("./LO3_tif/", matchfiles)
    }

    fileRow <- data.frame(Surface.File = matchfiles[1], Subsurface.File = matchfiles[2])
    dateFiles[[toString(as.Date(date))]] <- fileRow
  }

  bind_rows(dateFiles, .id = "Date") %>% 
    distinct(Surface.File, .keep_all = TRUE) %>% 
    filter(!is.na(Surface.File)) ->  dateFiles

  # I can't use mutate() with the raster function, so this for loop seemed like the next best option
  if(nrow(dateFiles) > 0){
    for(value in 1:nrow(dateFiles)){
      dateFiles[value, 4] = extractSMOS(dateFiles[[value, 2]], lon, lat)
      dateFiles[value, 5] = extractSMOS(dateFiles[[value, 3]], lon, lat)
    }
  }

  dateFiles %>% rename(Surface = V4, Subsurface = V5) %>%
    mutate(Lon = lon, Lat = lat)-> dateFiles
  print(dateFiles)
  return(dateFiles)
}
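
One possible way around the mutate limitation noted in the code comment above is to wrap the per-file call in vapply(), which returns a plain numeric vector that can be assigned as a column. This is a minimal sketch, assuming extractSMOS returns a single number per file; fake_extract is a hypothetical stand-in so the example runs without the raster files, and the file names are made up. It tidies the loop but does not by itself reduce the number of raster reads.

```r
# Hypothetical stand-in for extractSMOS(): takes a path, returns one number.
# The real function reads a raster; this only mimics its shape.
fake_extract <- function(path, lon, lat) {
  if (is.na(path)) return(NA_real_)
  nchar(path) + lon + lat
}

# Made-up table with the same columns SMOStif builds before extraction
dateFiles <- data.frame(
  Date = c("2020-01-01", "2020-01-02"),
  Surface.File = c("./LO3_tif/a.tif", "./LO3_tif/bb.tif"),
  Subsurface.File = c("./LO3_tif/c.tif", NA),
  stringsAsFactors = FALSE
)

lon <- 1
lat <- 2

# vapply() calls the extractor once per file and guarantees a numeric vector
dateFiles$Surface <- vapply(dateFiles$Surface.File, fake_extract, numeric(1),
                            lon = lon, lat = lat, USE.NAMES = FALSE)
dateFiles$Subsurface <- vapply(dateFiles$Subsurface.File, fake_extract, numeric(1),
                               lon = lon, lat = lat, USE.NAMES = FALSE)
```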

bpbond commented 3 years ago

@10aDing to test the profiler and get back to us!

CollinWoo commented 3 years ago

@bpbond I'm not completely sure how profiling works, but I profiled SMOStif for 12 seconds, and most of that time was spent in calls to the extractSMOS function, about 8 seconds in total. (Screenshot of the profiler output attached.)
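
For anyone following along, here is a rough sketch of how a profile like the one described above can be collected with base R's Rprof() and summaryRprof(). The functions slow_step and run_all are hypothetical toys standing in for extractSMOS and SMOStif; they are not the code from this repository.

```r
# Toy functions standing in for extractSMOS() and SMOStif() (both hypothetical)
slow_step <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}
run_all <- function() {
  vapply(1:50, function(k) slow_step(2e5), numeric(1))
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start sampling the call stack
result <- run_all()
Rprof(NULL)                        # stop profiling

prof <- summaryRprof(prof_file)
# by.total ranks functions by time spent in them plus everything they call
head(prof$by.total)
```

In the by.total table, a function that dominates total.time (as extractSMOS reportedly does here) is the first place to look for savings.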

bpbond commented 3 years ago

Okay. Is the GitHub code fully updated for me to pull?

CollinWoo commented 3 years ago

The moisture branch should be up to date. I think I fixed the issue with the nonexistent column error as well.

bpbond commented 3 years ago

@10aDing

Your SMOStifnames function currently takes a vector of dates and constructs filenames from them. It does so for every individual date, even though we know there's going to be lots of overlap:

Browse[2]> length(datevec)
[1] 86479
Browse[2]> length(unique(datevec))
[1] 355

In other words, for the hirano example there are 86,479 dates, but only 355 unique dates. 99.6% of the work you're doing here is completely unnecessary. Do you understand this?

I will open a PR to improve this.
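
As a rough illustration of the idea (not the actual PR), the expensive step can be run once per unique date and the results mapped back to every timestamp with match(). In this sketch, expensive_lookup is a hypothetical stand-in for the filename and raster work, and the dates are made up.

```r
# Hypothetical per-date worker standing in for the expensive filename/raster step
calls <- 0
expensive_lookup <- function(d) {
  calls <<- calls + 1  # count how often the expensive step actually runs
  format(d, "%Y%m%d")
}

# Timestamps with heavy repetition: 1,000 entries but only 3 distinct dates
datevec <- as.Date("2020-01-01") + rep(0:2, length.out = 1000)

unique_dates <- unique(datevec)
# Indexing with seq_along() keeps the Date class, which `for (d in dates)` drops
per_date <- vapply(seq_along(unique_dates),
                   function(i) expensive_lookup(unique_dates[i]),
                   character(1))

# match() maps every original timestamp back to its unique-date result
result <- per_date[match(datevec, unique_dates)]
```

Here the worker runs 3 times instead of 1,000; with the numbers above it would run 355 times instead of 86,479.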