ArgoCanada / argoFloats

Tools for analyzing collections of oceanographic Argo floats
https://argocanada.github.io/argoFloats/index.html

way to speed up getIndex() a little #582

Closed dankelley closed 2 years ago

dankelley commented 2 years ago

@j-harbin, I think parts of argoFloats can be sped up. The issue is that gsub() can be slow unless perl=TRUE is supplied. A demonstration (using f as index$file in an argoFloatsIndex object) is:

# comments give user time in seconds
system.time(ID1<-gsub(".*/(.*)_.*", "\\1", f))                # 6.892
system.time(ID2<-gsub(".*/(.*)_.*", "\\1", f, perl=TRUE))     # 2.753
system.time(ID3<-gsub(".*/(.*)_.*", "\\1", f, useBytes=TRUE)) # 6.907

You might not get the same results, because I am on the beta version of R, which handles strings in a new (and slower) way. In a few weeks, though, that beta will become the released version, so users who update frequently (which I think many do) will be on it within a month or so.
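
For anyone without a real index at hand, the comparison can be approximated with synthetic file names. The sketch below is only illustrative: the path layout (aoml/<ID>/profiles/R<ID>_<cycle>.nc) and the vector size are assumptions, not taken from an actual argoFloatsIndex object, and timings will differ by R version and machine. It also checks that both calls return the same result, which holds for these ASCII-only paths.

```r
# Synthetic stand-in for index$file (assumed layout, 10000 entries)
id <- rep(1900001:1900050, each = 200)
cycle <- rep(1:200, times = 50)
f <- sprintf("aoml/%d/profiles/R%d_%03d.nc", id, id, cycle)

# Same pattern as in the demonstration, with and without perl=TRUE
print(system.time(ID_default <- gsub(".*/(.*)_.*", "\\1", f)))
print(system.time(ID_perl <- gsub(".*/(.*)_.*", "\\1", f, perl = TRUE)))

# The speed-up is only useful if the output is unchanged
stopifnot(identical(ID_default, ID_perl))
```

To see a measurable difference, it may be necessary to make f much longer, since a real index holds far more than 10000 files.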

For context, trimming 4 seconds out of 7 seconds is not a lot, but it's not nothing, either. And I think we might have several instances in which we call gsub() without perl=TRUE. When I run

git grep -n gsub | grep -v perl

in the argoFloats/R directory, I get the output below. Notice that we do this for both ID and for cycle. (We also seem to be repeating things, but I've not examined the code to see if that's true.)

AllClass.R:339:#' data.frame(file=gsub(".*/", "", index5[["file"]][1]),
AllClass.R:394:                      ## cycle <- gsub("^.*[/\\\\][A-Z]*[0-9]*_([0-9]{3,4}[D]{0,1})\\.nc$", "\\1", x@data$index$file)
AllClass.R:401:                      #test told <- system.time({IDold <- gsub("^.*[/\\\\][A-Z]*([0-9]*)_[0-9]{3,4}[A-Z]*\\.nc$", "\\1", x@data$index$file)})
AllClass.R:407:                      ID <- gsub("^.*[/\\\\][A-Z]*([0-9]*)_[A-Z]*traj*\\.nc$", "\\1", x@data$index$file)
AllClass.R:426:                      cycle <- gsub("^.*[/\\\\][A-Z]*[0-9]*_([0-9]{3,4})[A-Z]*\\.nc$", "\\1", x[["file"]])
AllClass.R:429:                      ID <- gsub("^.*[/\\\\][A-Z]*([0-9]*)_[0-9]{3,4}[A-Z]*\\.nc$", "\\1", x[["file"]])
AllClass.R:458:                      cycle <- gsub("^.*[/\\\\][A-Z]*[0-9]*_([0-9]{3,4})[A-Z]*\\.nc$", "\\1", unlist(x[["filename"]]))
adjusted.R:12:    typeFromFilename <- switch(substring(gsub(".*/","",fn),1,1), "A"="adjusted", "D"="delayed", "R"="realtime")
get.R:127:        destfile <- gsub(".*/(.*).nc", "\\1.nc", url)
get.R:348:        destfileRda <- gsub(".txt.gz$", ".rda", destfile)
get.R:350:        destfileRda <- gsub(".txt$", ".rda", destfile)
get.R:423:    ftpRoot <- gsub("^[^:]*:[ ]*(.*)$", "\\1", first[which(grepl("^# FTP", first))])
get.R:527:        to <- paste0(destdir, "/", gsub(".*/", "", url[iurlSuccess]))
read.R:56:## file <- gsub(".*/", "",  profiles[[1]])
read.R:160:            fileNames <- gsub(".*/(.*).nc", "\\1.nc", profiles@data$file[!mustSkip])
subset.R:505:                    xcycle <- as.integer(gsub("AD","",xcycle)) # change e.g. "123D" to "123"
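
As a rough check before changing any of the calls listed above, the ID and cycle patterns from AllClass.R:426 and AllClass.R:429 can be run with and without perl=TRUE on a few made-up file names. The names below are hypothetical examples that follow the usual Argo naming, and the sketch only verifies that results agree for them; it is not a claim about how the package was eventually changed.

```r
# Hypothetical file names (realtime, delayed, and a descending profile)
files <- c("aoml/1900001/profiles/R1900001_001.nc",
           "aoml/1900001/profiles/D1900001_002.nc",
           "coriolis/6900123/profiles/R6900123_045D.nc")

# Patterns copied from the grep listing (AllClass.R:429 and AllClass.R:426)
idPattern <- "^.*[/\\\\][A-Z]*([0-9]*)_[0-9]{3,4}[A-Z]*\\.nc$"
cyclePattern <- "^.*[/\\\\][A-Z]*[0-9]*_([0-9]{3,4})[A-Z]*\\.nc$"

# Default regex engine versus perl=TRUE
ID <- gsub(idPattern, "\\1", files)
IDperl <- gsub(idPattern, "\\1", files, perl = TRUE)
cycle <- gsub(cyclePattern, "\\1", files)
cyclePerl <- gsub(cyclePattern, "\\1", files, perl = TRUE)

# Both engines should give the same captures for these names
stopifnot(identical(ID, IDperl), identical(cycle, cyclePerl))
print(data.frame(file = files, ID = IDperl, cycle = cyclePerl))
```

A check of this kind on the full index would be a stronger test, since the two regex engines can differ on unusual inputs such as non-ASCII strings.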

github-actions[bot] commented 2 years ago

The Stale-bot has marked this issue as Stale, because no new comments have been added in the past 30 days. Unless a comment is added within the next 7 days, the Stale-bot will close the issue. The purpose of these automated actions is to prevent the developers from forgetting about unattended tasks. Note that adding a "pinned" label will turn this action off for a given issue.

dankelley commented 2 years ago

Perhaps this should be examined before we let it close.
