cbirdlab / charybdis

Metabarcoding Pipeline
GNU General Public License v3.0

crittersVStubes_OTU.R is extremely slow and does not run in parallel #23

Open cbird808 opened 5 years ago

cbird808 commented 5 years ago

While I suspect there are other ways of improving the speed, the time-consuming steps can be parallelized. Note that the README will have to be updated to include the `parallel` package in R.

Replace `apply` with `parApply`:

```r
apply(X = charon, MARGIN = 1, function(x) {
  assign(x[[1]], x[[2]], as.numeric(x[[4]]), x[[5]], as.numeric(x[[6]]))
})
```

```r
# parallel version of apply
library(parallel)
cl <- makeCluster(detectCores())
parApply(cl = cl, charon, 1, function(x) {
  assign(x[[1]], x[[2]], as.numeric(x[[4]]), x[[5]], as.numeric(x[[6]]))
})
stopCluster(cl)
```
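For reference, the drop-in swap looks like this on toy data (the matrix `m` and the per-row `sum` are placeholders, not the real `charon` and assignment call):

```r
# Minimal sketch of swapping apply() for parApply() on toy data.
library(parallel)

m <- matrix(1:20, nrow = 4)

# serial baseline: one result per row
serial <- apply(m, 1, sum)

# parallel version: same semantics, rows split across workers
cl <- makeCluster(2)
par_res <- parApply(cl, m, 1, sum)
stopCluster(cl)

all(serial == par_res)  # the two versions agree row for row
```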

cbird808 commented 5 years ago

Here's another time-consuming step:

```r
# Use the taxonomic rank and the TAXID as coordinates to assign
# the scientific name in the appropriate field
for (i in 1:(length(higherTaxa))) {
  for (j in 1:(length(higherTaxa[[i]]))) {
    if (is.na(attributes(higherTaxa[[i]][j])$names)) {
      break
    }
    if (attributes(higherTaxa[[i]][j])$names != "no rank") {  # Skip "no rank"
      colIdx <- attributes(higherTaxa[[i]][j])$names
      full[i, colIdx] <- as.character(sciname(id = as.numeric(higherTaxa[[i]][j]),
                                              taxdir = TAXDIR, names = ncbi_names))
    }
  }
}
```

```r
# parallel version
fillTax <- function(i, TAXDIR) {
  for (j in 1:(length(higherTaxa[[i]]))) {
    if (is.na(attributes(higherTaxa[[i]][j])$names)) {
      break
    }
    if (attributes(higherTaxa[[i]][j])$names != "no rank") {  # Skip "no rank"
      colIdx <- attributes(higherTaxa[[i]][j])$names
      full[i, colIdx] <- as.character(sciname(id = as.numeric(higherTaxa[[i]][j]),
                                              taxdir = TAXDIR, names = ncbi_names))
    }
  }
}

cl <- makeCluster(detectCores())
clusterExport(cl, "TAXDIR")
# clusterExport(cl = cl, varlist = c("text.var", "ntv"))
parLapply(cl, 1:length(higherTaxa), function(x) fillTax(x, TAXDIR))
stopCluster(cl)
```
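One caveat with a draft like this: assignments to `full` inside `fillTax` happen in each worker's own copy of the environment, so they never reach the master session. A safer pattern is to have each task return its piece and merge on the master afterwards. A minimal sketch, on placeholder data (`rows` and `fillRow` stand in for `higherTaxa` and the real `sciname()` lookups):

```r
# Return-and-merge pattern: workers return values instead of
# writing into a shared object like `full`.
library(parallel)

rows <- list(c(kingdom = 1), c(phylum = 2), c(class = 3))

fillRow <- function(i, rows) {
  # real code would call sciname() here and build the row for `full`
  names(rows[[i]])
}

cl <- makeCluster(2)
# passing `rows` as an extra argument ships it to the workers,
# so no clusterExport() is needed for it
out <- parLapply(cl, seq_along(rows), fillRow, rows)
stopCluster(cl)

merged <- unlist(out)  # collect the per-task results on the master
```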

cbird808 commented 5 years ago

I've started improving this. I have streamlined the processing of charon and the creation of CVT, and have started using furrr to parallelize the time-consuming tasks.
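For reference, the furrr pattern looks roughly like this (toy mapping only; the actual charybdis tasks are assumed, not shown):

```r
# Minimal furrr sketch: future_map() mirrors purrr::map() but runs the
# iterations across background R sessions configured via future::plan().
library(furrr)
plan(multisession, workers = 2)

squares <- future_map(1:4, ~ .x^2)

plan(sequential)  # restore serial execution when done
unlist(squares)
```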

ekrell commented 5 years ago

Okay perfect. This has been on my list for a long time. The script for counting OTUs (bin/CROP_size_fix.sh) is also nasty slow and trivially parallel. As in, the script itself could just be called in parallel on subsets of the data.