drieslab / Giotto

Spatial omics analysis toolbox
https://drieslab.github.io/Giotto_website/
Other
258 stars 98 forks source link

fast implemenation of make_simulated_network #315

Open andrewrech opened 2 years ago

andrewrech commented 2 years ago

Hi there,

I needed a fast implementation of make_simulated_network for my work. The below is a pure-data.table approach that is 3 or 4 log faster on my 120 core system with O(n) memory usage.

Your implementation of future does not work for me so I've used mclapply here, but you could edit as needed without major differences in performance.

Happy to craft to PR if interest

Thanks

Andrew

# cell type proximity enrichment ####

#' @title make_simulated_network
#' @name make_simulated_network
#' @description Simulate random network.
#' @keywords internal
make_simulated_network = function(gobject,
                                  spat_unit = NULL,
                                  feat_type = NULL,
                                  spatial_network_name = 'Delaunay_network',
                                  cluster_column,
                                  number_of_simulations = 100,
                                  set_seed = TRUE,
                                  seed_number = 1234) {

  # Set feat_type and spat_unit
  spat_unit = set_default_spat_unit(gobject = gobject,
                                    spat_unit = spat_unit)
  feat_type = set_default_feat_type(gobject = gobject,
                                    spat_unit = spat_unit,
                                    feat_type = feat_type)

  # data.table variables
  unified_cells = NULL

  spatial_network_annot = annotateSpatialNetwork(gobject = gobject,
                                                 feat_type = feat_type,
                                                 spat_unit = spat_unit,
                                                 spatial_network_name = spatial_network_name,
                                                 cluster_column = cluster_column)

  # remove double edges between same cells #
  spatial_network_annot = sort_combine_two_DT_columns(spatial_network_annot,
                                                      column1 = 'from', column2 = 'to',
                                                      myname = 'unified_cells')
  spatial_network_annot = spatial_network_annot[!duplicated(unified_cells)]

  # create a simulated network

  all_cell_type = c(spatial_network_annot$from_cell_type, spatial_network_annot$to_cell_type)
  middle_point = length(all_cell_type)/2

  ret <- 1:number_of_simulations %>%
    parallel::mclapply(function(sim) {
    print(sim)

    if(set_seed == TRUE) {
      seed_number = seed_number+sim
      set.seed(seed = seed_number)
    }

    reshuffled_all_cell_type = sample(x = all_cell_type,
                                      size = length(all_cell_type),
                                      replace = F)

    new_from_cell_type = reshuffled_all_cell_type[1:middle_point]

    new_to_cell_type = reshuffled_all_cell_type[(middle_point+1):length(all_cell_type)]

    ret <-
      data.table::data.table(s1 = new_from_cell_type,
                             s2 = new_to_cell_type,
                             round = paste0("sim", sim)
                             )

    ret[, unified_int := paste0(s1, "--", s2)]
    ret[s1 == s2, type_int := "homo"]
    ret[s1 != s2, type_int := "hetero"]

    return(ret)

  }) %>%
  data.table::rbindlist()

  return(ret)
}
RubD commented 2 years ago

Hi Andrew, Thanks a lot for sharing. Looks like a good and concise improvement to me (on a first quick look).

I'm also tagging @jiajic and @mattobny here so that they can also take a look, but feel free to do a PR. We like to give people credit for their work.

We've also ran into some future issues, but at the same time we like the premise that in theory it can be run on any machine or configuration. Perhaps we should extend lapply_flex and have mclapply has one of the options. This is also the one that I always used, but it's not compatible with Windows afaik.

Ruben

andrewrech commented 2 years ago

Agree on all counts re: future. Yes, no mclapply on Windows, unfortunately. Unclear to me how many people will hit this issue of a need for efficient permutation testing. You could always supply a multi-arch Docker container and stop caring about platforms :-)

From a user perspective, mclapply backend in lapply_flex sounds good (that is what I did on my fork once I realized lapply_flex existed).

@jiajic @mattobny the trick above is to build the (sub) data.tables on the fork and then use data.table::rbindlist to avoid extra allocations back on master. data.table handles very long tables with aplomb so this works alright, whereas outside data.table, multi-GB vectors were slow to work with.

That is all I changed, no need for credit (but thanks), please feel free to copy from here if helpful.

RubD commented 1 year ago

I've tested the current implementation with the proposed parallelized version and time gains seems to depend on your hardware and dataset. I believe that the best way forward would be to make a combination of both functions such that you would be able to optimally divide the number of simulations per core.