aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Running pycisTopic on very large datasets [PERFORMANCE] #106

Open simozhou opened 6 months ago

simozhou commented 6 months ago

What type of problem are you experiencing and which function is your problem related to? I am running cisTopic on a very large dataset (200k cells, approximately 80k regions) and it is taking very long.

I am running the Mallet version of pycisTopic, and the function is called with these parameters:

models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[2,5,10,15,20,25,30,35,40,45,50,60,70,80,90,100,150],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path, #Use SCRATCH if many models or big data set
                    save_path=None)

Is there a way I can speed up the computation? At the moment it has been running for more than 4 days, and I plan to run it on an even bigger dataset (1M cells). I have the feeling I might be doing something wrong, and that maybe I could do something differently (maybe not use Mallet? I'm not sure). Do you have any suggestions?

The machine it runs on has 64 CPUs and 500GB of RAM available.

Version information: pycisTopic 1.0.3.dev20+g8955c76

SeppeDeWinter commented 5 months ago

Hi @simozhou

This step can take a long time; however, 4 days is still a lot.

Did any intermediate models finish in this time, or is it stuck at running the model with 2 topics?

I would also suggest specifying a save_path ("Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None."). This will save any intermediate models as they finish.
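For example, re-using the parameters from your snippet above, the only change would be something like the sketch below (the "mallet_models" directory name is just an illustration):

import os

# Directory where each completed model will be written as its own file
# ("mallet_models" is only an example name).
save_dir = "mallet_models"
os.makedirs(save_dir, exist_ok=True)

models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[2,5,10,15,20,25,30,35,40,45,50,60,70,80,90,100,150],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path,
                    save_path=save_dir) # each model is saved here as it completes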

All the best,

Seppe

simozhou commented 5 months ago

Hi @SeppeDeWinter,

Thank you so much for your feedback!

All models do run eventually, although very slowly (the 2-topic model runs faster, and for obvious reasons models with more topics are slower).

I will definitely add a save path to avoid recalculating all models every time.

I am providing 450 GB of RAM for this job. Do you think a larger amount of RAM would help speed up the computation?

Thanks again and best regards, Simone

SeppeDeWinter commented 5 months ago

Hi @simozhou

450 GB of RAM should be enough. I'm not sure why it's running so slowly for you...

All the best,

Seppe

tiffanywc commented 4 months ago

I am also running Mallet with a very large dataset. I have saved intermediate models in case the run terminates before completion. In this case, how can I combine the topic models from multiple runs into a single mallet.pkl?

SeppeDeWinter commented 3 months ago

Hi @tiffanywc

We store each model as an entry in a list. Some pseudocode below:

import os
import pickle

models = []
for file in os.listdir(<PATH_TO_DIRECTORY_WITH_MODELS>):
   # check whether the file is a result from topic modelling, e.g. based on the name
   if file.endswith(".pkl"):
      with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, file), "rb") as handle:
         models.append(pickle.load(handle))
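If you then want these combined into a single file like the mallet.pkl you mentioned, a minimal sketch (assuming mallet.pkl should simply contain this list of models) would be:

# Sketch: write the combined list of models back out as one pickle file.
# The file name "mallet.pkl" is only an example, taken from the question above.
with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, "mallet.pkl"), "wb") as handle:
   pickle.dump(models, handle)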

I hope this helps?

All the best,

Seppe

TemiLeke commented 2 weeks ago

Hello @simozhou,

I'm wondering if you managed to find a resolution, because I'm currently facing a similar challenge:

Despite the seemingly small number of topics and substantial computational resources, the process is taking an unexpectedly long time. Have you encountered any solutions or optimizations that might help in this scenario? Any insights or workarounds you've discovered would be greatly appreciated.

Thank you!

simozhou commented 2 weeks ago

Hi @TemiLeke,

In short, no, I have not yet solved the runtime problem, but a few improvements helped make it at least tolerable:

  1. Saving each model as it completes (via save_path) helped a lot to avoid re-running the whole experiment if something failed (usually a TIMEOUT error from the HPC 😅).
  2. Setting reuse_corpus=True also helped a lot: I realised that the Mallet corpus file was otherwise re-written for every model, so reusing it saved some time.
  3. If you are working on an HPC, make sure you are not using more than one node. The algorithm is not optimised to run in a distributed fashion, so spreading it over several nodes makes things much slower than they should be. I was running cisTopic with 128 CPUs, only to realise that each node on my HPC had 64 CPUs, which paradoxically slowed the computations down! See the small sanity check after this list.
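As that sanity check, here is a small Linux-only sketch that caps n_cpu at the number of CPUs actually visible to the job on the current node, so Mallet is never asked for more workers than one node provides:

import os

# Number of CPUs this process is actually allowed to use on this node
# (Linux-only; on other platforms os.cpu_count() is a coarser fallback).
available_cpus = len(os.sched_getaffinity(0))

# Never request more Mallet workers than the node can provide.
n_cpu = min(64, available_cpus)

You could then pass this value as n_cpu in the call below.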

This is the code I'm currently using:

# this would be the first time we run cisTopic on this data set
models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[10,15,20,50,60,70,80,90,100,150,200],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path, #Use SCRATCH if many models or big data set
                    save_path=os.path.join(args.outdir, 'models'),
                    reuse_corpus=True)

I would like to point out that the computation is still very slow, and it would be good to address this problem. Running my 1 million cell dataset with the parameters above took 8 days (which was somewhat expected, but it would be ideal to shorten this time for the next iteration if possible :) ).

@SeppeDeWinter is there something we can do to help? I would be happy to contribute and possibly figure out why this is so slow!

TemiLeke commented 2 weeks ago

Thanks a lot for the detailed reply @simozhou. I'm currently trying this out. Unfortunately I only have access to a 40-core system, so it will take even longer.

I agree it would be good to address the problem, and I'd be very happy to contribute in any capacity. @SeppeDeWinter