Open simozhou opened 6 months ago
Hi @simozhou
This step can take a long time, but 4 days is still a lot.
Did any intermediate models finish in this time, or is it stuck at running the model with 2 topics?
I would also suggest specifying save_path ("Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None."). This will save any intermediate models.
All the best,
Seppe
Hi @SeppeDeWinter,
Thank you so much for your feedback!
All models do run eventually, although very slowly (the 2-topic model runs faster; for obvious reasons, larger models with more topics are slower).
I will definitely add a save path to avoid recalculating all models every time.
I am providing 450GB of RAM for this job. Do you believe that a larger amount of RAM may help with the speed of computations?
Thanks again and best regards, Simone
Hi @simozhou
450 GB of RAM should be enough. I'm not sure why it's running so slowly for you...
All the best,
Seppe
I am also running MALLET on a very large dataset. I have saved intermediate models in case the run terminates before completion. How can I combine the models from multiple runs into a single mallet.pkl in this case?
Hi @tiffanywc
We store each model as an entry in a list. Some pseudocode below:
import os
import pickle

# Placeholder: set this to the directory that contains the saved models
model_dir = "<PATH_TO_DIRECTORY_WITH_MODELS>"

models = []
for file in sorted(os.listdir(model_dir)):
    # check whether the file is a result from topic modelling, e.g. based on the name
    if file.endswith(".pkl"):
        with open(os.path.join(model_dir, file), "rb") as f:
            models.append(pickle.load(f))
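To end up with a single file such as mallet.pkl, the list built by the loop above can be pickled again. The following is only a minimal sketch: combine_saved_models is a hypothetical helper, not part of pycisTopic, and it assumes that every .pkl file in the directory is a saved model. Loading real pycisTopic models this way also requires pycisTopic to be importable.

```python
import os
import pickle


def combine_saved_models(model_dir, out_path):
    """Load every .pkl file in model_dir and write them to out_path as one list.

    Hypothetical helper for illustration; not part of pycisTopic.
    """
    out_abs = os.path.abspath(out_path)
    models = []
    for name in sorted(os.listdir(model_dir)):
        path = os.path.join(model_dir, name)
        # Skip non-pickle files and the combined output itself,
        # in case it is written into the same directory.
        if not name.endswith(".pkl") or os.path.abspath(path) == out_abs:
            continue
        with open(path, "rb") as handle:
            models.append(pickle.load(handle))
    # Store all models together as a single pickled list.
    with open(out_path, "wb") as handle:
        pickle.dump(models, handle)
    return models
```

Writing the combined file outside the model directory avoids any risk of it being picked up as a model on a later run, although the helper also skips it explicitly.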
I hope this helps?
All the best,
Seppe
Hello @simozhou,
I'm wondering if you managed to find a resolution, because I'm currently facing a similar challenge:
Despite the seemingly small number of topics and substantial computational resources, the process is taking an unexpectedly long time. Have you encountered any solutions or optimizations that might help in this scenario? Any insights or workarounds you've discovered would be greatly appreciated.
Thank you!
Hi @TemiLeke,
In short, no, I have not yet solved my time problem. There are a few improvements that helped make it at least tolerable.
reuse_corpus=True also helped a lot: I realised that the MALLET compressed object was being rewritten every time, so reusing it saved some time. This is the code I'm currently using:
# this would be the first time we run cisTopic on this data set
models = run_cgs_models_mallet(
    path_to_mallet_binary,
    cistopic_obj,
    n_topics=[10, 15, 20, 50, 60, 70, 80, 90, 100, 150, 200],
    n_cpu=64,
    n_iter=500,
    random_state=420,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path=tmp_path,  # Use SCRATCH if many models or big data set
    save_path=os.path.join(args.outdir, 'models'),
    reuse_corpus=True,
)
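Because save_path writes each model to its own file as it completes, a restarted job only needs the topic counts that are still missing. The sketch below shows one way to work that out. pending_topic_counts is a hypothetical helper, and the file name pattern Topic<N>.pkl is an assumption about how the saved files are named, so check the contents of your own save_path directory and adjust the pattern if needed.

```python
import os
import re


def pending_topic_counts(save_dir, requested, pattern=r"^Topic(\d+)\.pkl$"):
    """Return the requested topic counts that have no saved model in save_dir.

    Hypothetical helper; the default pattern is an assumed file naming scheme.
    """
    done = set()
    if os.path.isdir(save_dir):
        for name in os.listdir(save_dir):
            match = re.match(pattern, name)
            if match:
                # The captured group is the number of topics of a finished model.
                done.add(int(match.group(1)))
    # Preserve the order of the requested list.
    return [n for n in requested if n not in done]
```

The returned list could then be passed as n_topics in the call above, so that only the unfinished models are run again.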
I would like to point out that the computation is still very slow, and it would be good to address this problem. I ran my 1-million-cell dataset with the aforementioned parameters and it took 8 days of computation (which was kinda foreseen, but it would be ideal to shorten this time for the next iteration if possible :) ).
@SeppeDeWinter is there something we can do to help? I would be happy to contribute and possibly figure out why this is so slow!
Thanks a lot for the detailed reply @simozhou. I'm currently trying this out. Unfortunately I only have access to a 40-core system, so it would take even longer.
I agree it would be good to address the problem, and I'd be very happy to contribute in any capacity. @SeppeDeWinter
What type of problem are you experiencing and which function is your problem related to
I am running cisTopic on a very large dataset (200k cells, approx. 80k regions) and it is apparently taking very long.
I am running the mallet version of pycisTopic, and the function has these params:
Is there a way I can speed up the computations? At the moment it has been running for more than 4 days, and I plan to run it on an even bigger dataset (1M cells). I have the feeling I might be doing something wrong and could do something differently (maybe not use Mallet? not sure). Do you have any suggestions?
The machine it runs on has 64 CPUs and 500GB of RAM available.
Version information
pycisTopic: 1.0.3.dev20+g8955c76