simozhou opened 10 months ago
Hi @simozhou
This step can take a long time; however, 4 days is still a lot.
Did any intermediate models finish in this time, or is it stuck at running the model with 2 topics?
I would also suggest specifying a save_path ("Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None."). This will save any intermediate models.
All the best,
Seppe
Hi @SeppeDeWinter,
Thank you so much for your feedback!
All models do run eventually, although very slowly (the 2-topic model runs faster; larger models with more topics are, for obvious reasons, slower).
I will definitely add a save path to avoid recalculating all models every time.
I am providing 450GB of RAM for this job. Do you believe that a larger amount of RAM may help with the speed of computations?
Thanks again and best regards, Simone
Hi @simozhou
450 GB of RAM should be enough. I'm not sure why it's running so slowly for you...
All the best,
Seppe
I am also running Mallet with a very large dataset. I have saved intermediate models in case the run terminates before completion. In this case, how can I combine the multiple runs so that the different topic models end up together, as in mallet.pkl?
Hi @tiffanywc
We store each model as an entry in a list. Some example code below (with <PATH_TO_DIRECTORY_WITH_MODELS> as a placeholder):

```python
import os
import pickle

models = []
for file in os.listdir(<PATH_TO_DIRECTORY_WITH_MODELS>):
    # Check whether the file is a result from topic modelling, e.g. based on the name.
    if file.endswith(".pkl"):
        with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, file), "rb") as fh:
            models.append(pickle.load(fh))
```
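If you want the resulting list ordered by topic number, note that a plain lexicographic sort of filenames puts "10" before "2". A small stdlib sketch of a numeric sort (the filenames below are hypothetical, not the actual pycisTopic naming scheme):

```python
import re


def sort_by_topic_number(filenames):
    """Sort model filenames by the first integer they contain (the topic number)."""
    def topic_number(name):
        match = re.search(r"\d+", name)
        return int(match.group()) if match else -1
    return sorted(filenames, key=topic_number)


# Hypothetical filenames, for illustration only.
files = ["Topic10.pkl", "Topic2.pkl", "Topic50.pkl"]
ordered = sort_by_topic_number(files)
# ordered: ["Topic2.pkl", "Topic10.pkl", "Topic50.pkl"]
```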
I hope this helps?
All the best,
Seppe
Hello @simozhou,
I'm wondering if you managed to find a resolution, because I'm currently facing a similar challenge:
Despite the seemingly small number of topics and substantial computational resources, the process is taking an unexpectedly long time. Have you encountered any solutions or optimizations that might help in this scenario? Any insights or workarounds you've discovered would be greatly appreciated.
Thank you!
Hi @TemiLeke,
In short, no, I have not yet solved my time problem. There are a few improvements that helped make it at least tolerable.
Setting reuse_corpus=True also helped a lot: I realised that the Mallet compressed corpus object was being re-written on every run, and reusing it saved some time. This is the code I'm currently using:
```python
# this would be the first time we run cisTopic on this data set
models = run_cgs_models_mallet(
    path_to_mallet_binary,
    cistopic_obj,
    n_topics=[10, 15, 20, 50, 60, 70, 80, 90, 100, 150, 200],
    n_cpu=64,
    n_iter=500,
    random_state=420,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path=tmp_path,  # use SCRATCH if many models or a big data set
    save_path=os.path.join(args.outdir, 'models'),
    reuse_corpus=True,
)
```
I would like to point out that computation is still very slow, and it would be good to address this problem. Running my 1-million-cell dataset with the aforementioned parameters took 8 days (which was somewhat expected, but it would be ideal to shorten this for the next iteration if possible :) )
@SeppeDeWinter is there something we can do to help? I would be happy to contribute and possibly figure out why this is so slow!
Thanks a lot for the detailed reply @simozhou. I'm currently trying this out. I unfortunately only have access to a 40-core system, so it would even take longer.
I agree it would be good to address the problem, and I'd be very happy to contribute in any capacity. @SeppeDeWinter
Hi @simozhou and @TemiLeke, I am running (or trying to run) topic modelling on a dataset with almost 1.5 million cells and 600,000 regions. These operations indeed require a lot of memory and take a long time, but with the latest command-line version, the pre- and postprocessing of the corpus is already twice as fast as it used to be, so you could try that one. I have not yet managed to let the full topic modelling finish, since I am still figuring out how much memory I should request (requesting too little gave me out-of-memory errors). Most of the time the Mallet training step completes, while the memory bottleneck is in the loading of the assigned topics. The CLI command I am using now looks like this, running one topic number at a time:

```shell
pycistopic topic_modeling mallet -i $cto_path -o $out_path -t $topic_number -p 40 -n 400 -a 50 -e 0.1 -s 555 -k True -T $temp_dir -m 1000 -b $mallet_path -r True
```
Hi @JulieDeMan
I haven't tried the CLI version yet, but it would be interesting to see how much it speeds up training. The memory-consuming part of the pipeline indeed has to do with the loading of assigned topics. I dug a little further and found that the problem has to do with the group_by and agg operations in load_word_topics (around line 635), even though lazy loading defers the execution. I repeatedly encountered the issue even with 1 TB of RAM provisioned, which is the maximum I can get on the cluster.
To work around this, I wrote a custom function that significantly reduces memory usage, albeit at the cost of increased processing time (see below). It loads and processes the topic assignments in smaller chunks, performs the group_by and agg operations on each chunk, and merges the result into a larger Polars dataframe, hence the increased processing time.
Here's how I implemented this in pycisTopic: I defined custom_polars_read_csv() and replaced the pl.read_csv function call on line 625 with custom_polars_read_csv(self.fstate(), chunk_size=2_000_000).
```python
import gzip
import io

import polars as pl


def custom_polars_read_csv(file_path, chunk_size=2_000_000):
    """
    Alternative to polars.read_csv for the large gzipped CSV files (created by
    MALLET) where memory consumption is a concern. It streams the file and
    aggregates it chunk by chunk to reduce memory usage during processing, at
    the cost of increased processing time.

    Key features:
    1. Streams the gzipped file, avoiding loading the entire file into memory at once.
    2. Performs the grouping and aggregation on each chunk.
    3. Merges the per-chunk results into a single aggregated DataFrame.

    Compared to standard polars.read_csv:
    - Pros: significantly lower memory usage, especially for large files.
    - Cons: slower processing due to streaming and chunked aggregation.

    Args:
        file_path (str): Path to the gzipped CSV file.
        chunk_size (int): Number of lines to process in each chunk.

    Returns:
        polars.DataFrame: A DataFrame containing the grouped and aggregated results.
    """
    pl.enable_string_cache()
    # Initialize an empty DataFrame with the correct schema to store the aggregated results.
    result = pl.DataFrame(
        schema={"topic": pl.Int64, "region": pl.Int64, "occurrence": pl.UInt64}
    )
    with gzip.open(file_path, "rt") as f:
        # Skip the first 3 rows.
        for _ in range(3):
            next(f)
        while True:
            chunk = io.StringIO()
            for _ in range(chunk_size):
                line = f.readline()
                if not line:
                    break
                chunk.write(line)
            if chunk.getvalue() == "":
                break
            chunk.seek(0)
            df = pl.read_csv(
                chunk,
                separator=" ",
                has_header=False,
                columns=[4, 5],
                new_columns=["region", "topic"],
                schema_overrides={"region": pl.Int64, "topic": pl.Int64},
            )
            # Perform the group-by and aggregation on the chunk.
            chunk_result = df.group_by(["topic", "region"]).agg(
                pl.len().cast(pl.UInt64).alias("occurrence")
            )
            # Merge the chunk result with the overall result.
            result = pl.concat([result, chunk_result]).group_by(["topic", "region"]).sum()
    return result
```
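The chunk-and-merge idea itself is independent of Polars. A stdlib sketch showing that counting per chunk and then merging the partial results gives exactly the same counts as aggregating everything at once:

```python
from collections import Counter
from itertools import islice


def chunked_counts(pairs, chunk_size):
    """Count (topic, region) occurrences chunk by chunk, merging partial results."""
    it = iter(pairs)
    total = Counter()
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Per-chunk aggregate, merged into the running total (occurrences sum up).
        total.update(Counter(chunk))
    return total


# Toy (topic, region) assignments, for illustration only.
pairs = [(1, "r1"), (1, "r1"), (2, "r2"), (1, "r1"), (2, "r2")]
assert chunked_counts(pairs, chunk_size=2) == Counter(pairs)
```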
Please note that this is a crude implementation, and there may well be a more efficient approach.
The whole Mallet-related code in pycisTopic was recently rewritten in the polars_1xx branch: https://github.com/aertslab/pycisTopic/commits/polars_1xx/
For example, the output parsing was changed so that, instead of the --output-state file, a --word-topic-counts-file file is written when running Mallet. That file can be written and read much faster (a difference of 24 hours for big modelling runs): https://github.com/aertslab/pycisTopic/commit/5d8cd371534548bbaed1dbf3ac555c6668fd8eb9
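For reference, Mallet's --word-topic-counts-file format has one line per word type: the type index, the type string (here a region ID), then space-separated topic:count pairs. A minimal stdlib sketch of parsing that format (illustrative only; this is not pycisTopic's actual parser, and the example lines are made up):

```python
from collections import defaultdict


def parse_word_topic_counts(lines):
    """Parse Mallet --word-topic-counts-file lines into {topic: {word: count}}."""
    counts = defaultdict(dict)
    for line in lines:
        parts = line.split()
        # parts[0] is the word-type index, parts[1] the word (here: a region ID).
        word = parts[1]
        for pair in parts[2:]:
            topic, count = pair.split(":")
            counts[int(topic)][word] = int(count)
    return dict(counts)


# Tiny made-up example in the same format.
example = [
    "0 chr1:100-600 2:17 0:3",
    "1 chr1:900-1400 1:25",
]
parsed = parse_word_topic_counts(example)
```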
Running Mallet via pycistopic CLI also got an update:
$ pycistopic topic_modeling mallet -h
usage: pycistopic topic_modeling mallet [-h] {create_corpus,run,stats} ...
options:
-h, --help show this help message and exit
Topic modeling with "Mallet":
{create_corpus,run,stats}
List of "Mallet" topic modeling subcommands.
create_corpus Convert binary accessibility matrix to Mallet serialized corpus file.
run Run LDA topic modeling with "Mallet".
stats Calculate model evaluation statistics.
$ pycistopic topic_modeling mallet create_corpus -h
usage: pycistopic topic_modeling mallet create_corpus [-h] -i BINARY_ACCESSIBILITY_MATRIX_FILENAME -o MALLET_CORPUS_FILENAME [-m MEMORY_IN_GB] [-b MALLET_PATH] [-v]
options:
-h, --help show this help message and exit
-i BINARY_ACCESSIBILITY_MATRIX_FILENAME, --input BINARY_ACCESSIBILITY_MATRIX_FILENAME
Binary accessibility matrix (region IDs vs cell barcodes) in Matrix Market format.
-o MALLET_CORPUS_FILENAME, --output MALLET_CORPUS_FILENAME
Mallet serialized corpus filename.
-m MEMORY_IN_GB, --memory MEMORY_IN_GB
Amount of memory (in GB) Mallet is allowed to use. Default: "10"
-b MALLET_PATH, --mallet_path MALLET_PATH
Path to Mallet binary (e.g. "/xxx/Mallet/bin/mallet"). Default: "mallet".
-v, --verbose Enable verbose mode.
$ pycistopic topic_modeling mallet run -h
usage: pycistopic topic_modeling mallet run [-h] -i MALLET_CORPUS_FILENAME -o OUTPUT_PREFIX -t TOPICS [TOPICS ...] -p PARALLEL [-n ITERATIONS] [-a ALPHA] [-A {True,False}] [-e ETA] [-E {True,False}] [-s SEED] [-m MEMORY_IN_GB] [-b MALLET_PATH]
[-v]
options:
-h, --help show this help message and exit
-i MALLET_CORPUS_FILENAME, --input MALLET_CORPUS_FILENAME
Mallet corpus filename.
-o OUTPUT_PREFIX, --output OUTPUT_PREFIX
Topic model output prefix.
-t TOPICS [TOPICS ...], --topics TOPICS [TOPICS ...]
Number(s) of topics to create during topic modeling.
-p PARALLEL, --parallel PARALLEL
Number of threads Mallet is allowed to use.
-n ITERATIONS, --iterations ITERATIONS
Number of iterations. Default: 150.
-a ALPHA, --alpha ALPHA
Alpha value. Default: 50.
-A {True,False}, --alpha_by_topic {True,False}
Whether the alpha value should by divided by the number of topics. Default: True.
-e ETA, --eta ETA Eta value. Default: 0.1.
-E {True,False}, --eta_by_topic {True,False}
Whether the eta value should by divided by the number of topics. Default: False.
-s SEED, --seed SEED Seed for ensuring reproducibility. Default: 555.
-m MEMORY_IN_GB, --memory MEMORY_IN_GB
Amount of memory (in GB) Mallet is allowed to use. Default: "100"
-b MALLET_PATH, --mallet_path MALLET_PATH
Path to Mallet binary (e.g. "/xxx/Mallet/bin/mallet"). Default: "mallet".
-v, --verbose Enable verbose mode.
$ pycistopic topic_modeling mallet stats -h
usage: pycistopic topic_modeling mallet stats [-h] -i BINARY_ACCESSIBILITY_MATRIX_FILENAME -c CELL_BARCODES_FILENAME -r REGION_IDS_FILENAME -o OUTPUT_PREFIX -t TOPICS [TOPICS ...] [-v]
options:
-h, --help show this help message and exit
-i BINARY_ACCESSIBILITY_MATRIX_FILENAME, --input BINARY_ACCESSIBILITY_MATRIX_FILENAME
Binary accessibility matrix (region IDs vs cell barcodes) in Matrix Market format.
-c CELL_BARCODES_FILENAME, --cb CELL_BARCODES_FILENAME
Filename with cell barcodes.
-r REGION_IDS_FILENAME, --regions REGION_IDS_FILENAME
Filename with region IDs.
-o OUTPUT_PREFIX, --output OUTPUT_PREFIX
Topic model output prefix.
-t TOPICS [TOPICS ...], --topics TOPICS [TOPICS ...]
Topic number(s) to create the model evaluation statistics for.
-v, --verbose Enable verbose mode.
Requirements: the environment should contain https://github.com/aertslab/pycisTopic/tree/polars_1xx
import pickle
import gzip
from scipy import sparse
from scipy.io import mmwrite
import pycisTopic
pycisTopic.__version__
'2.0a0'
import os
os.environ['PATH'] = '../../../../mambaforge/vsc34498/envs/pycistopic_polars_nov24/bin:' + os.environ['PATH']
!which pycistopic
../../../../mambaforge/vsc34498/envs/pycistopic_polars_nov24/bin/pycistopic
# requirements: download mallet
!wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
!tar -xf Mallet-202108-bin.tar.gz
Load and prepare data
cto = 'data/cistopic_obj.pkl.gz'
cistopic_obj = pickle.load(gzip.open(cto,"rb"))
print(cistopic_obj)
CistopicObject from project cisTopic_merge with n_cells × n_regions = 15729 × 885305
# get binary matrix
binary_matrix = cistopic_obj.binary_matrix
# write to file
mmwrite("data/binary_accessibility_matrix.mtx", binary_matrix)
binary_matrix
<885305x15729 sparse matrix of type '<class 'numpy.int32'>'
with 424253009 stored elements in Compressed Sparse Row format>
# get region names and cell barcodes
regions = cistopic_obj.region_names
cells = cistopic_obj.cell_names
# write to files
with open('data/region_names.txt', 'w') as f:
    for name in regions:
        f.write(f"{name}\n")
with open('data/cell_barcodes.txt', 'w') as f:
    for name in cells:
        f.write(f"{name}\n")
Run topic modeling
%%time
binary_matrix_path='data/binary_accessibility_matrix.mtx'
outfile='out/corpus'
mallet_path='../../software/mallet/Mallet-202108/bin/mallet'
## create corpus
os.system(f"pycistopic topic_modeling mallet create_corpus -i {binary_matrix_path} -o {outfile} -m 100 -b {mallet_path}")
Read binary accessibility matrix from "data/binary_accessibility_matrix.mtx" Matrix Market file.
Convert binary accessibility matrix to Mallet serialized corpus file "out/corpus".
CPU times: user 8.19 ms, sys: 5.99 ms, total: 14.2 ms
Wall time: 4min 54s
%%time
mallet_corpus_path='out/corpus'
output_prefix='out/models/tm'
## run topic modeling
os.system(f"pycistopic topic_modeling mallet run -i {mallet_corpus_path} -o {output_prefix} -t 5 10 15 20 25 30 -p 8 -n 150 -a 50 -A True -e 0.1 -E False -s 555 -m 300 -b {mallet_path} -v")
Run topic modeling with Mallet with the following settings:
- Mallet corpus filename: out/corpus
- Output prefix: out/models/tm
- Number of topics to run topic modeling for: [5, 10, 15, 20, 25, 30]
- Alpha: 50
- Divide alpha by the number of topics: True
- Eta: 0.1
- Divide eta by the number of topics: False
- Number of iterations: 150
- Number threads Mallet is allowed to use: 8
- Seed: 555
- Amount of memory Mallet is allowed to use: 300G
- Mallet binary: ../../software/mallet/Mallet-202108/bin/mallet
Running Mallet topic modeling for 5 topics.
-------------------------------------------
2024-11-08 09:41:30,136 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 5 --alpha 50 --beta 0.02 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.5_topics.region_topic_counts.txt --output-doc-topics out/models/tm.5_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 10:32:50,467 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.5_topics.cell_topic_probabilities.parquet".
2024-11-08 10:33:00,407 LDAMallet INFO Write region-topic counts to "out/models/tm.5_topics.region_topic_counts.parquet".
2024-11-08 10:33:08,113 LDAMallet INFO Write JSON parameters file to "out/models/tm.5_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.5_topics.*"...
Running Mallet topic modeling for 10 topics.
--------------------------------------------
2024-11-08 10:33:08,194 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 10 --alpha 50 --beta 0.01 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.10_topics.region_topic_counts.txt --output-doc-topics out/models/tm.10_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 11:36:48,765 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.10_topics.cell_topic_probabilities.parquet".
2024-11-08 11:37:01,800 LDAMallet INFO Write region-topic counts to "out/models/tm.10_topics.region_topic_counts.parquet".
2024-11-08 11:37:10,882 LDAMallet INFO Write JSON parameters file to "out/models/tm.10_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.10_topics.*"...
Running Mallet topic modeling for 15 topics.
--------------------------------------------
2024-11-08 11:37:10,917 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 15 --alpha 50 --beta 0.006666666666666667 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.15_topics.region_topic_counts.txt --output-doc-topics out/models/tm.15_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 12:58:49,345 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.15_topics.cell_topic_probabilities.parquet".
2024-11-08 12:59:03,232 LDAMallet INFO Write region-topic counts to "out/models/tm.15_topics.region_topic_counts.parquet".
2024-11-08 12:59:13,105 LDAMallet INFO Write JSON parameters file to "out/models/tm.15_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.15_topics.*"...
Running Mallet topic modeling for 20 topics.
--------------------------------------------
2024-11-08 12:59:13,301 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 20 --alpha 50 --beta 0.005 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.20_topics.region_topic_counts.txt --output-doc-topics out/models/tm.20_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 14:20:37,883 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.20_topics.cell_topic_probabilities.parquet".
2024-11-08 14:20:50,946 LDAMallet INFO Write region-topic counts to "out/models/tm.20_topics.region_topic_counts.parquet".
2024-11-08 14:21:01,246 LDAMallet INFO Write JSON parameters file to "out/models/tm.20_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.20_topics.*"...
Running Mallet topic modeling for 25 topics.
--------------------------------------------
2024-11-08 14:21:01,505 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 25 --alpha 50 --beta 0.004 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.25_topics.region_topic_counts.txt --output-doc-topics out/models/tm.25_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 15:42:55,972 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.25_topics.cell_topic_probabilities.parquet".
2024-11-08 15:43:08,348 LDAMallet INFO Write region-topic counts to "out/models/tm.25_topics.region_topic_counts.parquet".
2024-11-08 15:43:19,204 LDAMallet INFO Write JSON parameters file to "out/models/tm.25_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.25_topics.*"...
Running Mallet topic modeling for 30 topics.
--------------------------------------------
2024-11-08 15:43:19,758 LDAMallet INFO Train topics with Mallet LDA: ../../software/mallet/Mallet-202108/bin/mallet train-topics --input out/corpus --num-topics 30 --alpha 50 --beta 0.0033333333333333335 --optimize-interval 0 --num-threads 8 --num-iterations 150 --word-topic-counts-file out/models/tm.30_topics.region_topic_counts.txt --output-doc-topics out/models/tm.30_topics.cell_topic_probabilities.txt --doc-topics-threshold 0.0 --random-seed 555
2024-11-08 17:09:20,901 LDAMallet INFO Write cell-topic probabilities to "out/models/tm.30_topics.cell_topic_probabilities.parquet".
2024-11-08 17:09:33,669 LDAMallet INFO Write region-topic counts to "out/models/tm.30_topics.region_topic_counts.parquet".
2024-11-08 17:09:43,965 LDAMallet INFO Write JSON parameters file to "out/models/tm.30_topics.parameters.json".
Writing Mallet topic modeling output files to "out/models/tm.30_topics.*"...
CPU times: user 768 ms, sys: 286 ms, total: 1.05 s
Wall time: 7h 28min 24s
%%time
binary_matrix_path='data/binary_accessibility_matrix.mtx'
cells_path='data/cell_barcodes.txt'
regions_path='data/region_names.txt'
output_prefix='out/models/tm'
os.system(f"pycistopic topic_modeling mallet stats -i {binary_matrix_path} -c {cells_path} -r {regions_path} -o {output_prefix} -t 5 10 15 20 25 30 -v")
Read binary accessibility matrix from "data/binary_accessibility_matrix.mtx" Matrix Market file.
Read cell barcodes filename "data/cell_barcodes.txt".
Read region IDs filename "data/region_names.txt".
Calculate model evaluation statistics for 5 topics from "out/models/tm.5_topics.*"...
2024-11-12 09:28:18,603 cisTopic INFO Saving model with 5 topics at out/models/tm.5_topics.model.pkl
Calculate model evaluation statistics for 10 topics from "out/models/tm.10_topics.*"...
2024-11-12 09:29:03,684 cisTopic INFO Saving model with 10 topics at out/models/tm.10_topics.model.pkl
Calculate model evaluation statistics for 15 topics from "out/models/tm.15_topics.*"...
2024-11-12 09:29:58,803 cisTopic INFO Saving model with 15 topics at out/models/tm.15_topics.model.pkl
Calculate model evaluation statistics for 20 topics from "out/models/tm.20_topics.*"...
2024-11-12 09:31:04,088 cisTopic INFO Saving model with 20 topics at out/models/tm.20_topics.model.pkl
Calculate model evaluation statistics for 25 topics from "out/models/tm.25_topics.*"...
2024-11-12 09:32:19,229 cisTopic INFO Saving model with 25 topics at out/models/tm.25_topics.model.pkl
Calculate model evaluation statistics for 30 topics from "out/models/tm.30_topics.*"...
2024-11-12 09:33:43,728 cisTopic INFO Saving model with 30 topics at out/models/tm.30_topics.model.pkl
CPU times: user 11.6 ms, sys: 5.01 ms, total: 16.6 ms
Wall time: 6min 7s
@JulieDeMan Thank you very much for providing the updated example code. It is extremely helpful, and I am excited to test it out to potentially speed up my analysis.
I noticed that a couple of parameters differ in the topic modeling step compared to the pycisTopic tutorial (https://pycistopic.readthedocs.io/en/latest/notebooks/human_cerebellum.html):
- -n (number of iterations): 150 in your example vs. 500 in the tutorial.
- -E (eta_by_topic): set to True in your example vs. False in the tutorial.
Since I already obtained some results using the standard pycisTopic pipeline, I am inclined to keep these parameters unchanged in the updated code unless these differences are critical to the new pipeline.
Do you have specific recommendations for choosing one setting over the other? I’d greatly appreciate your insights!
Sébastien
Hi @Sebdumas,
Regarding the number of iterations: we currently use a lower number because we observed that, for most of our datasets, the log-likelihood tends to converge around 150-200 iterations. In most cases there are no significant improvements beyond that point, so we reasoned that it is more efficient — primarily in terms of time — to reduce the number of iterations. It's safe to keep your original number of iterations; it will just take more time.
The eta_by_topic parameter is a mistake from my side; this one should indeed be set to False, as in the tutorial. Thanks for noticing, I'll update the example to avoid confusion.
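Consistent with this, the --beta values in the training log above equal eta divided by the number of topics (0.1/5 = 0.02, 0.1/10 = 0.01, and so on). A small sketch of that arithmetic, checked against the logged values:

```python
eta = 0.1
topic_numbers = [5, 10, 15, 20, 25, 30]

# --beta values exactly as they appear in the Mallet training log above.
logged_betas = [0.02, 0.01, 0.006666666666666667, 0.005, 0.004, 0.0033333333333333335]

computed = [eta / n for n in topic_numbers]
```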
What type of problem are you experiencing, and which function is your problem related to? I am running cisTopic on a very large dataset (200k cells, approx. 80k regions) and it apparently takes very long.
I am running the Mallet version of pycisTopic, and the function has these params:
Is there a way I can speed up the computations? At the moment it has been running for more than 4 days, and I plan to run it on an even bigger dataset (1M cells). I have the feeling I might be doing something wrong and that maybe I could do something differently (maybe not use Mallet? not sure). Do you have suggestions on this?
The machine it runs on has 64 CPUs and 500GB of RAM available.
Version information pycisTopic: 1.0.3.dev20+g8955c76