aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Running pycisTopic on very large datasets [PERFORMANCE] #106

Open simozhou opened 6 months ago

simozhou commented 6 months ago

What type of problem are you experiencing and which function is your problem related to? I am running cisTopic on a very large dataset (200k cells, approximately 80k regions) and it is taking very long.

I am running the Mallet version of pycisTopic, and the function is called with these parameters:

models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[2,5,10,15,20,25,30,35,40,45,50,60,70,80,90,100,150],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path, #Use SCRATCH if many models or big data set
                    save_path=None)

Is there a way I can speed up the computation? At the moment it has been running for more than 4 days, and I plan to run it on an even bigger dataset (1M cells). I have the feeling I might be doing something wrong, and that maybe I could do something differently (maybe not use Mallet? I'm not sure). Do you have any suggestions?

The machine it runs on has 64 CPUs and 500GB of RAM available.

Version information: pycisTopic 1.0.3.dev20+g8955c76

SeppeDeWinter commented 5 months ago

Hi @simozhou

This step can take a long time; however, 4 days is still a lot.

Did any intermediate models finish in this time, or is it stuck at running the model with 2 topics?

I would also suggest specifying a save_path ("Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None."). This will save any intermediate models as they finish.
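For example, re-using the parameters from your snippet above, the only change would be something like the sketch below (the "mallet_models" directory name is just an illustration):

import os

# Directory where each completed model will be written as its own file
# ("mallet_models" is only an example name).
save_dir = "mallet_models"
os.makedirs(save_dir, exist_ok=True)

models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[2,5,10,15,20,25,30,35,40,45,50,60,70,80,90,100,150],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path,
                    save_path=save_dir) # each model is saved here as it completes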

All the best,

Seppe

simozhou commented 5 months ago

Hi @SeppeDeWinter,

Thank you so much for your feedback!

All models do run eventually, although very slowly (the 2-topic model runs faster, and for obvious reasons models with more topics are slower).

I will definitely add a save path to avoid recalculating all models every time.

I am providing 450 GB of RAM for this job. Do you think a larger amount of RAM would help speed up the computation?

Thanks again and best regards, Simone

SeppeDeWinter commented 5 months ago

Hi @simozhou

450 GB of RAM should be enough. I'm not sure why it's running so slowly for you...

All the best,

Seppe

tiffanywc commented 4 months ago

I am also running Mallet with a very large dataset. I have saved intermediate models in case the run terminates before completion. In this case, how can I combine the topic models from multiple runs into a single mallet.pkl?

SeppeDeWinter commented 3 months ago

Hi @tiffanywc

We store each model as an entry in a list. Some pseudocode below:

import os
import pickle

models = []
for file in os.listdir(<PATH_TO_DIRECTORY_WITH_MODELS>):
   # check whether the file is a result from topic modelling, e.g. based on the name
   if file.endswith(".pkl"):
      with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, file), "rb") as handle:
         models.append(pickle.load(handle))
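If you then want these combined into a single file like the mallet.pkl you mentioned, a minimal sketch (assuming mallet.pkl should simply contain this list of models) would be:

# Sketch: write the combined list of models back out as one pickle file.
# The file name "mallet.pkl" is only an example, taken from the question above.
with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, "mallet.pkl"), "wb") as handle:
   pickle.dump(models, handle)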

I hope this helps?

All the best,

Seppe

TemiLeke commented 2 weeks ago

Hello @simozhou,

I'm wondering if you managed to find a resolution, because I'm currently facing a similar challenge:

Despite the seemingly small number of topics and substantial computational resources, the process is taking an unexpectedly long time. Have you encountered any solutions or optimizations that might help in this scenario? Any insights or workarounds you've discovered would be greatly appreciated.

Thank you!

simozhou commented 2 weeks ago

Hi @TemiLeke,

In short, no, I have not yet solved the runtime problem, but a few improvements helped make it at least tolerable:

  1. Saving each model as it completes (via save_path) helped a lot to avoid re-running the whole experiment if something failed (usually a TIMEOUT error from the HPC 😅).
  2. Setting reuse_corpus=True also helped a lot: I realised that the Mallet corpus file was otherwise re-written for every model, so reusing it saved some time.
  3. If you are working on an HPC, make sure you are not using more than one node. The algorithm is not optimised to run in a distributed fashion, so spreading it over several nodes makes things much slower than they should be. I was running cisTopic with 128 CPUs, only to realise that each node on my HPC had 64 CPUs, which paradoxically slowed the computations down! See the small sanity check after this list.
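As that sanity check, here is a small Linux-only sketch that caps n_cpu at the number of CPUs actually visible to the job on the current node, so Mallet is never asked for more workers than one node provides:

import os

# Number of CPUs this process is actually allowed to use on this node
# (Linux-only; on other platforms os.cpu_count() is a coarser fallback).
available_cpus = len(os.sched_getaffinity(0))

# Never request more Mallet workers than the node can provide.
n_cpu = min(64, available_cpus)

You could then pass this value as n_cpu in the call below.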

This is the code I'm currently using:

# this would be the first time we run cisTopic on this data set
models=run_cgs_models_mallet(path_to_mallet_binary,
                    cistopic_obj,
                    n_topics=[10,15,20,50,60,70,80,90,100,150,200],
                    n_cpu=64,
                    n_iter=500,
                    random_state=420,
                    alpha=50,
                    alpha_by_topic=True,
                    eta=0.1,
                    eta_by_topic=False,
                    tmp_path=tmp_path, #Use SCRATCH if many models or big data set
                    save_path=os.path.join(args.outdir, 'models'),
                    reuse_corpus=True)

I would like to point out that the computation is still very slow, and it would be good to address this problem. Running my 1 million cell dataset with the parameters above took 8 days (which was somewhat expected, but it would be ideal to shorten this time for the next iteration if possible :) ).

@SeppeDeWinter is there something we can do to help? I would be happy to contribute and possibly figure out why this is so slow!

TemiLeke commented 2 weeks ago

Thanks a lot for the detailed reply @simozhou. I'm currently trying this out. Unfortunately I only have access to a 40-core system, so it will take even longer.

I agree it would be good to address the problem, and I'd be very happy to contribute in any capacity. @SeppeDeWinter