Topics has been calculated to number 350, but the loglikelihood is still not optimal.

aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Topics has been calculated to number 350, but the loglikelihood is still not optimal. #114

Open YH-Zheng opened 8 months ago

YH-Zheng commented 8 months ago

I randomly sampled 1k cells from each cell type in a dataset of 4 million cells and obtained an ATAC matrix of 55243 cells x 165804 peaks. However, when I run the topic models, the log-likelihood still has not reached its optimum at 350 topics. Does this mean I need to increase the number of topics? 350 is already large relative to the example, so how do I pick the optimal number of topics?

from pycisTopic.lda_models import run_cgs_models_mallet

# Fit one LDA model per candidate number of topics using Mallet.
models = run_cgs_models_mallet(
    path_to_mallet_binary,
    cistopic_obj,
    n_topics=[200, 250, 300, 350],
    n_cpu=55,
    n_iter=150,
    random_state=555,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path=tmp_dir,  # Use SCRATCH if many models or big data set
    save_path=work_dir,
    reuse_corpus=True,
)

[Attached image: download-3]

SeppeDeWinter commented 8 months ago

Hi @YH-Zheng

Indeed, choosing the correct number of topics can be a bit tricky and subjective. I would not run models with a larger number of topics. I would choose the model with 200 topics in your case, that's the point where most metrics are maximised.

After selecting the model, you should check whether your topics represent your cell types well. Based on plotting cell-topic probabilities: do you have a topic that is specific for each cell type? And based on motif enrichment: are the regions in the topics enriched for the motifs you are expecting?
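As a rough illustration of the first check, the sketch below averages cell-topic probabilities per cell type and flags cell types whose strongest topic is shared with another cell type. It is a minimal, standard-library sketch on made-up toy values, not pycisTopic code: the function names, the list-of-lists layout of `cell_topic`, and the labels are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_topic_prob_by_cell_type(cell_topic, cell_types):
    """Average each topic's probability over the cells of every cell type.

    cell_topic: one list of topic probabilities per cell (cells x topics).
    cell_types: one cell type label per cell, in the same order.
    """
    sums = {}
    counts = defaultdict(int)
    for probs, label in zip(cell_topic, cell_types):
        if label not in sums:
            sums[label] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[label][k] += p
        counts[label] += 1
    return {
        label: [s / counts[label] for s in totals]
        for label, totals in sums.items()
    }


def best_topic_per_cell_type(means):
    """Return, for each cell type, the index of its highest-mean topic."""
    return {
        label: max(range(len(m)), key=m.__getitem__)
        for label, m in means.items()
    }


def cell_types_sharing_a_top_topic(best):
    """Group cell types by top topic; keep only topics claimed by several types."""
    groups = defaultdict(list)
    for label, topic in best.items():
        groups[topic].append(label)
    return {t: sorted(labels) for t, labels in groups.items() if len(labels) > 1}


# Toy values (hypothetical): 5 cells, 3 topics.
cell_topic = [
    [0.8, 0.1, 0.1],  # B
    [0.7, 0.2, 0.1],  # B
    [0.1, 0.8, 0.1],  # CD4T
    [0.2, 0.7, 0.1],  # CD4T
    [0.1, 0.7, 0.2],  # CD8T
]
cell_types = ["B", "B", "CD4T", "CD4T", "CD8T"]

means = mean_topic_prob_by_cell_type(cell_topic, cell_types)
best = best_topic_per_cell_type(means)
shared = cell_types_sharing_a_top_topic(best)
print(best)
print(shared)
```

In this toy case the two T cell labels share their strongest topic, which is the kind of result that would suggest the model does not yet separate them.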

All the best,

Seppe

YH-Zheng commented 8 months ago

Hi @SeppeDeWinter

Thanks for your reply. You mean that the appropriate number of topics is the one that makes all four indicators as large as possible, but for both of the metrics you mentioned in the tutorial (Arun_2010, Cao_Juan_2009), the better the model, the lower the metric.

Arun_2010: Uses a density-based metric as in Arun et al (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.

Cao_Juan_2009: Uses a divergence-based metric as in Cao Juan et al (2009) using the topic-region distribution. The better the model, the lower the metric.
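To see why a lower value can mean a better model, here is a minimal sketch of one common way to summarise the Cao Juan idea: the average pairwise cosine similarity between topic-region distributions, where redundant topics push the value up. This is an assumption about the general approach, not pycisTopic's implementation, and the function names and toy distributions are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(topic_region):
    """Mean cosine similarity over all pairs of topics (lower = more distinct)."""
    pairs = list(combinations(topic_region, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy topic-region distributions (hypothetical): 3 topics over 3 regions.
distinct = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.1, 0.8, 0.1]]
redundant = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.8, 0.1]]

score_distinct = average_pairwise_similarity(distinct)
score_redundant = average_pairwise_similarity(redundant)
print(score_distinct < score_redundant)
```

The model whose topics overlap less gets the lower score, which matches the "the lower the metric, the better" convention quoted from the tutorial.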

If the chosen number of topics does not separate my ATAC data according to my cell type annotation, would it be better to divide the cells into subsets and run topic modelling separately on each (e.g., B cells, CD4T cells, and the many smaller subsets within these large groups)? Or should I increase the number of topics?

Best wishes,

Yuhui

SeppeDeWinter commented 8 months ago

Hi @YH-Zheng

You are correct about those two metrics, however for plotting them we invert their values (hence the "inv" prefix).
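A minimal sketch of what such an inversion can look like, assuming the values are min-max scaled to [0, 1] and then flipped so that higher is better. The exact transformation pycisTopic applies may differ; the helper names and the metric values below are made up for illustration.

```python
def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models score the same; there is nothing to rank.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def invert(values):
    """Turn a lower-is-better metric into a higher-is-better one."""
    return [1.0 - v for v in min_max(values)]


# Hypothetical lower-is-better metric values for four models.
n_topics = [200, 250, 300, 350]
arun_like = [10.0, 12.0, 15.0, 20.0]

inv_arun_like = invert(arun_like)
best_n_topics = n_topics[inv_arun_like.index(max(inv_arun_like))]
print(inv_arun_like, best_n_topics)
```

After inversion, the model with the lowest raw value sits at the top of the curve, so all four curves can be read the same way.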

I would not run topic modelling separately per cell type: you need the background of the other cell types to be able to identify cell-type-specific regions. In that case I would indeed increase the number of topics.

All the best,

Seppe