dylkot / cNMF

Code and example data for running Consensus Non-negative Matrix Factorization on single-cell RNA-Seq data
MIT License

How to adjust the convergence limit for the underlying sklearn NMF when using the command-line version of cNMF #79

Closed. erzakiev closed this issue 6 months ago.

erzakiev commented 7 months ago

Hello Dylan, I was wondering how you would adjust the convergence limit for the underlying NMF implementation from the sklearn package, so that the following warning goes away:

sklearn/decomposition/_nmf.py:1641: ConvergenceWarning: Maximum number of iterations 1000 reached. Increase it to improve convergence.  ConvergenceWarning,
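For context, the warning comes from sklearn's own NMF implementation and is controlled by its max_iter parameter. A minimal standalone sketch (random data, nothing cNMF-specific) shows the warning being raised under a tight iteration budget and typically disappearing once the budget is raised:

```python
# Minimal sketch of sklearn's ConvergenceWarning behavior. The data is
# random and purely illustrative; only max_iter changes between the runs.
import warnings
import numpy as np
from sklearn.decomposition import NMF
from sklearn.exceptions import ConvergenceWarning

rng = np.random.default_rng(0)
X = rng.random((50, 40))  # nonnegative matrix, stand-in for real counts


def fit_warns(max_iter):
    """Fit NMF with the given iteration cap; return True if it warned."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        NMF(n_components=5, init="random", max_iter=max_iter,
            random_state=0).fit(X)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


warned_short = fit_warns(5)     # far too few iterations -> warning
warned_long = fit_warns(5000)   # generous cap -> usually converges quietly
print(warned_short, warned_long)
```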

Is it possible to do this from the command line, or can it only be done when using Python interactively?

Running cnmf -h lists lots of options, but none of them seems related:

cnmf -h
usage: cnmf [-h] [--name [NAME]] [--output-dir [OUTPUT_DIR]] [-c COUNTS]
            [-k COMPONENTS [COMPONENTS ...]] [-n N_ITER]
            [--total-workers TOTAL_WORKERS] [--seed SEED]
            [--genes-file GENES_FILE] [--numgenes NUMGENES] [--tpm TPM]
            [--beta-loss {frobenius,kullback-leibler,itakura-saito}]
            [--init {random,nndsvd}] [--densify] [--worker-index WORKER_INDEX]
            [--local-density-threshold LOCAL_DENSITY_THRESHOLD]
            [--local-neighborhood-size LOCAL_NEIGHBORHOOD_SIZE]
            [--show-clustering]
            {prepare,factorize,combine,consensus,k_selection_plot}

positional arguments:
  {prepare,factorize,combine,consensus,k_selection_plot}

optional arguments:
  -h, --help            show this help message and exit
  --name [NAME]         [all] Name for analysis. All output will be placed in
                        [output-dir]/[name]/...
  --output-dir [OUTPUT_DIR]
                        [all] Output directory. All output will be placed in
                        [output-dir]/[name]/...
  -c COUNTS, --counts COUNTS
                        [prepare] Input (cell x gene) counts matrix as df.npz
                        or tab delimited text file
  -k COMPONENTS [COMPONENTS ...], --components COMPONENTS [COMPONENTS ...]
                        [prepare] Numper of components (k) for matrix
                        factorization. Several can be specified with "-k 8 9
                        10"
  -n N_ITER, --n-iter N_ITER
                        [prepare] Numper of factorization replicates
  --total-workers TOTAL_WORKERS
                        [all] Total number of workers to distribute jobs to
  --seed SEED           [prepare] Seed for pseudorandom number generation
  --genes-file GENES_FILE
                        [prepare] File containing a list of genes to include,
                        one gene per line. Must match column labels of counts
                        matrix.
  --numgenes NUMGENES   [prepare] Number of high variance genes to use for
                        matrix factorization.
  --tpm TPM             [prepare] Pre-computed (cell x gene) TPM values as
                        df.npz or tab separated txt file. If not provided TPM
                        will be calculated automatically
  --beta-loss {frobenius,kullback-leibler,itakura-saito}
                        [prepare] Loss function for NMF.
  --init {random,nndsvd}
                        [prepare] Initialization algorithm for NMF.
  --densify             [prepare] Treat the input data as non-sparse
  --worker-index WORKER_INDEX
                        [factorize] Index of current worker (the first worker
                        should have index 0)
  --local-density-threshold LOCAL_DENSITY_THRESHOLD
                        [consensus] Threshold for the local density filtering.
                        This string must convert to a float >0 and <=2
  --local-neighborhood-size LOCAL_NEIGHBORHOOD_SIZE
                        [consensus] Fraction of the number of replicates to
                        use as nearest neighbors for local density filtering
  --show-clustering     [consensus] Produce a clustergram figure summarizing
                        the spectra clustering
dylkot commented 6 months ago

Hey @erzakiev, I added this as a parameter in the development branch. It can be set from the command line in the prepare step:

cnmf prepare --output-dir example_PBMC/cNMF --name pbmc_cNMF -c example_PBMC/counts.h5ad -k 5 6 7 8 9 10 --n-iter 20 --total-workers 1 --seed 14 --numgenes 2000 --beta-loss frobenius --max-nmf-iter 1000

or from the Python environment:

cnmf_obj.prepare(counts_fn=countfn, components=np.arange(5,11), n_iter=20, seed=14,
                 num_highvar_genes=2000, max_NMF_iter=1000)

This will hopefully get pushed to the master branch and PyPI in the next week.

However, I would warn that if it isn't converging within 1000 iterations, something is usually wrong (e.g. K is far too high, or the data is normalized strangely).
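One quick way to act on this advice is to watch how the iteration count and reconstruction error behave as K grows on your own matrix. This is a hedged diagnostic sketch using plain sklearn (not cNMF's internals); the random data is purely illustrative:

```python
# Diagnostic sketch: if NMF keeps hitting the iteration cap, check how
# n_iter_ and reconstruction_err_ vary with K. A sharp jump in n_iter_
# at higher K can hint that K is too large for the data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(14)
X = rng.random((60, 30))  # stand-in for a (cell x gene) matrix

results = []
for k in (2, 5, 10):
    model = NMF(n_components=k, init="nndsvd", beta_loss="frobenius",
                max_iter=1000, tol=1e-4).fit(X)
    results.append((k, model.n_iter_, model.reconstruction_err_))
    print(f"K={k}: {model.n_iter_} iterations, "
          f"err={model.reconstruction_err_:.3f}")
```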

erzakiev commented 5 months ago

Awesome, thanks for the info and for the added feature!

erzakiev commented 5 months ago

Dylan, I wonder if this is related, but I noticed that when the algorithm approaches the last, say, 10% of allocated tasks, each task takes much longer to finish than the first several hundred tasks at the start of factorization. Is this by design? Do the last tasks handle decomposition of the nastiest parts of the matrix, or something like that?

# the first several hundred tasks are always quick
[Worker 3]. Starting task 699.
[Worker 4]. Starting task 628.
[Worker 10]. Starting task 694.
[Worker 5]. Starting task 677.
[Worker 9]. Starting task 717.
[Worker 7]. Starting task 655.
[Worker 6]. Starting task 750.
[Worker 1]. Starting task 733.
[Worker 0]. Starting task 816.
...
# tasks at the very end are much more sluggish
dylkot commented 5 months ago

Yes, I think that is because the later tasks usually correspond to larger values of K, which take longer to factorize.
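A toy illustration of the scheduling this implies: if the replicate tasks are enumerated with K in the outer loop (an assumption for illustration, not a reading of cNMF's source), the highest task indices all belong to the largest K, which is the slowest to factorize. The replicate count below is hypothetical:

```python
# Sketch of a (K, replicate) task enumeration where later task indices
# land on larger K. Illustrative only, not cNMF's actual internals.
from itertools import product

ks = range(5, 11)    # e.g. -k 5 6 7 8 9 10
n_iter = 140         # hypothetical number of replicates per K

tasks = list(product(ks, range(n_iter)))  # K varies in the outer loop
print(tasks[0])      # first task: smallest K  -> (5, 0)
print(tasks[-1])     # last task: largest K    -> (10, 139)
```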

I am going to add a feature to resubmit just the jobs that failed. Hopefully that will help you get these last iterations finished.

Overall, a strategy I'm finding useful is to do K selection with a lower number of iterations (e.g. 10), and then, once you've picked K, to run a larger number of iterations for the selected values of K.

I hope this helps!