dpeerlab / SEACells

SEACells algorithm for Inference of transcriptional and epigenomic cellular states from single-cell genomics data
GNU General Public License v2.0
145 stars 27 forks source link

model.fit hangs indefinitely #18

Open smorabit opened 2 years ago

smorabit commented 2 years ago

HI,

I was able to run this tutorial on the provided sample data. However, I am running into problems when using my own dataset (~58k cells and ~30k genes). Here is my SEACells code:


# they recommend one metacell for every 75 real cells
n_SEACells = int(np.floor(adata.obs.shape[0] / 75))

#build_kernel_on = 'X_pca' # key in ad.obsm to use for computing metacells
build_kernel_on = 'X_harmony' # key in ad.obsm to use for computing metacells
                          # This would be replaced by 'X_svd' for ATAC data

## Additional parameters
n_waypoint_eigs = 10 # Number of eigenvalues to consider when initializing metacells
waypoint_proportion = 0.9 # Proportion of metacells to initialize using waypoint analysis,
                        # the remainder of cells are selected by greedy selection

model = SEACells.core.SEACells(adata,
                  build_kernel_on=build_kernel_on,
                  n_SEACells=n_SEACells,
                  n_waypoint_eigs=n_waypoint_eigs,
                  waypt_proportion=waypoint_proportion,
                  convergence_epsilon = 1e-5)

# Initialize archetypes
model.initialize_archetypes()

model.fit(n_iter=20)
Randomly initialized A matrix.
Setting convergence threshold at 0.05610640134547047
Starting iteration 1.
Completed iteration 1.

It hangs up after completing iteration 1 for several hours, and eventually I killed it. #7 references model.fit taking too long but this seems like a different issue. Please let me know if you have any insights!

ManuSetty commented 2 years ago

Hi there - Sorry for the delayed response. I wonder if your compute is running out of memory? We will shortly release a GPU version which should solve this too.

smorabit commented 2 years ago

I was running it on my laptop with only 16 gb of memory so might have been a memory issue. Do you have a sense for how much memory is recommended, or how memory usage scales with the size of the input dataset?

BingjieZhang commented 1 year ago

Having the same problem....I was running it on lab sever with 200G memory with 17K cells. Any idea about how I could proceed?

Sayyam-Shah commented 1 year ago

Thank you for the amazing tool! I'm excited to implement seacells in my workflow; however, I'm experiencing the same problem with 30k cells on the HPC cluster with 184 GB.

sitarapersad commented 1 year ago

This is a bit curious, as we were able to run 120k cells on a cluster with 200GB of memory, so memory-wise, you should be fine. It may be slow to run, though - 75k cells took 18 hours and 50k cells took 8 hours in our updated benchmarks. How long did you let the process run before killing it?

Sayyam-Shah commented 1 year ago

That makes sense, thank you! I let it run for an hour before killing it since it did not complete a single iteration. I did not expect it to take 8+ hours. I'll submit the job again while supplying more time.

Thank you!

sitarapersad commented 1 year ago

Gotcha - yes sorry it only prints every 10 iterations or so, which might be confusing! Let me know if it still does not work. A GPU sped up version is in the works to ease some of the runtime!

Sayyam-Shah commented 1 year ago

That's amazing! I look forward to the GPU version.

Sayyam-Shah commented 1 year ago

Hello @sitarapersad,

I hope you are having a great day and thank you for the interesting tool!

I submitted a job to compute ATAC seacells for my dataset of 30k cells. However, it has not completed the 10 iterations mark after 3 days on the HPC cluster. I supplied the job with 300 GB and 10 cores. Would you recommend I kill the job, increase the resources, or potentially wait for the GPU version to be released? There is 1 day and 21 hours left on the job. This is unexpected, especially since it took you 18 hours for 75k cells.

Below is the output so far from the job.

findfont: Font family ['Raleway'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Lato'] not found. Falling back to DejaVu Sans.
/cluster/home/.local/lib/python3.7/site-packages/palantir/utils.py:56: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  temp = sc.AnnData(data_df.values)
Computing kNN graph using scanpy NN ...
Computing radius for adaptive bandwidth kernel...
  0%|          | 0/30980 [00:00<?, ?it/s]
Making graph symmetric...
Computing RBF kernel...
  0%|          | 0/30980 [00:00<?, ?it/s]
Building similarity LIL matrix...
  0%|          | 0/30980 [00:00<?, ?it/s]
Constructing CSR matrix...
Building kernel on X_svd
Computing diffusion components from X_svd for waypoint initialization ...
Determing nearest neighbor graph...
Done.
Sampling waypoints ...
Done.
Selecting 4817 cells from waypoint initialization.
Initializing residual matrix using greedy column selection
Initializing f and g...
100%|██████████| 5193/5193 [1:56:36<00:00,  1.35s/it]

Script:

## Core parameters
n_SEACells = 10000
build_kernel_on = 'X_svd' # key in ad.obsm to use for computing metacells
                          # This would be replaced by 'X_svd' for ATAC data

## Additional parameters
n_waypoint_eigs = 10 # Number of eigenvalues to consider when initializing metacells

model = SEACells.core.SEACells(adatac,
                  build_kernel_on=build_kernel_on,
                  n_SEACells=n_SEACells,
                  n_waypoint_eigs=n_waypoint_eigs,
                  convergence_epsilon = 1e-5)

model.construct_kernel_matrix()
M = model.kernel_matrix

model.initialize_archetypes()

model.fit(min_iter=10, max_iter=50)

SEACell_soft_ad = SEACells.core.summarize_by_soft_SEACell(adatac, model.A_, celltype_label='CellType_Annotation_formatted',summarize_layer='raw', minimum_weight=0.05)
sitarapersad commented 1 year ago

Hi Sayyam! I think this is likely because there are too many SEACells. We recommend about 1 SEACell for every 100 cells, so for your dataset, something on the order of 300 SEACells (rather than 10k). At 10k it'll be very slow and the result will likely still be too sparse to gain robust states.

Sayyam-Shah commented 1 year ago

Hello!

That makes sense! Thank you. I'll try that!

In #22, it was recommended to compute RNA metacells using the ATAC seacells. May you please explain the reasoning behind the recommendation since, according to my understanding, it is harder to identify robust cell states in ATAC because of the sparsity?

I'm thinking of computing the ATAC seacells and then the RNA seacells using SEACells.genescores.prepare_multiome_anndata. From there, I'll feed both modalities into scenic+ to identify TFs.

YH-Zheng commented 1 year ago

Hello, currently I have a very large single-cell RNA dataset with more than 5 million cells. I tested the seacell pipeline on some of the data and found that as other comments said, the running speed is very slow. We expect to use seacell to find the optimal metacell to reduce the amount of data in the entire project, but the time spent in this step cannot be estimated at present. I noticed that the GPU version is already available? How does the running speed compare to the cpu version? Or do you have a recommended workflow for such a large dataset? Thanks again for the excellent work you do.