cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.

Hyperparameter tuning is too slow #1

Closed longfeili5170 closed 2 years ago

longfeili5170 commented 2 years ago

Hi! I am trying MIRA on my own dataset, but the 'Hyperparameter tuning' step has been running for a long time with no results. I am running the code on a GPU server. Did I forget to set something that is causing it to run slowly? Thanks!

longfeili5170 commented 2 years ago

I checked my GPU processes and confirmed that the program is not using the GPU. What do I need to set so that it uses the GPU?

AllenWLynch commented 2 years ago

Hi,

  1. What version of MIRA are you using?
  2. How many cells are in your dataset?
  3. And about how long is each model taking to evaluate?

The hyperparameter tuning step will generally take a good bit of time while MIRA finds the best number of topics to describe your dataset, but each model will evaluate quickly.

By default, MIRA will try to use a GPU if one is available. You can check to see if pytorch is using the GPU by running these commands: https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu
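Following the linked StackOverflow answer, a quick check looks like this:

```python
import torch

# True if pytorch can see a CUDA-capable GPU
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the first visible device
    print(torch.cuda.get_device_name(0))
```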

longfeili5170 commented 2 years ago

My cudatoolkit was not compatible with pytorch; updating cudatoolkit solved the GPU problem, and the hyperparameter tuning now runs on the GPU. The version of MIRA is 0.09. There are 32,000 cells in my dataset. I have spent six hours so far and got the following result (still running).

[Screenshot 2021-12-14 21:32]

longfeili5170 commented 2 years ago

I also have some questions about the ATAC hyperparameters. Is my code correct?

[Screenshot 2021-12-14 22:13]

I have a total of 18,000 peaks, which resulted in the following error:

[Screenshot 2021-12-14 22:12]

Do I need to do some processing on ATAC data? Thanks!

AllenWLynch commented 2 years ago

Hi, good, you should be running faster with a GPU now. 32,000 cells is a good-sized dataset, so this will take a good bit of time. The time spent on hyperparameter tuning is time well spent, though, since the topics will help you understand the dynamics of your dataset, and you want them to reflect true biological sources of covariation and co-regulation. I like to run these steps overnight. You can stop the tuning whenever you like and then run the next steps.

There are ways you could speed this up. First, instead of five-fold cross-validation for each trial, you can pass a lower value (for instance, 2) to the "cv" parameter of the TopicModelTuner object. You can also pass your own sklearn.model_selection object to the "cv" parameter; for example, a "ShuffleSplit" object with n_splits set to 1 skips cross-validation entirely. With 32,000 cells, the variance of your model performance estimates may be low enough for these options to work well.
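A minimal sketch of the ShuffleSplit option (the TopicModelTuner call below is hypothetical, mirroring the suggestion above; the exact constructor signature may differ between MIRA versions):

```python
from sklearn.model_selection import ShuffleSplit

# One 80/20 train/test split instead of 5-fold cross-validation,
# so each hyperparameter trial trains only a single model.
single_split = ShuffleSplit(n_splits=1, train_size=0.8)

# Hypothetical usage:
# tuner = mira.topics.TopicModelTuner(model, cv=single_split)  # or cv=2
```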

Also, when you see that CUDA memory error, it's best to just shut down the notebook instance and start a new one. Occasionally, poor garbage collection from pytorch will overload the GPU memory after training a lot of models.
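Before restarting, you can sometimes reclaim cached memory manually; this only releases pytorch's cache, so a kernel restart remains the more reliable fix:

```python
import gc

import torch

gc.collect()              # drop unreachable Python references to tensors
torch.cuda.empty_cache()  # release cached GPU memory back to the driver
```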

For ATAC-seq, MIRA will handle binarization within the model, so you don't need to preprocess, and that code looks good!
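For intuition, binarization just thresholds the fragment counts to presence/absence; an illustrative numpy sketch (not MIRA's actual internals):

```python
import numpy as np

# Raw fragment counts: cells x peaks
counts = np.array([[0, 2, 1],
                   [3, 0, 0]])

# Presence/absence of accessibility per peak
binary = (counts > 0).astype(int)
print(binary)  # [[0 1 1]
               #  [1 0 0]]
```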

longfeili5170 commented 2 years ago

Thanks again! :)