JinmiaoChenLab / SEDR

MIT License

R environment #8

Closed Li-ZhiD closed 1 day ago

Li-ZhiD commented 5 months ago

The jupyter kernel shuts down when I run mclust. I checked and found the R environment is set as:

os.environ['R_HOME'] = '/scbio4/tools/R/R-4.0.3_openblas/R-4.0.3'

but it still doesn't work after changing it to "/usr".

rocketeer1998 commented 5 months ago

I've resolved this issue on my Linux machine. You can give it a try. And maybe SEDR's team @Li-ZhiD @HzFu @Xuhang01 can help fix this rpy2 bug.

Step1 Use conda rather than pip to install rpy2

conda install --yes rpy2

Compared with pip install, this prevents the jupyter kernel from dying while creating an R environment inside your conda environment. See the reference here

Step2 Set the R environment variables at the beginning of your notebook or script

import os
os.environ['R_HOME'] = '/mnt/data/tool/miniconda3/envs/SEDR/lib/R'  # your conda env R path 
os.environ['R_USER'] = '/mnt/data/tool/miniconda3/envs/SEDR/lib/python3.11/site-packages/rpy2'  # your conda env path that has installed rpy2
os.environ['R_LIBS'] = '/mnt/data/tool/miniconda3/envs/SEDR/lib/R/library' # your conda env R  library path 

The R library path can be modified using .libPaths(path_to_your_lib) in R.
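
If you prefer not to hard-code absolute paths, the same three variables can be derived from the active conda environment. This is a sketch assuming the standard conda layout (`lib/R` inside the env prefix), which matches the paths shown above:

```python
import os
import sys
import sysconfig

# sys.prefix is the active conda env prefix, e.g. .../miniconda3/envs/SEDR
r_home = os.path.join(sys.prefix, "lib", "R")
site_packages = sysconfig.get_paths()["purelib"]  # .../lib/pythonX.Y/site-packages

os.environ["R_HOME"] = r_home
os.environ["R_USER"] = os.path.join(site_packages, "rpy2")
os.environ["R_LIBS"] = os.path.join(r_home, "library")
```

This keeps the notebook portable across machines, as long as it runs inside the SEDR conda environment.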

Step3 Check whether rpy2 uses the right R

import rpy2.robjects as ro
r_home = ro.r('R.home()')
r_home

Step4 Modify the mclust_R function in SEDR

def mclust_R(adata, n_clusters, use_rep='SEDR', key_added='SEDR', random_seed=2024):
    """
    Clustering using the mclust algorithm.
    The parameters are the same as those in the R package mclust.
    """
    # import os
    # os.environ['R_HOME'] = '/mnt/data/tool/miniconda3/envs/SEDR'
    # os.environ['R_USER'] = '/mnt/data/tool/miniconda3/envs/SEDR/lib/python3.11/site-packages/rpy2/'
    modelNames = 'EEE'

    np.random.seed(random_seed)
    import rpy2.robjects as robjects
    robjects.r.library("mclust")

    import rpy2.robjects.numpy2ri
    rpy2.robjects.numpy2ri.activate()
    r_random_seed = robjects.r['set.seed']
    r_random_seed(random_seed)
    rmclust = robjects.r['Mclust']

    res = rmclust(rpy2.robjects.numpy2ri.numpy2rpy(adata.obsm[use_rep]), n_clusters, modelNames)
    mclust_res = np.array(res[-2])

    adata.obs[key_added] = mclust_res
    adata.obs[key_added] = adata.obs[key_added].astype('int')
    adata.obs[key_added] = adata.obs[key_added].astype('category')

    return adata

We should comment out the os.environ lines at the top of the function because we won't use them. Hope the SEDR team can fix this.

Step5 Run your code

Now everything should go smoothly. Hope this helps.

Lessons that I've learned

  1. If you decide to use the rpy2 package, the first thing to do after creating a new conda environment is conda install rpy2, because conda prepares everything, including the R executable, which can't be achieved with pip install rpy2.
  2. When setting os.environ['R_HOME'], we should write ../envs/SEDR/lib/R rather than ../envs/SEDR/bin/R, even though which R finds your R executable at ../envs/SEDR/bin/R. Otherwise, the kernel will die.
  3. It's a good idea to set up a separate environment for rpy2 to prevent future conflicts, because this package is fragile.
  4. If you can't use pip in the terminal, just refer to this article.
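
Lesson 2 can be turned into a tiny path rewrite: take the path that which R reports and swap the trailing bin/R for lib/R. The example path below is hypothetical:

```python
import os

# Hypothetical output of `which R` inside the conda env
r_bin = "/mnt/data/tool/miniconda3/envs/SEDR/bin/R"

# Strip the trailing bin/R, then point R_HOME at lib/R instead
env_prefix = os.path.dirname(os.path.dirname(r_bin))
r_home = os.path.join(env_prefix, "lib", "R")
os.environ["R_HOME"] = r_home  # on Linux: /mnt/data/tool/miniconda3/envs/SEDR/lib/R
```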

Other questions

  1. What is the meaning of 12 in graph_dict = SEDR.graph_construction(adata, 12)? It differs across technologies. Is it a monotonic parameter?
  2. What is the meaning of N in sedr_net.train_with_dec(N=1)?
  3. What is the rate-limiting step? How can it be sped up for large datasets?
  4. Why is running the second dataset in a single notebook greatly slower than the first?

Xuhang01 commented 5 months ago

@rocketeer1998 Thanks a lot for your assistance. Hope it works for @Li-ZhiD. For your questions:

  1. 12 is the number of neighbors used to construct the nearest-neighbor graph. This parameter differs across technologies because the distribution of spots differs between technologies. It is not a monotonic parameter; in fact, it controls the smoothing level of the graph convolution, so it should be neither too small nor too large. In benchmarking, I found that SEDR works well from k=6 to k=18 (see Figure S2).
  2. This seems to be an error. I will double check when I am not so busy.
  3. The first edition of SEDR constructed a large distance graph, which was the major rate-limiting and memory-consuming step. But I have revised SEDR to store the distance graph as a sparse matrix and do the calculation on it. You can refer to Fig S4; it now does a good job on large datasets (60,000 spots).
  4. Could you elaborate on this question? I am not sure what the first and second datasets are. I have not observed "greatly slower" runs when I run SEDR.
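
For intuition, here is a toy, pure-NumPy sketch of the two ideas above: a k-nearest-neighbor graph over spot coordinates (point 1), stored in a sparse edge-list form rather than a dense n x n matrix (point 3). This is an illustration only, not SEDR's actual graph_construction code:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((100, 2))  # toy spot coordinates
k = 12                         # number of neighbors, as in graph_construction(adata, 12)

# Brute-force pairwise distances on a toy dataset; SEDR stores the
# resulting graph sparsely instead of keeping a dense n x n matrix.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)                # exclude self-neighbors
neighbors = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest spots

# Sparse COO-style representation: one (row, col) pair per edge
rows = np.repeat(np.arange(coords.shape[0]), k)
cols = neighbors.ravel()
```

The edge list holds only n*k entries instead of n*n, which is what makes the 60,000-spot case tractable.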

rocketeer1998 commented 5 months ago

Thanks for your quick response! Regarding my question 4: SEDR runs much slower when I execute the same code a second time in the same jupyter kernel. I don't know why. Do you have any ideas on how to run SEDR efficiently in a for loop?
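
To narrow down where the time goes, one simple diagnostic is to time each iteration of the loop and force garbage collection between runs. Here run_pipeline is just a stand-in for one SEDR run, and whether leftover state from the previous iteration is the actual cause is an assumption to verify:

```python
import gc
import time

def run_pipeline(data):
    # Stand-in for one SEDR run (graph construction + training)
    return sum(data)

timings = []
for dataset in ([1, 2, 3], [4, 5, 6]):
    t0 = time.perf_counter()
    run_pipeline(dataset)
    timings.append(time.perf_counter() - t0)
    gc.collect()  # drop Python-side references before the next run
print(timings)
```

If SEDR runs on a GPU via PyTorch, it may also be worth freeing cached GPU memory between iterations (torch.cuda.empty_cache()), though I have not confirmed that this is the cause here.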

Li-ZhiD commented 5 months ago

Thanks a lot! I will try it later.

rocketeer1998 commented 5 months ago

What is the relationship between the number of neighbors (6, 12, or 18) and computational time?

Xuhang01 commented 5 months ago

@rocketeer1998 I have tested on Slide-seq, and it shows that using a different number of neighbors does not change the computational time. You can also test it on different datasets.

rocketeer1998 commented 4 months ago

@Xuhang01 After these days of testing, I'm still confused about why my question 4 exists. To elaborate: I've tested SEDR on an anndata with 5,000 cells and 200 genes. In scenario 1, it took 1 minute to run the SEDR pipeline on this data. In scenario 2, it took 72 minutes to run the same pipeline on the same data, the only difference being that it was preceded by runs of the same SEDR pipeline on 5 other datasets in a for loop. It seems the computational efficiency of the later datasets in a for loop is greatly affected. Do you know why?

Xuhang01 commented 4 months ago

@rocketeer1998 Hi, I have tried to do analyses similar to those you described, but I do not get the same problem. Could you share your script with me? I cannot guarantee that my code is the same as yours.

edoumazane commented 1 week ago

Hi @Li-ZhiD , @rocketeer1998 and @Xuhang01 (Thank you for sharing your code!)

After installing SEDR on my Linux machine, I listed what I had to do in addition to the instructions. I recently proposed a Pull Request to add those extra steps to the SEDR installation instructions.

I had not seen this issue before, but here are some things that might be relevant to it!

mclust R package installation

Here is an alternative strategy to @rocketeer1998's solution, adapted from the rpy2 documentation; you only have to do it once in your environment:

import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector

# set R package names
packnames = ('mclust',)

# import R's utility package
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list

# list and install missing packages
packnames_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(packnames_to_install) > 0:
    utils.install_packages(StrVector(packnames_to_install))

Use of conda and environment.yaml file to manage dependencies and automatize environment creation

Using an environment.yaml file that specifies the dependencies once and for all is a very handy approach that ensures reproducibility when creating environments.

# environment.yaml

# Create a new environment: `conda env create -f environment.yaml`
# Update the existing environment: `conda env update -f environment.yaml`

name: SEDR
channels:
  - ...
dependencies:
  - ...
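
For illustration only, a hypothetical filled-in version might look like the following; the channel and package choices are my assumptions, not taken from the SEDR repo:

```yaml
name: SEDR
channels:
  - conda-forge
dependencies:
  - python=3.11
  - rpy2        # conda installs R alongside, as Step1 above recommends
  - r-mclust    # R package used by mclust_R
```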

Setting environment variables

One solution by @rocketeer1998 mentions setting new environment variables in the form import os; os.environ["NEW_ENV_VARIABLE"] = "value". While I didn't have to set new environment variables in my setup, I'd like to mention another approach that I learned, using conda, as seen in conda's documentation:

# In terminal
conda activate SEDR
conda env config vars set NEW_ENV_VARIABLE=value


Hope this helps! 🙂

Xuhang01 commented 1 day ago

@edoumazane Thank you very much!