SGGb0nd / step

Public repo for STEP: Spatial Transcriptomics Embedding Procedure
https://sggb0nd.github.io/step/
Apache License 2.0

How to install it with PyTorch 2.3.x? #3

Open Mark-wt opened 3 months ago

Mark-wt commented 3 months ago

Hello, I want to install this tool with PyTorch 2.3.1, but it fails because of a dependency conflict. So how do I install it with PyTorch 2.3.x?

SGGb0nd commented 3 months ago

Hi @Mark-wt,

Thank you for bringing this to our attention. I will test compatibility with PyTorch 2.x (which hadn't been released when step was being developed) and release a new version if necessary. In the meantime, you can clone the repository, modify the pyproject.toml file to match your PyTorch version, and then build and use the package with Poetry.

Steps to Modify and Build the Package with Poetry

  1. Clone the repository:

    git clone https://github.com/SGGb0nd/step.git
    cd step
  2. Modify the pyproject.toml file: open pyproject.toml in a text editor and update the PyTorch dependency to the version you need (e.g., 2.3.x); a sketch of the edited section follows these steps.

  3. Build and install the package using Poetry:

    poetry install
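
For reference, here is a hypothetical sketch of what the edited dependency section might look like; the actual entries and version constraints in step's pyproject.toml will differ, so adapt rather than copy:

    # pyproject.toml (excerpt) -- hypothetical sketch; keep the file's real layout
    [tool.poetry.dependencies]
    python = ">=3.9,<3.13"   # illustrative; keep the repo's existing constraint
    torch = "^2.3.1"         # updated to the PyTorch version you need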

For more detailed instructions, you can refer to the Poetry documentation.

Please let me know if you encounter any further issues.

Mark-wt commented 3 months ago

Thanks so much. It works.

Mark-wt commented 1 month ago

Hello, I have many high-resolution ST slices (Visium HD, 16 µm bins) to integrate, but the run fails with a CUDA out-of-memory error (80 GB). Can it run on multiple GPUs? Or what parameters should I adjust without hurting performance?

SGGb0nd commented 1 month ago

Hi @Mark-wt, you can set the `sample_rate` argument of the `.run` method to an appropriate integer, e.g., 2048, which is the number of nodes sampled from each slice in each iteration.
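
A minimal sketch of where that argument goes. The initialization below is a placeholder (the real entry point and constructor arguments are in step's docs); only `.run` and `sample_rate` come from this thread:

    # Hypothetical sketch -- everything except .run(sample_rate=...) is a
    # placeholder; consult step's documentation for the real initialization.
    import anndata as ad

    adata = ad.read_h5ad("slices.h5ad")          # concatenated Visium HD slices
    model = StepInterface(adata, batch="slice")  # placeholder class name/args
    model.run(sample_rate=2048)  # sample 2048 nodes per slice per iteration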

SGGb0nd commented 1 month ago

By the way, I've identified and fixed a bug in the spatial graph construction where the graph only had self-loops. This issue was hindering spatial domain-level analysis, which focuses more on localized and niche areas rather than individual cells. I'll push the fix soon. If your targets prioritize spatial domains, microenvironments, or spatial niches, please keep an eye out for this update.

Mark-wt commented 1 month ago

> Hi @Mark-wt, you can set the `sample_rate` argument of the `.run` method to an appropriate integer, e.g., 2048, which is the number of nodes sampled from each slice in each iteration.

Thanks for the timely reply. Is it true that the larger `n_samples` is, the larger the amount of data? What about other parameters, like `graph_batch_size=1`, `n_modules=8`, `edge_clip=1`, `n_glayers=4`, `hidden_dim=30`, `module_dim=20`?

Mark-wt commented 1 month ago

> By the way, I've identified and fixed a bug in the spatial graph construction where the graph only had self-loops. This issue was hindering spatial domain-level analysis, which focuses more on localized and niche areas rather than individual cells. I'll push the fix soon. If your targets prioritize spatial domains, microenvironments, or spatial niches, please keep an eye out for this update.

Yes, I focus on spatial domain detection and am looking forward to your update.

SGGb0nd commented 1 month ago

> Thanks for the timely reply. Is it true that the larger `n_samples` is, the larger the amount of data? What about other parameters, like `graph_batch_size=1`, `n_modules=8`, `edge_clip=1`, `n_glayers=4`, `hidden_dim=30`, `module_dim=20`?

Sorry for the confusing naming of the parameters. `n_samples` only applies to the single-slice scenario, when no `batch` is specified at the initialization of step's interface/object/instance; in the multi-slice case, the number of slices used for training in each iteration is given by `graph_batch_size`. So, to answer your question: yes, a larger `n_samples` means a larger amount of data, i.e., cells or spots, involved in each iteration of model training. Analogously, when you're integrating multiple slices, the number of cells/spots used for training is given by

  1. `graph_batch_size` $\times$ `sample_rate` (when `sample_rate` is an integer larger than 1), or
  2. the sum of cells/spots sampled from the `graph_batch_size` sampled slices (if 2, then 2 slices are sampled) at the given `sample_rate` (when `sample_rate` is between 0 and 1).

For the other params: they do affect GPU memory usage, since they directly determine the number of model parameters. However, I recommend tuning the two parameters above, `graph_batch_size` and `sample_rate`, as well as `n_iterations`, because I've successfully run step on 26 newly released MERFISH sagittal slices (millions of cells in total, `graph_batch_size=5, sample_rate=3000, n_iterations=4000`) and obtained decent spatial domains on a single V100 GPU.
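
To make the two cases concrete, here's a small back-of-the-envelope helper (hypothetical code, not part of step; it just restates the two rules above):

    # Hypothetical helper, not part of step: rough cells/spots per training iteration.
    def nodes_per_iteration(graph_batch_size, sample_rate, slice_sizes):
        if sample_rate > 1:                      # integer: fixed node count per slice
            return graph_batch_size * int(sample_rate)
        # fraction in (0, 1]: a fraction of each sampled slice's nodes
        chosen = slice_sizes[:graph_batch_size]  # stand-in for the sampled slices
        return sum(int(n * sample_rate) for n in chosen)

    # The MERFISH setup above: 5 slices x 3000 nodes = 15000 nodes per iteration
    print(nodes_per_iteration(5, 3000, slice_sizes=[]))                # -> 15000
    # Fractional case: 10% of two slices with 80k and 120k cells -> 20000
    print(nodes_per_iteration(2, 0.1, slice_sizes=[80_000, 120_000]))  # -> 20000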

Mark-wt commented 1 month ago

Thanks very much. It runs now after setting these two parameters as you advised.

SGGb0nd commented 1 month ago

Good to hear. You can also leave `n_modules`, `hidden_dim`, and `module_dim` at their defaults for a larger model capacity.

Mark-wt commented 1 month ago

> Good to hear. You can also leave `n_modules`, `hidden_dim`, and `module_dim` at their defaults for a larger model capacity.

Yeah, it also works.

SGGb0nd commented 2 weeks ago

@Mark-wt Hi Mark, I've pushed the fix for the spatial graph construction bug I mentioned earlier (where graphs only had self-loops). The update should now properly handle spatial domain-level analysis for large-scale datasets.

Mark-wt commented 2 weeks ago

> @Mark-wt Hi Mark, I've pushed the fix for the spatial graph construction bug I mentioned earlier (where graphs only had self-loops). The update should now properly handle spatial domain-level analysis for large-scale datasets.

Thanks very much for letting me know.