ComputationalCryoEM / ASPIRE-Python

Algorithms for Single Particle Reconstruction
http://spr.math.princeton.edu
GNU General Public License v3.0

Error when run aspire.cov3d with cupy #1115

Closed: ThinkJanice closed this issue 1 month ago

ThinkJanice commented 1 month ago

Describe the bug: Dear ASPIRE-Python Development Team, I am working specifically with the cov3d module from the ASPIRE package. After modifying the config.yaml file as shown below,

(screenshot: modified config.yaml, 2024-05-08 15:57)
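(The screenshot did not survive in text form. For context, the kind of override being described would look roughly like the fragment below; the key names are my recollection of ASPIRE's config schema and may not match your installed version, so treat it as illustrative only.)

```yaml
common:
  # numeric backend: numpy (default) or cupy
  numeric: cupy
  # fft backend: a CPU option such as scipy/pyfftw, or a GPU-backed option
  fft: cupy
```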

I encountered the following error.

(screenshot: error traceback, 2024-05-08 15:57)

Despite extensive research and troubleshooting attempts online, I have not found a viable solution to this issue. I apologize for the direct approach, but I would greatly appreciate any guidance or insights you could provide on this matter. Your support would be incredibly helpful to my research.

To Reproduce:

```python
import logging
import os

import numpy as np
import pandas as pd
import starfile
from scipy.cluster.vq import kmeans2

from aspire.basis import FFBBasis3D
from aspire.covariance import CovarianceEstimator
from aspire.denoising import src_wiener_coords
from aspire.noise import AnisotropicNoiseEstimator, WhiteNoiseEstimator
from aspire.operators import CTFFilter
from aspire.reconstruction import MeanEstimator
from aspire.source.relion import RelionSource
from aspire.source.simulation import Simulation
from aspire.utils import RelionStarFile, eigs
from aspire.utils.random import Random
from aspire.volume import Volume

logger = logging.getLogger(__name__)

# Set input path and files and initialize other parameters
DATA_FOLDER = "xx"
STARFILE = "xx"
PIXEL_SIZE = 0.85
MAX_ROWS = None
MAX_RESOLUTION = 16
CG_TOL = 1e-5
num_vols = 2
batchsize = 500

# Set number of eigenvectors to keep
NUM_EIGS = 32

# Create a source object for experimental 2D images with estimated rotation
# angles (RelionSource is CPU-only)
print(f"Read in images from {STARFILE} and preprocess the images.")
source = RelionSource(
    STARFILE, data_folder=DATA_FOLDER, pixel_size=PIXEL_SIZE, max_rows=MAX_ROWS
)

# NOTE: the next line is not valid Python (cannot assign to a call result);
# the intent appears to be casting the images to float32:
# source.images[:].asnumpy() = source.images[:].asnumpy().astype(np.float32)

print("The dtype of source is", source.dtype)

# Downsample the images
print(f"Set the resolution to {MAX_RESOLUTION} X {MAX_RESOLUTION}")
if MAX_RESOLUTION < source.L:
    source = source.downsample(MAX_RESOLUTION)

# Specify the fast FB basis method for expanding the 2D images
basis = FFBBasis3D(
    (MAX_RESOLUTION, MAX_RESOLUTION, MAX_RESOLUTION), dtype=source.dtype
)
print("The dtype of basis is", basis.dtype)

# Estimate the noise of images
print("Estimate the noise of images using AnisotropicNoiseEstimator")
noise_estimator = AnisotropicNoiseEstimator(source, batchSize=batchsize)  # check also

# Whiten the noise of images
print("Whiten the noise of images from the noise estimator")
source = source.whiten(noise_estimator)

# NOTE: this loop references undefined names (batch_num, ffbbasis_out, dtype)
# and overwrites `source`; it appears to be leftover from another script:
# for l in range(0, batch_num):
#     source = np.load("{}{:04}_rwts_mat_l.npy".format(ffbbasis_out, l),
#                      allow_pickle=True).astype(dtype)

# Estimate the noise variance. This is needed for the covariance
# estimation step below.
noise_variance = noise_estimator.estimate()
print(f"Noise Variance = {noise_variance}")

mean_estimator = MeanEstimator(source, basis=basis, batch_size=batchsize)  # adjust the batch_size
mean_est = mean_estimator.estimate()

# Passing in a mean_kernel argument to the following constructor speeds up
# some calculations
covar_estimator = CovarianceEstimator(
    source, basis=basis, mean_kernel=mean_estimator.kernel
)
covar_est = covar_estimator.estimate(mean_est, noise_variance, tol=CG_TOL)

# Extract the top eigenvectors and eigenvalues of the covariance estimate
eigs_est, lambdas_est = eigs(covar_est, NUM_EIGS)
for i in range(NUM_EIGS):
    print(f"Top {i}th eigenvalue: {lambdas_est[i, i]}")
```

Expected behavior: I expect the 3D mean and covariance estimation to run to completion.


garrettwrong commented 1 month ago

Hi, thanks for your interest in ASPIRE. I'm not really able to follow your script or make out exactly what is happening from the log screenshot.

I would note that you are not using the most recent version of ASPIRE (12.2). A newer version is required for CUDA 12. I'm not sure what version you are attempting to run here, or whether your installation is sound... The current release and/or develop versions are going to have the best support.

The closest script I have available to run is our cov3d demo, which looks to be where your script started from. I confirmed the following works on a machine with CUDA 12.3, and it automatically uses the GPU for any available components with no user configuration required.

```shell
git clone git@github.com:ComputationalCryoEM/ASPIRE-Python ASPIRE-Python.main
cd ASPIRE-Python.main
conda create -n 1115 python=3.8 -y
conda activate 1115  # activate the new environment before installing
pip install -e ".[dev,gpu-12x]"
python gallery/tutorials/tutorials/cov3d_simulation.py
```

I can generate errors if I manually override the config as you have. In that case the override forces portions of the code to run via cupy that have not actually been extended yet. As I mentioned above, you should get the available GPU extensions automatically without changing the configuration. You can confirm this by noting the use of cufinufft in the logs, and also by invoking nvidia-smi from another shell session while the script is running to see the Python process using the GPU.
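As a quick sanity check (generic Python, not an ASPIRE API), you can also confirm which GPU-related packages are even importable in the active environment before digging further:

```python
import importlib.util


def gpu_backend_status(packages=("cupy", "cufinufft", "pycuda")):
    """Return a dict mapping each package name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, present in gpu_backend_status().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

If any of these report "missing" in the environment where you run ASPIRE, the GPU code paths cannot be used regardless of the config.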

Let me know if that works for you, thanks.

ThinkJanice commented 1 month ago

Hi, thank you for your response. I used the following command to reinstall the environment on the HPC

(screenshot: reinstall commands)

and there were no errors. However, when running the code, I encountered the following problem

(screenshot: CUDA error, 2024-05-09 19:33)

which seems to be caused by an incorrect CUDA driver. I am not sure how to resolve this issue yet.

garrettwrong commented 1 month ago

Hi there, looks like we're closer :).

I don't believe that is an error within ASPIRE, and unfortunately, with a customized HPC environment, it is going to be very difficult for me to precisely debug other software purely from the error messages that bubble up to ASPIRE. You are likely correct that you have a driver/environment issue, and it might be more prudent to work with your system administrators and test that all the dependent software packages (i.e., pycuda, cupy, cufinufft) are actually functioning correctly and passing their own tests. Most likely that error is generated by cufinufft being run on a machine with a mismatch between the driver and the tools used for building, though this could happen with the other packages too.

You can find the driver version installed on the machine by using the command nvidia-smi. If it does not meet the minimal requirements, i.e. it is behind 12.x, your options are to build the impacted dependent packages from source (possibly with patches) or, if it makes more sense, have a newer driver installed.
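For reference, CUDA 12.x on Linux generally requires driver 525.60.13 or newer (check NVIDIA's release notes for your exact toolkit version). A tiny helper for comparing the dotted driver version reported by nvidia-smi against such a minimum might look like:

```python
def driver_meets_minimum(driver: str, minimum: str) -> bool:
    """Numerically compare dotted version strings, e.g. '550.54.14' vs '525.60.13'."""
    as_tuple = lambda s: tuple(int(part) for part in s.split("."))
    return as_tuple(driver) >= as_tuple(minimum)


# Both drivers from the nvidia-smi outputs below satisfy the CUDA 12 minimum:
print(driver_meets_minimum("550.54.14", "525.60.13"))  # True
print(driver_meets_minimum("545.23.08", "525.60.13"))  # True
```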

The closest HPC machine I have access to is going to be Princeton's Della, and I confirmed that this code functions on Della yesterday with the following environment.

```shell
module load anaconda3/2024.2
module load gcc/8
module load cudatoolkit/12.3
```

```
$ nvidia-smi
Thu May  9 08:34:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
```

The other machine I tested yesterday is at:

```
$ nvidia-smi
Thu May  9 08:35:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
```

To be honest, the performance improvement using the GPU for that specific area of code is not great at this time, so it might not be worth the hassle over the host (threaded) version.

ThinkJanice commented 1 month ago

Yep, I can successfully run cufinufft on my own workstation, but when I try to run cufinufft within ASPIRE on the HPC, I encounter the error mentioned earlier.

I have a question: does the cufinufft in ASPIRE automatically use all available GPUs, or does it use just one GPU at a time?

garrettwrong commented 1 month ago

Sounds like your cufinufft HPC trouble is more likely from the cufinufft package not working on that system than from ASPIRE. Perhaps your system's support or FI support can help you with cufinufft directly. Do you know if any of their (cufinufft) tests are failing on that system?

I believe any ASPIRE calls to cufinufft will only use one GPU. In this case, performance is constrained by overhead and transfer, not because a single GPU is fully occupied.
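Since those calls use only one GPU, a common way to pick which device a process uses (a general CUDA convention, not an ASPIRE-specific setting) is to set CUDA_VISIBLE_DEVICES before any CUDA library is imported:

```python
import os


def pin_to_gpu(index: int) -> None:
    """Expose a single GPU to this process. Must run before importing
    cupy/cufinufft, which enumerate devices at initialization."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


pin_to_gpu(0)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # "0"
```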

ThinkJanice commented 1 month ago

Hi!

The CUDA issue has been resolved. Previously, using the GeForce RTX 2080 Ti caused errors, but now with the Tesla V100-PCIE-32GB, it runs successfully.

Thanks for your guidance!

garrettwrong commented 1 month ago

Glad to hear it. Sounds like there must have been some architecture dependent CUDA code in there. Thanks for following up. Closing.