dattalab / keypoint-moseq

https://keypoint-moseq.readthedocs.io

CUDA Illegal Memory Access Issue #87

Closed sean-afshar closed 11 months ago

sean-afshar commented 11 months ago

I have recently run into some difficulties applying kpms 0.2.2 to a large 3D pose dataset consisting of approximately 50 million keypoints. When I try to fit the full model after completing ARHMM fitting, I encounter an illegal memory access error. Furthermore, the error arises at a different epoch on each training run. I never experienced this particular issue when working with 2D datasets of the same magnitude.

I was able to train successfully once on approximately 50% of my 3D data, but was not able to consistently replicate these results. I was only successful that first time after turning off memory pre-allocation. For reference, I am running kpms 0.2.2 (pip build) on a Windows 10 machine with CUDA 11.1 and cuDNN 8.2. The 3D data was generated by sleap-anipose, and the error occurs both when I format the data with the sleap-anipose loading option and when I build the coordinates dictionary manually.
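For readers unfamiliar with the setting mentioned above, here is a minimal sketch of turning off GPU memory pre-allocation. It assumes the post refers to JAX's `XLA_PYTHON_CLIENT_PREALLOCATE` environment variable, which is not confirmed in the thread. The helper `preallocation_disabled` is hypothetical and only illustrates checking the setting.

```python
import os

# Assumption: "memory pre-allocation" means JAX's default behavior of
# reserving a large share of GPU memory when the backend initializes.
# The variable must be set before JAX is first imported in the process.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"


def preallocation_disabled() -> bool:
    """Hypothetical helper: report whether pre-allocation is switched off."""
    value = os.environ.get("XLA_PYTHON_CLIENT_PREALLOCATE", "")
    return value.lower() == "false"


print(preallocation_disabled())
```

In a notebook, this would need to run in the first cell, before `keypoint_moseq` or `jax` is imported, since the setting is read when the GPU backend starts.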

Given the error trace below, it is possible that I might be running out of GPU memory. However, I was able to complete training on this complete dataset in kpms 0.1.3 by restarting my kernel after the ARHMM fitting and loading the model. That has not been the case in 0.2.2.

Thank you guys for your continued support.

ValueError                                Traceback (most recent call last)
Cell In[17], line 7
      4 model, data, metadata, current_iter = kpms.load_checkpoint(
      5     project_dir, model_name=name, iteration=n_ar_iters)
      6 model = kpms.update_hypparams(model, kappa=kappa)
----> 7 model = kpms.fit_model(
      8     model, data, metadata, project_dir, name, ar_only=False, 
      9     start_iter=current_iter, num_iters=current_iter+extra_iters,parallel_message_passing=False)[0]

File C:\Miniconda3\envs\keypoint_moseq\lib\site-packages\keypoint_moseq\fitting.py:178, in fit_model(model, data, metadata, project_dir, model_name, num_iters, start_iter, verbose, ar_only, save_every_n_iters, generate_progress_plots, parallel_message_passing, **kwargs)
    176 for iteration in pbar:
    177     try:
--> 178         model = _wrapped_resample(
    179             data,
    180             model,
    181             pbar=pbar,
    182             ar_only=ar_only,
    183             verbose=verbose,
    184             parallel_message_passing=parallel_message_passing,
    185         )
    186     except StopResampling:
    187         break

File C:\Miniconda3\envs\keypoint_moseq\lib\site-packages\keypoint_moseq\fitting.py:21, in _wrapped_resample(data, model, pbar, **resample_options)
     19 def _wrapped_resample(data, model, pbar=None, **resample_options):
     20     try:
---> 21         model = resample_model(data, **model, **resample_options)
     22     except KeyboardInterrupt:
     23         print("Early termination of fitting: user interruption")

File C:\Miniconda3\envs\keypoint_moseq\lib\site-packages\jax_moseq\models\keypoint_slds\gibbs.py:461, in resample_model(data, seed, states, params, hypparams, noise_prior, ar_only, states_only, resample_global_noise_scale, resample_local_noise_scale, fix_heading, verbose, jitter, parallel_message_passing, **kwargs)
    459     if verbose:
    460         print("Resampling h (heading)")
--> 461     states["h"] = resample_heading(seed, **data, **states, **params)
    463 if verbose:
    464     print("Resampling v (location)")

ValueError: INTERNAL: Failed to launch CUDA kernel: slice_113 with block dimensions: 1x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

calebweinreb commented 11 months ago

Thanks for posting! It looks like one solution is to upgrade to CUDA 11.2. Higher versions may also work.
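To check whether an installed toolkit meets the suggested minimum, one option is to parse the release number that `nvcc --version` reports. This is a rough sketch, not part of keypoint-moseq. The function names are hypothetical and the sample string is illustrative rather than taken from the poster's machine.

```python
import re
from typing import Optional, Tuple


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from text such as 'release 11.1, V11.1.105'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: Optional[Tuple[int, int]],
                  minimum: Tuple[int, int] = (11, 2)) -> bool:
    """Return True if the parsed version is at least the suggested minimum."""
    return version is not None and version >= minimum


# Illustrative sample line (assumption: typical nvcc output format).
sample = "Cuda compilation tools, release 11.1, V11.1.105"
print(meets_minimum(parse_cuda_release(sample)))  # False: 11.1 < 11.2
```

On a real machine, the text passed to `parse_cuda_release` could come from running `nvcc --version` in a terminal and pasting the output, or from capturing it with `subprocess.run`.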