PSims / BayesEoR

Code to estimate the power spectrum of redshifted 21-cm emission from interferometric observations, within a Bayesian forward modelling framework.
https://bayeseor.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

[joss-review] Error when running tests #31

Open musoke opened 1 month ago

musoke commented 1 month ago

https://github.com/openjournals/joss-reviews/issues/6667

I have installed the package and attempted to run the tests described in the docs (https://bayeseor.readthedocs.io/en/latest/usage.html#test-dataset).

Running

python run-analysis.py --config example-config.yaml --cpu
python run-analysis.py --config example-config.yaml --gpu

with the default example-config.yaml results in an error:

(bayeseor) [nathan@host BayesEoR] $ python run-analysis.py --config example-config.yaml --gpu

mpi_size: 1

╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Parameters                                                                                             │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
{'achromatic_beam': False,
 'antenna_diameter': None,
 'array_dir_prefix': './array-storage/',
 'bandwidth_MHz': 9.029521500604,
 'beam_center': None,
 'beam_peak_amplitude': 1.0,
 'beam_ref_freq': None,
 'beam_type': 'gaussian',
 'beta': [2.63, 2.82],
 'central_jd': 2458098.3065661727,
 'channel_width_MHz': 0.237618986858,
 'clobber': False,
 'config': [Path_fr(example-config.yaml, cwd=/home/nathan/src/BayesEoR)],
 'cosfreq': None,
 'data_path': Path_fr(./test_data/visibilities.npy, cwd=/home/nathan/src/BayesEoR),
 'deta': 1.1074783973138646e-07,
 'drift_scan': True,
 'du_eor': 4.438755506838748,
 'du_fg': 4.438755506838748,
 'dv_eor': 4.438755506838748,
 'dv_fg': 4.438755506838748,
 'eor_random_seed': 892736,
 'eor_sim_path': None,
 'file_root': None,
 'fit_for_monopole': False,
 'fit_for_shg_amps': False,
 'fit_for_spectral_model_parameters': False,
 'fov_dec_eor': 12.9080728652,
 'fov_dec_fg': 12.9080728652,
 'fov_ra_eor': 12.9080728652,
 'fov_ra_fg': 12.9080728652,
 'freqs_MHz': array([158.30404874, 158.54166773, 158.77928672, 159.0169057 ,
       159.25452469, 159.49214368, 159.72976266, 159.96738165,
       160.20500064, 160.44261962, 160.68023861, 160.9178576 ,
       161.15547659, 161.39309557, 161.63071456, 161.86833355,
       162.10595253, 162.34357152, 162.58119051, 162.81880949,
       163.05642848, 163.29404747, 163.53166645, 163.76928544,
       164.00690443, 164.24452341, 164.4821424 , 164.71976139,
       164.95738038, 165.19499936, 165.43261835, 165.67023734,
       165.90785632, 166.14547531, 166.3830943 , 166.62071328,
       166.85833227, 167.09595126]),
 'fwhm_deg': 9.306821090681533,
 'include_instrumental_effects': True,
 'inst_model': Path_dr(./test_data/, cwd=/home/nathan/src/BayesEoR),
 'integration_time_seconds': 11.0,
 'inverse_LW_power': 1e-16,
 'log_priors': True,
 'neta': 38,
 'nf': 38,
 'noise_data_path': None,
 'noise_seed': 742123,
 'npl': 2,
 'npl_sh': None,
 'nq': 0,
 'nq_sh': None,
 'nside': 128,
 'nt': 34,
 'nu': 15,
 'nu_fg': 15,
 'nu_min_MHz': 158.304048743,
 'nu_sh': None,
 'nuv': 224,
 'nuv_fg': 224,
 'nuv_sh': None,
 'nv': 15,
 'nv_fg': 15,
 'nv_sh': None,
 'output_dir': './chains/',
 'pl_grid_spacing': None,
 'pl_max': None,
 'pl_min': None,
 'priors': [[-2.0, 2.0],
            [-1.2, 2.8],
            [-0.7, 3.3],
            [0.7, 2.7],
            [1.1, 3.1],
            [1.5, 3.5],
            [2.0, 4.0],
            [2.4, 4.4],
            [2.7, 4.7]],
 'ps_box_size_dec_Mpc': 2039.0643396379605,
 'ps_box_size_para_Mpc': 148.80444173068133,
 'ps_box_size_ra_Mpc': 2039.0643396379605,
 'redshift': 7.7302135941678465,
 'sigma': 0.00615864342588761,
 'simple_za_filter': True,
 'single_node': False,
 'speed_of_light': 299792458.0,
 'taper_func': None,
 'telescope_latlonalt': [-30.72152777777791,
                         21.428305555555557,
                         1073.0000000093132],
 'uprior_bins': '',
 'useGPU': True,
 'use_LWM_Gaussian_prior': False,
 'use_Multinest': True,
 'use_intrinsic_noise_fitting': False,
 'use_shg': False,
 'use_sparse_matrices': True,
 'verbose': False}

Output directory: /home/nathan/src/BayesEoR/chains/MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v1

╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Matrices                                                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Array save directory: array-storage/nu-15-nv-15-neta-38-sigma-6.16E-03-nside-128-fov-deg-12.9-za-filter-nq-0/test_data-gaussian-beam-fwhm-9.3068deg-dspb/
Matrix stack complete

---Calculating k-vals---
0 0.04618764702082546
1 0.08652193815249166
2 0.12806815323513662
3 0.16994735208289397
4 0.23300561988455756
5 0.3172473331596665
6 0.44376262151460144
7 0.6125469140767436
8 0.7518323962451139

╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Data and Noise                                                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Using data at ./test_data/visibilities.npy

Generating noise:
Seeding numpy.random with 742123

Hermitian symmetry checks:
signal is Hermitian: True
signal + noise is Hermitian: True

SNR:
Stddev(signal) = 3.0068e-03
Stddev(noise) = 6.1603e-03
SNR = 4.8809e-01

╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Posterior                                                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
priors = [[-2.0, 2.0], [-1.2, 2.8], [-0.7, 3.3], [0.7, 2.7], [1.1, 3.1], [1.5, 3.5], [2.0, 4.0], [2.4, 4.4], [2.7, 4.7]]

Instantiating posterior class:
Using log-priors
Calculating dimensionless_PS
Setting inverse_LW_power to 1e-16
Loading shared library from /home/nathan/anaconda3/envs/bayeseor/lib/libmagma.so
Computing on GPU(s)

                    MPI size == 1, analysis will only be run with --single-node flag.

Skipping sampling, exiting...

The contents of ./chains/ are:

(bayeseor) [nathan@larb BayesEoR] $ tree chains/
chains/
└── MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v1
    ├── k-vals-bins.txt
    ├── k-vals-nsamples.txt
    └── k-vals.txt

1 directory, 3 files

jburba commented 4 weeks ago

Hi @musoke, thanks for pointing this out. In my latest commit 695e361, I've added another flag to the example-config.yaml file which will run the analysis if you're using a single MPI process. That toggle is controlled by the --single-node command line argument or single_node in the configuration yaml. Please try running

python run-analysis.py --config example-config.yaml --gpu

again after pulling the latest version of main.
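The behaviour described in this comment, where a run with a single MPI process only samples when the single-node toggle is set, can be sketched roughly as follows. This is a minimal illustration and not BayesEoR's actual code: the function name should_run_sampling is hypothetical, and the assumption that runs with more than one MPI process always sample is inferred from the "MPI size == 1" log message rather than confirmed. Only the names mpi_size and single_node come from the output in this thread.

```python
def should_run_sampling(mpi_size: int, single_node: bool) -> bool:
    """Hypothetical gate: sample when several MPI processes are available,
    or when the user explicitly opted in to a single-process run."""
    return mpi_size > 1 or single_node


# First run in this thread: one MPI process, single_node False.
print(should_run_sampling(mpi_size=1, single_node=False))  # False
# Run after the fix: one MPI process, single_node True.
print(should_run_sampling(mpi_size=1, single_node=True))   # True
```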

musoke commented 2 weeks ago

Hi @jburba, thanks for adding that flag. I have tried again running with the current main, but now get a different error:

(bayeseor) [nathan@larb BayesEoR] $ python run-analysis.py --config example-config.yaml --gpu

mpi_size: 1

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Parameters                                                                                                                                                                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
{'achromatic_beam': False,
 'antenna_diameter': None,
 'array_dir_prefix': './array-storage/',
 'bandwidth_MHz': 9.029521500604,
 'beam_center': None,
 'beam_peak_amplitude': 1.0,
 'beam_ref_freq': None,
 'beam_type': 'gaussian',
 'beta': [2.63, 2.82],
 'central_jd': 2458098.3065661727,
 'channel_width_MHz': 0.237618986858,
 'clobber': False,
 'config': [Path_fr(example-config.yaml, cwd=/home/nathan/src/BayesEoR)],
 'cosfreq': None,
 'data_path': Path_fr(./test_data/visibilities.npy, cwd=/home/nathan/src/BayesEoR),
 'deta': 1.1074783973138646e-07,
 'drift_scan': True,
 'du_eor': 4.438755506838748,
 'du_fg': 4.438755506838748,
 'dv_eor': 4.438755506838748,
 'dv_fg': 4.438755506838748,
 'eor_random_seed': 892736,
 'eor_sim_path': None,
 'file_root': None,
 'fit_for_monopole': False,
 'fit_for_shg_amps': False,
 'fit_for_spectral_model_parameters': False,
 'fov_dec_eor': 12.9080728652,
 'fov_dec_fg': 12.9080728652,
 'fov_ra_eor': 12.9080728652,
 'fov_ra_fg': 12.9080728652,
 'freqs_MHz': array([158.30404874, 158.54166773, 158.77928672, 159.0169057 ,
       159.25452469, 159.49214368, 159.72976266, 159.96738165,
       160.20500064, 160.44261962, 160.68023861, 160.9178576 ,
       161.15547659, 161.39309557, 161.63071456, 161.86833355,
       162.10595253, 162.34357152, 162.58119051, 162.81880949,
       163.05642848, 163.29404747, 163.53166645, 163.76928544,
       164.00690443, 164.24452341, 164.4821424 , 164.71976139,
       164.95738038, 165.19499936, 165.43261835, 165.67023734,
       165.90785632, 166.14547531, 166.3830943 , 166.62071328,
       166.85833227, 167.09595126]),
 'fwhm_deg': 9.306821090681533,
 'include_instrumental_effects': True,
 'inst_model': Path_dr(./test_data/, cwd=/home/nathan/src/BayesEoR),
 'integration_time_seconds': 11.0,
 'inverse_LW_power': 1e-16,
 'log_priors': True,
 'neta': 38,
 'nf': 38,
 'noise_data_path': None,
 'noise_seed': 742123,
 'npl': 2,
 'npl_sh': None,
 'nq': 0,
 'nq_sh': None,
 'nside': 128,
 'nt': 34,
 'nu': 15,
 'nu_fg': 15,
 'nu_min_MHz': 158.304048743,
 'nu_sh': None,
 'nuv': 224,
 'nuv_fg': 224,
 'nuv_sh': None,
 'nv': 15,
 'nv_fg': 15,
 'nv_sh': None,
 'output_dir': './chains/',
 'pl_grid_spacing': None,
 'pl_max': None,
 'pl_min': None,
 'priors': [[-2.0, 2.0],
            [-1.2, 2.8],
            [-0.7, 3.3],
            [0.7, 2.7],
            [1.1, 3.1],
            [1.5, 3.5],
            [2.0, 4.0],
            [2.4, 4.4],
            [2.7, 4.7]],
 'ps_box_size_dec_Mpc': 2039.0643396379605,
 'ps_box_size_para_Mpc': 148.80444173068133,
 'ps_box_size_ra_Mpc': 2039.0643396379605,
 'redshift': 7.7302135941678465,
 'sigma': 0.00615864342588761,
 'simple_za_filter': True,
 'single_node': True,
 'speed_of_light': 299792458.0,
 'taper_func': None,
 'telescope_latlonalt': [-30.72152777777791,
                         21.428305555555557,
                         1073.0000000093132],
 'uprior_bins': '',
 'useGPU': True,
 'use_LWM_Gaussian_prior': False,
 'use_Multinest': True,
 'use_intrinsic_noise_fitting': False,
 'use_shg': False,
 'use_sparse_matrices': True,
 'verbose': False}

Output directory: /home/nathan/src/BayesEoR/chains/MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v7

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Matrices                                                                                                                                                                                                          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Array save directory: array-storage/nu-15-nv-15-neta-38-sigma-6.16E-03-nside-128-fov-deg-12.9-za-filter-nq-0/test_data-gaussian-beam-fwhm-9.3068deg-dspb/
Matrix stack complete

---Calculating k-vals---
0 0.04618764702082546
1 0.08652193815249166
2 0.12806815323513662
3 0.16994735208289397
4 0.23300561988455756
5 0.3172473331596665
6 0.44376262151460144
7 0.6125469140767436
8 0.7518323962451139

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Data and Noise                                                                                                                                                                                                    │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Using data at ./test_data/visibilities.npy

Generating noise:
Seeding numpy.random with 742123

Hermitian symmetry checks:
signal is Hermitian: True
signal + noise is Hermitian: True

SNR:
Stddev(signal) = 3.0068e-03
Stddev(noise) = 6.1603e-03
SNR = 4.8809e-01

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Posterior                                                                                                                                                                                                         │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
priors = [[-2.0, 2.0], [-1.2, 2.8], [-0.7, 3.3], [0.7, 2.7], [1.1, 3.1], [1.5, 3.5], [2.0, 4.0], [2.4, 4.4], [2.7, 4.7]]

Instantiating posterior class:
Using log-priors
Calculating dimensionless_PS
Setting inverse_LW_power to 1e-16
Loading shared library from /home/nathan/anaconda3/envs/bayeseor/lib/libmagma.so
Computing on GPU(s)

Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values:  [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
                                                                             WARNING: Infinite value returned in posterior calculation!
Average evaluation time: 1.304091453552246

                                                                                         Running power spectrum analysis...

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Analysis                                                                                                                                                                                                          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Log files written successfully to /home/nathan/src/BayesEoR/chains/MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v7
 MultiNest Warning: no resume file found, starting from scratch
 *****************************************************
 MultiNest v3.10
 Copyright Farhan Feroz & Mike Hobson
 Release Jul 2015

 no. of live points =  225
 dimensionality =    9
 *****************************************************
Fatal Python error: gilstate_tss_set: failed to set current tstate (TSS)
Python runtime state: initialized

Thread 0x00007fbfe720e740 (most recent call first):
  <no Python frame>

Thread 0x00007fbfe720e740 (most recent call first):
  File "/home/nathan/anaconda3/envs/bayeseor/lib/python3.12/site-packages/pymultinest/run.py", line 285 in run
  File "/home/nathan/anaconda3/envs/bayeseor/lib/python3.12/site-packages/pymultinest/solve.py", line 71 in solve
  File "/home/nathan/src/BayesEoR/run-analysis.py", line 570 in <module>
Aborted (core dumped)

It looks like something has gone wrong in MAGMA, and then it has likely run out of memory.
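One detail in the log that may help with diagnosis: the value info = 4294967183 looks like a negative 32-bit status code printed as an unsigned integer. A small sketch of the conversion is below. The helper name as_signed_32 is made up for illustration. The arithmetic is exact, but which MAGMA error the resulting code stands for is not stated anywhere in this thread and should be checked against the error definitions in the installed MAGMA headers.

```python
def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed (two's complement) one."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


# The "info" value reported in the log above.
print(as_signed_32(4294967183))  # -113
```

MAGMA routines report failures through negative info values, so -113 is the number to look up.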

jburba commented 1 week ago

@musoke I've never seen this error before. A few questions for you:

  1. Are you using the latest version of the main branch?
  2. Have you created a conda environment from the environment.yaml file on the latest main branch?
    • It looks like you've installed magma with conda already, I just want to double check.
  3. What system and GPU(s) are you running this test with?

The function in the posterior calculation which uses magma already contains a magma_init() call, so I think this error suggests a communication error between magma and your GPU(s).
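To separate a MAGMA/GPU problem from anything BayesEoR does, one option is to initialise MAGMA directly from Python. The sketch below is an assumption about how such a check could look, not code from BayesEoR: check_magma is a hypothetical helper, and it assumes the default MAGMA build in which magma_init() returns a plain C int. The functions magma_init, magma_print_environment and magma_finalize are part of MAGMA's C API.

```python
import ctypes


def check_magma(lib_path: str):
    """Try to initialise MAGMA directly.

    Returns the status code from magma_init() (0 means success),
    or None if the shared library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    status = lib.magma_init()
    # Prints the MAGMA and CUDA versions and the devices MAGMA can see.
    lib.magma_print_environment()
    lib.magma_finalize()
    return status


# Example call, using the library path printed in the log above:
# check_magma("/home/nathan/anaconda3/envs/bayeseor/lib/libmagma.so")
```

If this standalone call also fails or lists no usable device, the problem is likely in the CUDA/driver setup rather than in the posterior code.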

musoke commented 1 week ago

  1. Yes.

    
    (bayeseor) [nathan@larb BayesEoR] $ git show
    commit a80c1f2c368353e8354ddd14a3a43764d5eddfee (HEAD -> main, origin/main, origin/HEAD)
    Merge: 695e361 17d185e
    Author: jburba <jburba@users.noreply.github.com>
    Date:   Tue Jun 11 11:04:33 2024 +0100
    
    Merge pull request #33 from PSims/joss-edits
    
    Joss edits to address #30

2. I did not recreate the conda environment after pulling the updates, but will try that now.
3. Ubuntu 20.04 (yes, I need to update) with an NVIDIA RTX A3000

(bayeseor) [nathan@larb BayesEoR] $ uname -a
Linux larb 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

and

(bayeseor) [nathan@larb BayesEoR] $ nvidia-smi -L
GPU 0: NVIDIA RTX A3000 Laptop GPU (UUID: GPU-e990a63d-4550-5bf1-f637-e85571910ab0)

Yes, the error seems to be connected to how/where I ran it, given that it doesn't happen on other machines.

jburba commented 1 week ago

Hopefully rebuilding your conda environment with the latest environment yaml will help. We have cuda and magma installed by conda now. Depending upon when you built your environment last that may or may not have been the case.

Regardless, please let me know how things go after you rebuild the environment!

musoke commented 1 week ago

It looks like the environment.yaml I used did have those included already:

(bayeseor) [nathan@larb BayesEoR] $ conda list
# packages in environment at /home/nathan/anaconda3/envs/bayeseor:
#
# Name                    Version                   Build  Channel
...
cuda                      12.5.0               ha804496_0    conda-forge
cuda-cccl                 12.5.39              ha770c72_0    conda-forge
cuda-cccl_linux-64        12.5.39              ha770c72_0    conda-forge
cuda-command-line-tools   12.5.0               ha770c72_0    conda-forge
cuda-compiler             12.5.0               hbad6d8a_0    conda-forge
cuda-crt-dev_linux-64     12.5.40              ha770c72_0    conda-forge
cuda-crt-tools            12.5.40              ha770c72_0    conda-forge
cuda-cudart               12.5.39              he02047a_0    conda-forge
cuda-cudart-dev           12.5.39              he02047a_0    conda-forge
cuda-cudart-dev_linux-64  12.5.39              h85509e4_0    conda-forge
cuda-cudart-static        12.5.39              he02047a_0    conda-forge
cuda-cudart-static_linux-64 12.5.39              h85509e4_0    conda-forge
cuda-cudart_linux-64      12.5.39              h85509e4_0    conda-forge
cuda-cuobjdump            12.5.39              he02047a_0    conda-forge
cuda-cupti                12.5.39              he02047a_0    conda-forge
cuda-cupti-dev            12.5.39              he02047a_0    conda-forge
cuda-cuxxfilt             12.5.39              he02047a_0    conda-forge
cuda-driver-dev           12.5.39              he02047a_0    conda-forge
cuda-driver-dev_linux-64  12.5.39              h85509e4_0    conda-forge
cuda-gdb                  12.5.39              hda18ab6_0    conda-forge
cuda-libraries            12.5.0               ha770c72_0    conda-forge
cuda-libraries-dev        12.5.0               ha770c72_0    conda-forge
cuda-nsight               12.5.39              ha770c72_0    conda-forge
cuda-nvcc                 12.5.40              hcdd1206_0    conda-forge
cuda-nvcc-dev_linux-64    12.5.40              ha770c72_0    conda-forge
cuda-nvcc-impl            12.5.40              hd3aeb46_0    conda-forge
cuda-nvcc-tools           12.5.40              hd3aeb46_0    conda-forge
cuda-nvcc_linux-64        12.5.40              h8a487aa_0    conda-forge
cuda-nvdisasm             12.5.39              he02047a_0    conda-forge
cuda-nvml-dev             12.5.39              he02047a_0    conda-forge
cuda-nvprof               12.5.39              he02047a_0    conda-forge
cuda-nvprune              12.5.39              he02047a_0    conda-forge
cuda-nvrtc                12.5.40              he02047a_0    conda-forge
cuda-nvrtc-dev            12.5.40              he02047a_0    conda-forge
cuda-nvtx                 12.5.39              he02047a_0    conda-forge
cuda-nvvm-dev_linux-64    12.5.40              ha770c72_0    conda-forge
cuda-nvvm-impl            12.5.40              h59595ed_0    conda-forge
cuda-nvvm-tools           12.5.40              h59595ed_0    conda-forge
cuda-nvvp                 12.5.39              he02047a_0    conda-forge
cuda-opencl               12.5.39              he02047a_0    conda-forge
cuda-opencl-dev           12.5.39              he02047a_0    conda-forge
cuda-profiler-api         12.5.39              ha770c72_0    conda-forge
cuda-runtime              12.5.0               ha804496_0    conda-forge
cuda-sanitizer-api        12.5.39              he02047a_0    conda-forge
cuda-toolkit              12.5.0               ha804496_0    conda-forge
cuda-tools                12.5.0               ha770c72_0    conda-forge
cuda-version              12.5                 hd4f0392_3    conda-forge
cuda-visual-tools         12.5.0               ha770c72_0    conda-forge
...
magma                     2.7.2                h51420fd_3    conda-forge
...

(Lots of other dependencies snipped for concision, I can post the full output if that's useful.)

I am recreating the environment now - maybe something else went wrong in the package versions?

jburba commented 1 week ago

I'm not sure. I just wanted to make sure both cuda and magma were installed via conda. I'm trying to dig into the magma source now to see if I can find where that error message is triggered.

jburba commented 1 week ago

I've found a github issue for pytorch about this error message (https://github.com/pytorch/pytorch/issues/60175) but there isn't a clear answer as to what fixed the error on their end. It could possibly be related to a version mismatch between magma and cuda, but this pytorch issue is a few years old so it's hard to say if that's still a possibility.

This google groups post (https://groups.google.com/a/icl.utk.edu/g/magma-user/c/7j-GI3uzNpw) contains the same error message, and someone suggests that this error appears when the installed version of cuda is too new for the drivers installed on the system. I don't know what version of cuda you have installed natively on your machine, but maybe cuda 12.5 is not compatible with the NVIDIA drivers on your machine?

musoke commented 1 week ago

That's possible - is cuda 12.5 required for this project? I can try other versions.

On Thu, Jun 27, 2024, 06:36 jburba @.***> wrote:

I've found a github issue for pytorch about this error message https://github.com/pytorch/pytorch/issues/60175 but there isn't a clear answer as to what fixed the error on their end. It could possibly be related to a version mismatch between magma and cuda, but this pytorch issue is a few years old so it's hard to say if that's still a possibility.

This google groups post https://groups.google.com/a/icl.utk.edu/g/magma-user/c/7j-GI3uzNpw contains the same error message and someone suggests that they saw this error if the version of cuda they've installed is too new for the drivers installed on your system. I don't know what version of cuda you have installed natively on your machine, but maybe cuda 12.5 is not compatible with the nVIDIA drivers on your machine?


jburba commented 1 week ago

I don't think so. We haven't run into any cuda version issues in the past, though, so I'm not sure.

musoke commented 1 week ago

Yes. nvidia-smi says I have cuda version 11.4.

python-cuda version 11.7 works with pytorch in my main project that uses pytorch.
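The driver/runtime comparison discussed here can also be done programmatically. The sketch below is illustrative only: the helper names are made up, and the major-version rule is a coarse assumption (a CUDA runtime whose major version is newer than what the driver supports will generally not work, while minor-version differences within the same major version are often tolerated). cudaDriverGetVersion and cudaRuntimeGetVersion are real CUDA runtime API functions, and both encode a version as 1000 * major + 10 * minor. Note that find_library may not see a library that lives only inside a conda environment, in which case the full path to libcudart would have to be passed to ctypes.CDLL instead.

```python
import ctypes
import ctypes.util


def decode_cuda_version(code: int) -> tuple[int, int]:
    """Split CUDA's integer encoding (1000 * major + 10 * minor) into (major, minor)."""
    return code // 1000, (code % 1000) // 10


def runtime_major_supported(driver_code: int, runtime_code: int) -> bool:
    """Coarse check: the runtime's major version must not exceed the driver's."""
    return decode_cuda_version(runtime_code)[0] <= decode_cuda_version(driver_code)[0]


def query_cuda_versions():
    """Return (driver_code, runtime_code), or None if libcudart is unavailable."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None
    driver, runtime = ctypes.c_int(0), ctypes.c_int(0)
    if lib.cudaDriverGetVersion(ctypes.byref(driver)) != 0:
        return None
    if lib.cudaRuntimeGetVersion(ctypes.byref(runtime)) != 0:
        return None
    return driver.value, runtime.value


# Versions mentioned in this thread: driver supports 11.4, conda installed 12.5.
print(runtime_major_supported(11040, 12050))  # False
# The 11.7 runtime that works with pytorch on the same machine.
print(runtime_major_supported(11040, 11070))  # True
```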

jburba commented 1 week ago

I might try installing one of those versions of cuda, either 11.4 or 11.7, in your bayeseor conda environment to see if that solves this issue?

It doesn't look like conda-forge hosts cuda 11.x. You might have to install cuda 11.4 or 11.7 from the nvidia channel by placing

  - nvidia::cuda=11.7

in your environment.yaml file, for example, if you want cuda 11.7. Hopefully, installing cuda from another channel won't cause any dependency/solving issues.

musoke commented 1 week ago

I have tried both that syntax and

channels:
  - conda-forge
  - nvidia

but was unable to install the older version.