Open musoke opened 1 month ago
Hi @musoke, thanks for pointing this out. I've added another flag to the example-config.py file which will run the analysis if you're using a single MPI process in my latest commit 695e361. That toggle is controlled by the --single-node command line argument or single_node in the configuration yaml. Please try running
python run-analysis.py --config example-config.yaml --gpu
again after pulling the latest version of main.
Hi @jburba, thanks for adding that flag. I have tried again running with the current main, but now get a different error:
(bayeseor) [nathan@larb BayesEoR] $ python run-analysis.py --config example-config.yaml --gpu
mpi_size: 1
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Parameters │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
{'achromatic_beam': False,
'antenna_diameter': None,
'array_dir_prefix': './array-storage/',
'bandwidth_MHz': 9.029521500604,
'beam_center': None,
'beam_peak_amplitude': 1.0,
'beam_ref_freq': None,
'beam_type': 'gaussian',
'beta': [2.63, 2.82],
'central_jd': 2458098.3065661727,
'channel_width_MHz': 0.237618986858,
'clobber': False,
'config': [Path_fr(example-config.yaml, cwd=/home/nathan/src/BayesEoR)],
'cosfreq': None,
'data_path': Path_fr(./test_data/visibilities.npy, cwd=/home/nathan/src/BayesEoR),
'deta': 1.1074783973138646e-07,
'drift_scan': True,
'du_eor': 4.438755506838748,
'du_fg': 4.438755506838748,
'dv_eor': 4.438755506838748,
'dv_fg': 4.438755506838748,
'eor_random_seed': 892736,
'eor_sim_path': None,
'file_root': None,
'fit_for_monopole': False,
'fit_for_shg_amps': False,
'fit_for_spectral_model_parameters': False,
'fov_dec_eor': 12.9080728652,
'fov_dec_fg': 12.9080728652,
'fov_ra_eor': 12.9080728652,
'fov_ra_fg': 12.9080728652,
'freqs_MHz': array([158.30404874, 158.54166773, 158.77928672, 159.0169057 ,
159.25452469, 159.49214368, 159.72976266, 159.96738165,
160.20500064, 160.44261962, 160.68023861, 160.9178576 ,
161.15547659, 161.39309557, 161.63071456, 161.86833355,
162.10595253, 162.34357152, 162.58119051, 162.81880949,
163.05642848, 163.29404747, 163.53166645, 163.76928544,
164.00690443, 164.24452341, 164.4821424 , 164.71976139,
164.95738038, 165.19499936, 165.43261835, 165.67023734,
165.90785632, 166.14547531, 166.3830943 , 166.62071328,
166.85833227, 167.09595126]),
'fwhm_deg': 9.306821090681533,
'include_instrumental_effects': True,
'inst_model': Path_dr(./test_data/, cwd=/home/nathan/src/BayesEoR),
'integration_time_seconds': 11.0,
'inverse_LW_power': 1e-16,
'log_priors': True,
'neta': 38,
'nf': 38,
'noise_data_path': None,
'noise_seed': 742123,
'npl': 2,
'npl_sh': None,
'nq': 0,
'nq_sh': None,
'nside': 128,
'nt': 34,
'nu': 15,
'nu_fg': 15,
'nu_min_MHz': 158.304048743,
'nu_sh': None,
'nuv': 224,
'nuv_fg': 224,
'nuv_sh': None,
'nv': 15,
'nv_fg': 15,
'nv_sh': None,
'output_dir': './chains/',
'pl_grid_spacing': None,
'pl_max': None,
'pl_min': None,
'priors': [[-2.0, 2.0],
[-1.2, 2.8],
[-0.7, 3.3],
[0.7, 2.7],
[1.1, 3.1],
[1.5, 3.5],
[2.0, 4.0],
[2.4, 4.4],
[2.7, 4.7]],
'ps_box_size_dec_Mpc': 2039.0643396379605,
'ps_box_size_para_Mpc': 148.80444173068133,
'ps_box_size_ra_Mpc': 2039.0643396379605,
'redshift': 7.7302135941678465,
'sigma': 0.00615864342588761,
'simple_za_filter': True,
'single_node': True,
'speed_of_light': 299792458.0,
'taper_func': None,
'telescope_latlonalt': [-30.72152777777791,
21.428305555555557,
1073.0000000093132],
'uprior_bins': '',
'useGPU': True,
'use_LWM_Gaussian_prior': False,
'use_Multinest': True,
'use_intrinsic_noise_fitting': False,
'use_shg': False,
'use_sparse_matrices': True,
'verbose': False}
Output directory: /home/nathan/src/BayesEoR/chains/MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v7
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Matrices │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Array save directory: array-storage/nu-15-nv-15-neta-38-sigma-6.16E-03-nside-128-fov-deg-12.9-za-filter-nq-0/test_data-gaussian-beam-fwhm-9.3068deg-dspb/
Matrix stack complete
---Calculating k-vals---
0 0.04618764702082546
1 0.08652193815249166
2 0.12806815323513662
3 0.16994735208289397
4 0.23300561988455756
5 0.3172473331596665
6 0.44376262151460144
7 0.6125469140767436
8 0.7518323962451139
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Data and Noise │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Using data at ./test_data/visibilities.npy
Generating noise:
Seeding numpy.random with 742123
Hermitian symmetry checks:
signal is Hermitian: True
signal + noise is Hermitian: True
SNR:
Stddev(signal) = 3.0068e-03
Stddev(noise) = 6.1603e-03
SNR = 4.8809e-01
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Posterior │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
priors = [[-2.0, 2.0], [-1.2, 2.8], [-0.7, 3.3], [0.7, 2.7], [1.1, 3.1], [1.5, 3.5], [2.0, 4.0], [2.4, 4.4], [2.7, 4.7]]
Instantiating posterior class:
Using log-priors
Calculating dimensionless_PS
Setting inverse_LW_power to 1e-16
Loading shared library from /home/nathan/anaconda3/envs/bayeseor/lib/libmagma.so
Computing on GPU(s)
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
Error in magma_getdevice_arch: MAGMA not initialized (call magma_init() first) or bad device
0 : GPU inversion error. Setting sample posterior probability to zero.
0 : Param values: [10. 10. 10. 10. 10. 10. 10. 10. 10.]
0 : info = 4294967183
WARNING: Infinite value returned in posterior calculation!
Average evaluation time: 1.304091453552246
Running power spectrum analysis...
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Analysis │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Log files written successfully to /home/nathan/src/BayesEoR/chains/MN-Test-15-15-38-0-2-6.2E-03-2.63-2.82-lp-dPS-v7
MultiNest Warning: no resume file found, starting from scratch
*****************************************************
MultiNest v3.10
Copyright Farhan Feroz & Mike Hobson
Release Jul 2015
no. of live points = 225
dimensionality = 9
*****************************************************
Fatal Python error: gilstate_tss_set: failed to set current tstate (TSS)
Python runtime state: initialized
Thread 0x00007fbfe720e740 (most recent call first):
<no Python frame>
Thread 0x00007fbfe720e740 (most recent call first):
File "/home/nathan/anaconda3/envs/bayeseor/lib/python3.12/site-packages/pymultinest/run.py", line 285 in run
File "/home/nathan/anaconda3/envs/bayeseor/lib/python3.12/site-packages/pymultinest/solve.py", line 71 in solve
File "/home/nathan/src/BayesEoR/run-analysis.py", line 570 in <module>
Aborted (core dumped)
It looks like something has gone wrong in MAGMA and then likely it has run out of memory.
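For what it's worth, the `info = 4294967183` in the log looks like a negative status code printed as an unsigned 32-bit integer. A minimal sketch of the conversion (the helper name `as_signed_32` is made up for illustration, it is not part of BayesEoR or MAGMA):

```python
def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed (two's complement) one."""
    value &= 0xFFFFFFFF          # keep only the low 32 bits
    if value >= 0x80000000:      # top bit set means the signed value is negative
        value -= 0x100000000
    return value


info = 4294967183  # value printed in the log above
print(as_signed_32(info))  # -113
```

MAGMA's own error codes sit in the -100 range, and as far as I can tell -113 is MAGMA_ERR_DEVICE_ALLOC in magma_types.h, which would fit a failed GPU allocation. That mapping is from memory, so it is worth checking against the header shipped with the installed MAGMA version.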
@musoke I've never seen this error before. A few questions for you:
environment.yaml file on the latest main branch?
magma with conda already, I just want to double check. The function in the posterior calculation which uses magma already contains a magma_init() call, so I think this error suggests a communication error between magma and your GPU(s).
1. Yes.
(bayeseor) [nathan@larb BayesEoR] $ git show
commit a80c1f2c368353e8354ddd14a3a43764d5eddfee (HEAD -> main, origin/main, origin/HEAD)
Merge: 695e361 17d185e
Author: jburba <jburba@users.noreply.github.com>
Date: Tue Jun 11 11:04:33 2024 +0100
Merge pull request #33 from PSims/joss-edits
Joss edits to address #30
2. I did not recreate the conda environment after pulling the updates, but will try that now.
3. Ubuntu 20.04 (yes, I need to update) with an NVIDIA RTX A3000
(bayeseor) [nathan@larb BayesEoR] $ uname -a
Linux larb 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
and
(bayeseor) [nathan@larb BayesEoR] $ nvidia-smi -L
GPU 0: NVIDIA RTX A3000 Laptop GPU (UUID: GPU-e990a63d-4550-5bf1-f637-e85571910ab0)
Yes, the error seems to be connected to how/where I ran it, given that it doesn't happen on other machines.
Hopefully rebuilding your conda environment with the latest environment yaml will help. We have cuda and magma installed by conda now. Depending upon when you built your environment last that may or may not have been the case.
Regardless, please let me know how things go after you rebuild the environment!
It looks like the environment.yaml I used did have those included already:
(bayeseor) [nathan@larb BayesEoR] $ conda list
# packages in environment at /home/nathan/anaconda3/envs/bayeseor:
#
# Name Version Build Channel
...
cuda 12.5.0 ha804496_0 conda-forge
cuda-cccl 12.5.39 ha770c72_0 conda-forge
cuda-cccl_linux-64 12.5.39 ha770c72_0 conda-forge
cuda-command-line-tools 12.5.0 ha770c72_0 conda-forge
cuda-compiler 12.5.0 hbad6d8a_0 conda-forge
cuda-crt-dev_linux-64 12.5.40 ha770c72_0 conda-forge
cuda-crt-tools 12.5.40 ha770c72_0 conda-forge
cuda-cudart 12.5.39 he02047a_0 conda-forge
cuda-cudart-dev 12.5.39 he02047a_0 conda-forge
cuda-cudart-dev_linux-64 12.5.39 h85509e4_0 conda-forge
cuda-cudart-static 12.5.39 he02047a_0 conda-forge
cuda-cudart-static_linux-64 12.5.39 h85509e4_0 conda-forge
cuda-cudart_linux-64 12.5.39 h85509e4_0 conda-forge
cuda-cuobjdump 12.5.39 he02047a_0 conda-forge
cuda-cupti 12.5.39 he02047a_0 conda-forge
cuda-cupti-dev 12.5.39 he02047a_0 conda-forge
cuda-cuxxfilt 12.5.39 he02047a_0 conda-forge
cuda-driver-dev 12.5.39 he02047a_0 conda-forge
cuda-driver-dev_linux-64 12.5.39 h85509e4_0 conda-forge
cuda-gdb 12.5.39 hda18ab6_0 conda-forge
cuda-libraries 12.5.0 ha770c72_0 conda-forge
cuda-libraries-dev 12.5.0 ha770c72_0 conda-forge
cuda-nsight 12.5.39 ha770c72_0 conda-forge
cuda-nvcc 12.5.40 hcdd1206_0 conda-forge
cuda-nvcc-dev_linux-64 12.5.40 ha770c72_0 conda-forge
cuda-nvcc-impl 12.5.40 hd3aeb46_0 conda-forge
cuda-nvcc-tools 12.5.40 hd3aeb46_0 conda-forge
cuda-nvcc_linux-64 12.5.40 h8a487aa_0 conda-forge
cuda-nvdisasm 12.5.39 he02047a_0 conda-forge
cuda-nvml-dev 12.5.39 he02047a_0 conda-forge
cuda-nvprof 12.5.39 he02047a_0 conda-forge
cuda-nvprune 12.5.39 he02047a_0 conda-forge
cuda-nvrtc 12.5.40 he02047a_0 conda-forge
cuda-nvrtc-dev 12.5.40 he02047a_0 conda-forge
cuda-nvtx 12.5.39 he02047a_0 conda-forge
cuda-nvvm-dev_linux-64 12.5.40 ha770c72_0 conda-forge
cuda-nvvm-impl 12.5.40 h59595ed_0 conda-forge
cuda-nvvm-tools 12.5.40 h59595ed_0 conda-forge
cuda-nvvp 12.5.39 he02047a_0 conda-forge
cuda-opencl 12.5.39 he02047a_0 conda-forge
cuda-opencl-dev 12.5.39 he02047a_0 conda-forge
cuda-profiler-api 12.5.39 ha770c72_0 conda-forge
cuda-runtime 12.5.0 ha804496_0 conda-forge
cuda-sanitizer-api 12.5.39 he02047a_0 conda-forge
cuda-toolkit 12.5.0 ha804496_0 conda-forge
cuda-tools 12.5.0 ha770c72_0 conda-forge
cuda-version 12.5 hd4f0392_3 conda-forge
cuda-visual-tools 12.5.0 ha770c72_0 conda-forge
...
magma 2.7.2 h51420fd_3 conda-forge
...
(Lots of other dependencies snipped for concision, I can post the full output if that's useful.)
I am recreating the environment now - maybe something else went wrong in the package versions?
I'm not sure. I just wanted to make sure both cuda and magma were installed via conda. I'm trying to dig into the magma source now to see if I can find where that error message is triggered.
I've found a github issue for pytorch about this error message but there isn't a clear answer as to what fixed the error on their end. It could possibly be related to a version mismatch between magma and cuda, but this pytorch issue is a few years old so it's hard to say if that's still a possibility.
This google groups post contains the same error message, and someone there suggests the error appears when the installed version of cuda is too new for the drivers on the system. I don't know what version of cuda you have installed natively on your machine, but maybe cuda 12.5 is not compatible with the NVIDIA drivers on your machine?
That's possible - is cuda 12.5 required for this project? I can try other versions.
On Thu, Jun 27, 2024, 06:36 jburba @.***> wrote:
I've found a github issue for pytorch about this error message https://github.com/pytorch/pytorch/issues/60175 but there isn't a clear answer as to what fixed the error on their end. It could possibly be related to a version mismatch between magma and cuda, but this pytorch issue is a few years old so it's hard to say if that's still a possibility.
This google groups post https://groups.google.com/a/icl.utk.edu/g/magma-user/c/7j-GI3uzNpw contains the same error message and someone suggests that they saw this error if the version of cuda they've installed is too new for the drivers installed on your system. I don't know what version of cuda you have installed natively on your machine, but maybe cuda 12.5 is not compatible with the nVIDIA drivers on your machine?
I don't think so. We haven't run into any cuda version issues in the past, though, so I'm not sure.
Yes. nvidia-smi says I have cuda version 11.4.
python-cuda version 11.7 works with pytorch in my main project.
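One thing to keep in mind: the "CUDA Version" shown in the nvidia-smi header is the newest CUDA version the installed driver supports, not a toolkit installed on the machine. A rough sketch of comparing that against the toolkit version in the conda environment follows. The header line is a made-up sample with placeholder driver numbers, and the helper names are hypothetical. The comparison also ignores CUDA's minor-version compatibility rules, so treat a mismatch as a hint rather than a diagnosis.

```python
import re


def driver_cuda_version(smi_header: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_newer_than_driver(toolkit: tuple[int, int], driver: tuple[int, int]) -> bool:
    """True when the toolkit version exceeds what the driver reports supporting."""
    return toolkit > driver


# Placeholder header line, not real output from this machine.
sample = "| NVIDIA-SMI 000.00   Driver Version: 000.00   CUDA Version: 11.4 |"
driver = driver_cuda_version(sample)
print(driver)                                      # (11, 4)
print(toolkit_newer_than_driver((12, 5), driver))  # True
print(toolkit_newer_than_driver((11, 4), driver))  # False
```

With the versions mentioned in this thread (cuda 12.5 from conda-forge, 11.4 reported by nvidia-smi), the check flags the conda toolkit as newer than what the driver advertises.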
You might try installing one of those versions of cuda, either 11.4 or 11.7, in your bayeseor conda environment to see if that solves this issue.
It doesn't look like conda-forge hosts cuda 11.x. You might have to install cuda 11.4 or 11.7 from the nvidia channel by placing
- nvidia::cuda=11.7
in your environment.yaml file, for example, if you want cuda 11.7. Hopefully, installing cuda from another channel won't cause any dependency/solving issues.
I have tried both that syntax and
channels:
- conda-forge
- nvidia
but was unable to install the older version.
https://github.com/openjournals/joss-reviews/issues/6667
I have installed the package and attempted to run the tests described in the docs (https://bayeseor.readthedocs.io/en/latest/usage.html#test-dataset)
Running with the default example-config.yaml results in an error:
The contents of `./chains/` is