Vfold-RNA / MgNet


ERROR: Failed to load OptiX shared library #1

Closed mafarsani closed 9 months ago

mafarsani commented 9 months ago

I wanted to run this package on the example provided in the tutorial, but I am getting the error below: ERROR: Failed to load OptiX shared library. Could you please help me address the issue? The system configuration I am using is printed below.

lscpu
Architecture:             x86_64
CPU op-mode(s):           32-bit, 64-bit
Address sizes:            46 bits physical, 57 bits virtual
Byte Order:               Little Endian
CPU(s):                   32
On-line CPU(s) list:      0-31
Vendor ID:                GenuineIntel
Model name:               Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
CPU family:               6
Model:                    106
Thread(s) per core:       2
Core(s) per socket:       16
Socket(s):                1
Stepping:                 6
CPU max MHz:              3500.0000
CPU min MHz:              800.0000
BogoMIPS:                 5800.00
Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    768 KiB (16 instances)
  L1i:                    512 KiB (16 instances)
  L2:                     20 MiB (16 instances)
  L3:                     24 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                  Not affected
  Tsx async abort:        Not affected

docker -v
Docker version 24.0.7, build afdd53b

====================================

nvidia-smi

Fri Dec 15 14:38:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:18:00.0 Off |                  N/A |
| 50%   33C    P2            111W / 350W  |   542MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:51:00.0 Off |                  N/A |
| 51%   22C    P8             25W / 350W  |     5MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:8A:00.0 Off |                  N/A |
| 48%   21C    P8             18W / 350W  |     5MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:C3:00.0 Off |                  N/A |
| 53%   26C    P8             19W / 350W  |     5MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      6450      C   python                                      534MiB |
+---------------------------------------------------------------------------------------+
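
As an extra sanity check, the same GPUs are also visible to PyTorch inside the container. This is only a small diagnostic sketch (it is not part of MgNet, and it assumes the PyTorch build shipped in the Docker image):

import torch

# quick diagnostic: confirm that PyTorch sees the driver, the GPUs, and cuDNN
print('CUDA available :', torch.cuda.is_available())
print('Device count   :', torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f'  [{i}]', torch.cuda.get_device_name(i))
print('cuDNN enabled  :', torch.backends.cudnn.enabled)
print('cuDNN version  :', torch.backends.cudnn.version())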



Below is the whole output printed after running the package:

in_rna: /home/mfarsani/MgNet/example/example.pdb
out_dir: /home/mfarsani/MgNet/example/
preparing /tmp/example.pdb ...
######## get RNA part ########
Info) VMD for LINUXAMD64, version 1.9.4a57 (April 27, 2022)
Info) http://www.ks.uiuc.edu/Research/vmd/
Info) Email questions and bug reports to vmd@ks.uiuc.edu
Info) Please include this reference in published work using VMD:
Info) Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
Info) Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
Info) -------------------------------------------------------------
Info) Multithreading available, 32 CPUs.
Info) CPU features: SSE2 SSE4.1 AVX AVX2 FMA F16 AVX512F AVX512CD HT
Info) Free system memory: 752GB (99%)
Info) Creating CUDA device pool and initializing hardware...
Info) Detected 4 available CUDA accelerators:
Info) [0-3] NVIDIA GeForce RTX 3090 82 SM_8.6 1.7 GHz, 24GB RAM SP32 AE2 ZC
OptiXRenderer) ERROR: Failed to load OptiX shared library.
OptiXRenderer) NVIDIA driver may be too old.
OptiXRenderer) Check/update NVIDIA driver
Info) Dynamically loaded 3 plugins in directory:
Info) /usr/local/lib/vmd/plugins/LINUXAMD64/molfile
/tmp/mgnet/example//example.pdb /tmp/mgnet/example//example_rna.pdb
Info) Using plugin pdb for structure file /tmp/mgnet/example//example.pdb
Info) Using plugin pdb for coordinates from file /tmp/mgnet/example//example.pdb
Info) Determining bond structure from distance search ...
Info) Finished with coordinate file /tmp/mgnet/example//example.pdb.
Info) Analyzing structure ...
Info)    Atoms: 810
Info)    Bonds: 907
Info)    Angles: 0  Dihedrals: 0  Impropers: 0  Cross-terms: 0
Info)    Bondtypes: 0  Angletypes: 0  Dihedraltypes: 0  Impropertypes: 0
Info)    Residues: 38
Info)    Waters: 0
Info)    Segments: 1
Info)    Fragments: 1   Protein: 0   Nucleic: 1
0
atomselect0
Info) Opened coordinate file /tmp/mgnet/example//example_rna.pdb for writing.
Info) Finished with coordinate file /tmp/mgnet/example//example_rna.pdb.
Info) VMD for LINUXAMD64, version 1.9.4a57 (April 27, 2022)
Info) Exiting normally.
vmd >
######## remove altloc ########
######## generate pdbqt ########
setting PYTHONHOME environment
######## voxelization ########
/opt/conda/lib/python3.6/site-packages/htmd/molecule/util.py:666: NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, A))
  covariance = np.dot(P.T, Q)
/opt/conda/lib/python3.6/site-packages/htmd/molecule/util.py:704: NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float32, 2d, C), array(float32, 2d, A))
  all1 = np.dot(all1, rot.T)
ffevaluate module is in beta version
2023-12-13 00:00:39,492 - binstar - INFO - Using Anaconda API: https://api.anaconda.org
There is something wrong with your /root/.htmd/.latestversion file. Will not check for new HTMD versions.
usage: python 3-voxelization.py inrnapdb inpdbqt save_folder
example_rna 1 1 G image non zero 13993 Mg non zero 0 occupancies 12300 partial_charges 1693
example_rna 1 2 G image non zero 17894 Mg non zero 0 occupancies 15704 partial_charges 2190
example_rna 1 3 A image non zero 21831 Mg non zero 0 occupancies 19182 partial_charges 2649
example_rna 1 4 U image non zero 25522 Mg non zero 0 occupancies 22414 partial_charges 3108
example_rna 1 5 A image non zero 25482 Mg non zero 0 occupancies 22406 partial_charges 3076
example_rna 1 6 C image non zero 24653 Mg non zero 0 occupancies 21704 partial_charges 2949
example_rna 1 7 A image non zero 24794 Mg non zero 0 occupancies 21819 partial_charges 2975
example_rna 1 8 C image non zero 28194 Mg non zero 0 occupancies 24745 partial_charges 3449
example_rna 1 9 A image non zero 31246 Mg non zero 0 occupancies 27466 partial_charges 3780
example_rna 1 10 A image non zero 27995 Mg non zero 0 occupancies 24567 partial_charges 3428
example_rna 1 11 G image non zero 28724 Mg non zero 0 occupancies 25232 partial_charges 3492
example_rna 1 12 A image non zero 26966 Mg non zero 0 occupancies 23761 partial_charges 3205
example_rna 1 13 G image non zero 26677 Mg non zero 0 occupancies 23400 partial_charges 3277
example_rna 1 14 U image non zero 17950 Mg non zero 0 occupancies 15743 partial_charges 2207
example_rna 1 15 G image non zero 18695 Mg non zero 0 occupancies 16455 partial_charges 2240
example_rna 1 16 A image non zero 16862 Mg non zero 0 occupancies 14994 partial_charges 1868
example_rna 1 17 U image non zero 23760 Mg non zero 0 occupancies 20956 partial_charges 2804
example_rna 1 18 U image non zero 29173 Mg non zero 0 occupancies 25711 partial_charges 3462
example_rna 1 19 G image non zero 30589 Mg non zero 0 occupancies 26948 partial_charges 3641
example_rna 1 20 A image non zero 21475 Mg non zero 0 occupancies 18898 partial_charges 2577
example_rna 1 21 A image non zero 20073 Mg non zero 0 occupancies 17741 partial_charges 2332
example_rna 1 22 A image non zero 38715 Mg non zero 0 occupancies 34095 partial_charges 4620
example_rna 1 23 C image non zero 35548 Mg non zero 0 occupancies 31239 partial_charges 4309
example_rna 1 24 U image non zero 20053 Mg non zero 0 occupancies 17686 partial_charges 2367
example_rna 1 25 A image non zero 20261 Mg non zero 0 occupancies 17797 partial_charges 2464
example_rna 1 26 A image non zero 31248 Mg non zero 0 occupancies 27432 partial_charges 3816
example_rna 1 27 G image non zero 28195 Mg non zero 0 occupancies 24737 partial_charges 3458
example_rna 1 28 U image non zero 29170 Mg non zero 0 occupancies 25629 partial_charges 3541
example_rna 1 29 C image non zero 26772 Mg non zero 0 occupancies 23460 partial_charges 3312
example_rna 1 30 U image non zero 24851 Mg non zero 0 occupancies 21821 partial_charges 3030
example_rna 1 31 G image non zero 30319 Mg non zero 0 occupancies 26612 partial_charges 3707
example_rna 1 32 U image non zero 26658 Mg non zero 0 occupancies 23343 partial_charges 3315
example_rna 1 33 G image non zero 24129 Mg non zero 0 occupancies 21139 partial_charges 2990
example_rna 1 34 U image non zero 23939 Mg non zero 0 occupancies 21013 partial_charges 2926
example_rna 1 35 A image non zero 23318 Mg non zero 0 occupancies 20512 partial_charges 2806
example_rna 1 36 U image non zero 21863 Mg non zero 0 occupancies 19211 partial_charges 2652
example_rna 1 37 C image non zero 18623 Mg non zero 0 occupancies 16322 partial_charges 2301
example_rna 1 38 C image non zero 14214 Mg non zero 0 occupancies 12476 partial_charges 1738

######## predict, density, cluster ########
==> Resuming from checkpoint -> /src/MgNet/script/model/checkpoint/cv1/ckpt.e40
/opt/conda/lib/python3.6/site-packages/torch/serialization.py:453: SourceChangeWarning: source code of class 'dncon2.Net' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
len(testset) ---> 38
len(testloader) ---> 38
GPU ---> range(0, 4)
Use cudnn ---> 7602
cv_index -> 1
image_dir -> /tmp/mgnet/example//image/
result_dir -> /tmp/mgnet/example//result/cv1//raw/
num_worker -> 30
Traceback (most recent call last):
  File "/src/MgNet/script//4-predict.py", line 167, in <module>
    test(start_epoch)
  File "/src/MgNet/script//4-predict.py", line 117, in test
    outputs = net(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/src/MgNet/script/dncon2.py", line 93, in forward
    out = self.conv_first(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 478, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Have 3 arguments: /src/MgNet/script//density/density /tmp/mgnet/example//result/cv1//raw/ 0.5
Traceback (most recent call last):
  File "/src/MgNet/script//5-cluster.py", line 34, in <module>
    assert os.path.exists(density_folder), f'Error: density_folder does not exist -> {density_folder}'
AssertionError: Error: density_folder does not exist -> /tmp/mgnet/example//result/cv1//density/

[the run then resumes from the cv2, cv3, cv4, and cv5 checkpoints (ckpt.e40) and fails each time with the same cuDNN error (CUDNN_STATUS_MAPPING_ERROR) in 4-predict.py and the same missing density_folder assertion in 5-cluster.py]

######## MgNet completed ########

zhou0312 commented 9 months ago

Hi,

I have updated the docker image; the newest tag is 1.0.1. What I did is basically the following three things (a rough sketch follows the list):

  1. set torch.backends.cudnn.enabled = False
  2. set torch.backends.cudnn.benchmark = False
  3. restrict the model to run on only a single GPU device
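
In code, the changes look roughly like the sketch below. This is not the exact diff that went into the 1.0.1 image; the placeholder Conv3d network, channel count, and grid size are only for illustration (the real model is dncon2.Net loaded from the checkpoints):

import torch
import torch.nn as nn

# 1. + 2. turn cuDNN (and its autotuner) off, so the 3D convolutions fall back
#    to plain CUDA kernels and avoid CUDNN_STATUS_MAPPING_ERROR
torch.backends.cudnn.enabled = False
torch.backends.cudnn.benchmark = False

# 3. run on a single GPU instead of wrapping the model in torch.nn.DataParallel
#    across all visible devices
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# placeholder standing in for the MgNet model; channel count and grid size are
# made up for this example
net = nn.Conv3d(in_channels=8, out_channels=1, kernel_size=3, padding=1).to(device)

inputs = torch.randn(1, 8, 32, 32, 32, device=device)  # dummy voxel batch
outputs = net(inputs)
print(outputs.shape)  # torch.Size([1, 1, 32, 32, 32])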

To run the new model, simply go to the root of your cloned MgNet folder and run the following commands:

git pull
./setup

Then proceed to run the example case.

Please let me know if this fixes the error.

mafarsani commented 9 months ago

Hello, thank you for the hints and the update. I was able to run it successfully and got the results. Best regards.