I'm contacting the license holders in Chem Eng to ask for confirmation that this is OK and about the arrangements for access to the reserved application group.
We have the go ahead from the license holder in Chemical Eng to proceed with the installation on Grace.
I now have the Comsol installer in:
/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz
It looks like Comsol needs Matlab R2018a installed. We haven't installed R2018a on Grace yet so doing this first.
Matlab R2018a installed on Grace.
I've now done a test install of Comsol on Grace. Will now try and run one of the multicore tutorial examples and sort out the module file. Use:
COMSOL Multiphysics>Tutorial models>micromixer_cluster
Also need to create a reserved application group.
I'm creating the group lgcomsol for the Chemical Eng installation of COMSOL. I'm adding my ucaabaa test account.
Group set up. Changed ownership of files and permissions:
cd /shared/ucl/apps/Comsol/
chgrp -R lgcomsol comsol53a
chmod -R o-rwx comsol53a
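A quick way to confirm the chmod above took effect is to search for anything still accessible to "other". This is only a sketch: the helper name `world_accessible` is hypothetical, and it assumes GNU find (for the `-perm /mode` syntax). Symlinks are skipped because `chmod -R` does not change them.

```shell
# Hypothetical helper: list entries under a directory that still carry any
# read/write/execute bit for "other" users.
world_accessible() {
    # ! -type l   : ignore symlinks, whose mode bits chmod -R leaves alone
    # -perm /007  : match if any of the other-r/w/x bits is set (GNU find)
    find "$1" ! -type l -perm /007 -print
}

# Example (path taken from the commands above); no output means the tree
# is closed to users outside the owner and group:
# world_accessible /shared/ucl/apps/Comsol/comsol53a
```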
Added user requesting Comsol to the reserved application group.
Module file pushed and I've got a test job running on Grace 😸
The test job finished. However I didn't have time to check that it worked before going on leave.
I've also emailed the user.
User has now requested installation on Myriad as their simulation is running out of memory on Grace nodes!
Comsol installer copied to Myriad in:
/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz
Installation finished for Comsol 53a plus update 4 on Myriad.
Update will need to be applied to Grace installation.
Changed ownership of files and permissions.
Now to try running a test job ...
Test job submitted and queuing:
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  62329 0.00000 Comsol_par ccaabaa      qw    08/31/2018 15:17:13                                   36
OK job starts, finishes and produces no output! And no error messages either - need to investigate ...
At least my next attempt has produced some errors:
Model to run is micromixer_batch.mph
(node-j00a-001:0)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002ad0f49fcc30, pid=30802, tid=0x00002ad0f35de100
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libpthread.so.0+0x9c30] pthread_mutex_lock+0x0
#
# Core dump written. Default location: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62360/core or core.30802
#
# An error report file with more information is saved as:
# /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62360/hs_err_pid30802.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 30802 RUNNING AT node-j00a-001
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 30802 RUNNING AT node-j00a-001
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
Finished.
and:
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
I think the errors have been caused by wrong Fabric settings. I've changed them but now all Comsol licenses are in use so I can't do any further testing at the moment:
Model to run is micromixer_batch.mph
(node-i00a-002:0)
Node 0 is running on host: node-i00a-002.myriad.ucl.ac.uk
Node 0 has address: node-i00a-002
*******************************************
***COMSOL 5.3.1.348 progress output file***
*******************************************
Fri Aug 31 16:06:06 BST 2018
COMSOL Multiphysics 5.3a (Build: 348) starting in batch mode
Opening file: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62363/micromixer_batch.mph
/******************/
/*****Error********/
/******************/
Could not obtain license for COMSOL Multiphysics.
License error: -4.
Licensed number of users already reached.
Feature: COMSOL
License path: /lustre/shared/ucl/apps/Comsol/comsol53a/multiphysics/license/license.dat:
FlexNet Licensing error:-4,132
For further information, refer to the FlexNet Licensing documentation,
available at "www.flexerasoftware.com".
Total time: 7 s.
Finished.
I've now been able to run my test job successfully:
Model to run is micromixer_batch.mph
(node-j00a-001:0)
Node 0 is running on host: node-j00a-001.myriad.ucl.ac.uk
Node 0 has address: node-j00a-001.myriad.ucl.ac.uk
*******************************************
***COMSOL 5.3.1.348 progress output file***
*******************************************
Mon Sep 03 12:55:26 BST 2018
COMSOL Multiphysics 5.3a (Build: 348) starting in batch mode
Opening file: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_66241/micromixer_batch.mph
Open time: 8 s.
Running: Study 1
Running in distributed mode using 1 nodes.
Running on 2 x Intel(R) Xeon(R) Gold 6140 CPU at 2.30 GHz.
Using 2 sockets with 36 cores in total on node-j00a-001.myriad.ucl.ac.uk.
Available memory: 192.90 GB.
Current Progress: 0 % - Free Tetrahedral 1
Memory: 874/874 12480/12480
Number of vertex elements: 188
Current Progress: 0 % - Analyzing domains
Memory: 913/913 12491/12491
Current Progress: 0 % - Adjusting boundary mesh
Memory: 920/920 12496/12496
Number of edge elements: 1974
Number of boundary elements: 13134
Current Progress: 1 % - Creating initial tetrahedra
Memory: 922/922 12497/12497
Current Progress: 1 % - Respecting boundaries
Current Progress: 1 % - Inserting interior points
Memory: 928/928 12507/12507
Current Progress: 1 % - Improving element quality
Memory: 937/937 12578/12578
Number of elements: 94439
Free meshing time: 1.84s
Minimum element quality: 0.1826
Current Progress: 2 % - Finalizing mesh
Memory: 942/942 12583/12583
Current Progress: 2 % - Boundary Layers 1
Current Progress: 2 % - Inserting boundary layer elements
Memory: 978/978 12616/12616
Current Progress: 2 % - Smoothing transition to interior mesh
Memory: 1020/1020 12657/12657
Current Progress: 3 % - Smoothing transition to interior mesh
Memory: 1021/1021 12658/12658
<---- Compile Equations: Stationary in Study 1/Solution 1 (sol1) ---------------
Started at 3-Sep-2018 12:55:48.
---------- Current Progress: 100 % -
Solution time: 97 s. (1 minute, 37 seconds)
Physical memory: 7.05 GB
Virtual memory: 16.83 GB
Ended at 3-Sep-2018 13:01:09.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 336 s.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_66241/mymodelresult.mph
Save time: 1 s.
Total time: 344 s.
---------- Current Progress: 100 % - Done
Memory: 6679/12756 16157/23130
Finished.
so I'm going to add the correct fabric setting for Myriad to the module file.
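The thread does not record which fabric value was chosen. As a rough sketch only, this is how an Intel MPI fabric could be pinned per cluster from a shell wrapper. `I_MPI_FABRICS` is Intel MPI's own environment variable; the helper name `fabric_for_cluster` and both fabric values are assumptions for illustration, not the settings used here.

```shell
# Hypothetical helper: pick an Intel MPI fabric string for a given cluster.
# The values are placeholders, not the ones applied in this ticket.
fabric_for_cluster() {
    case "$1" in
        myriad) echo "shm:tcp" ;;   # assumption: shared memory within a node, TCP between nodes
        *)      echo "shm:dapl" ;;  # assumption: clusters with an RDMA-capable interconnect
    esac
}

# Export the choice so that MPI processes started afterwards inherit it.
I_MPI_FABRICS="$(fabric_for_cluster myriad)"
export I_MPI_FABRICS
```

In a module file the same effect would come from setting the variable in the environment the module provides, so users do not have to export it in every job script.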
Module file updated. I'm now running my test job again on:
Informed user.
Now requested to be installed on Legion.
Installer copied from Myriad to Legion in:
/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz
readable only by ccsp group.
First attempt to install using ccspap2 fails:
Installing to: /shared/ucl/apps/Comsol/comsol53a/multiphysics:
Downloading Acoustics Module...
Downloading Acoustics Module Applications...
Downloading Acoustics Module Documentation...
Downloading Batteries & Fuel Cells Module...
Downloading Batteries & Fuel Cells Module Applications...
Downloading Batteries & Fuel Cells Module Documentation...
Downloading CAD Import Module...
Downloading CAD Import Module Applications...
Downloading CAD Import Module Documentation...
Downloading CFD Module...
Downloading CFD Module Applications...
Downloading CFD Module Documentation...
Downloading COMSOL Cluster Components...
Downloading COMSOL Multiphysics...
Downloading COMSOL Multiphysics Applications...
Downloading COMSOL Multiphysics Documentation...
Downloading Chemical Reaction Engineering Module...
com.comsol.install.FlInstException: com.comsol.install.FlAbortException: Service Unavailable
Removing temporary COMSOL installer components...
Not sure what caused this. I've re-run the build script and it completed without error.
Will now run some tests.
Test job used on Myriad submitted requesting 32 cores.
Test job on Legion has not worked. Will need to check on Monday.
My first test job failed at the end:
/******************/
/*****Error********/
/******************/
The following feature has encountered a problem:
- Feature: Stationary Solver 2 (sol1/s2)
Undefined value found.
- Detail: NaN or Inf found when solving linear system using GMRES.
- Error on node 1: Undefined value found.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_460545/mymodelresult.mph
Save time: 1 s.
Total time: 1505 s.
Finished.
This morning I've run the test again using 16 cores on a single node and it has completed successfully:
---------- Current Progress: 100 % -
Solution time: 205 s. (3 minutes, 25 seconds)
Physical memory: 3.98 GB
Virtual memory: 11.1 GB
Ended at 1-Jul-2019 11:44:48.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 575 s.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_465676/mymodelresult.mph
Save time: 1 s.
Total time: 591 s.
---------- Current Progress: 100 % - Done
Memory: 3692/8785 10643/16454
Finished.
Emailed user to say the Legion install is available for testing.
See if it is possible to add a pre-job start license availability check. May be difficult because of the number of different components licensed as part of COMSOL.
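One possible shape for such a check, as a sketch only: parse the per-feature summary line that FlexNet's `lmstat` prints and succeed only when fewer licenses are in use than issued. The helper name `license_free` is hypothetical, the location of `lmutil` on the clusters is an assumption, and this covers a single feature at a time, so it does not address the multi-component problem mentioned above.

```shell
# Hypothetical helper: read lmstat output on stdin and return success only
# if at least one license for the named feature is free.
license_free() {
    awk -v feat="$1" '
        # Expected line shape:
        # Users of COMSOL:  (Total of 5 licenses issued;  Total of 3 licenses in use)
        $1 == "Users" && $3 == (feat ":") {
            issued = $6 + 0
            used = $11 + 0
            found = 1
        }
        END { exit !(found && used < issued) }
    '
}

# Example use at the top of a job script (lmutil assumed to be on PATH;
# license path as shown in the error output earlier in this thread):
# lmutil lmstat -c /lustre/shared/ucl/apps/Comsol/comsol53a/multiphysics/license/license.dat -f COMSOL \
#     | license_free COMSOL || { echo "No COMSOL license free" >&2; exit 1; }
```

A free license at check time can still be taken before COMSOL starts, so this narrows the failure window rather than closing it.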
Been poking through all the tickets here, seems like the issues with this one are no longer valid.
A request to install the commercial package Comsol Multiphysics https://uk.comsol.com/comsol-multiphysics using the Chemical Engineering Department's license server.
We have details of the license holder in the department and license server details. Installer has been provided.