UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

Install request: Comsol Multiphysics for Department of Chemical Engineering [IN03062798] #193

Closed by balston 2 years ago

balston commented 6 years ago

A request to install the commercial package Comsol Multiphysics https://uk.comsol.com/comsol-multiphysics using the Chemical Engineering Department's license server.

We have details of the license holder in the department and license server details. Installer has been provided.

balston commented 6 years ago

I'm contacting the license holders in Chem Eng to ask for confirmation that this is OK and to agree the arrangements for access to the reserved application group.

balston commented 6 years ago

We have the go ahead from the license holder in Chemical Eng to proceed with the installation on Grace.

balston commented 6 years ago

I now have the Comsol installer in:

/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz

balston commented 6 years ago

It looks like Comsol needs Matlab R2018a installed. We haven't installed R2018a on Grace yet, so I'm doing this first.

balston commented 6 years ago

Matlab R2018a installed on Grace.

balston commented 6 years ago

I've now done a test install of Comsol on Grace. Will now try and run one of the multicore tutorial examples and sort out the module file. Use:

COMSOL Multiphysics>Tutorial models>micromixer_cluster

Also need to create a reserved application group.

balston commented 6 years ago

I'm creating the group lgcomsol for the Chemical Eng installation of COMSOL. I'm adding my ucaabaa test account.

balston commented 6 years ago

Group set up. Changed ownership of files and permissions:

cd /shared/ucl/apps/Comsol/
chgrp -R lgcomsol comsol53a
chmod -R o-rwx comsol53a
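
One way to double-check that the ownership and permission change took effect is to list anything that is still outside the group or still accessible to "other". This is only a sketch: `find_exposed` is a hypothetical helper name, and the `-perm /007` test assumes GNU find.

```shell
#!/bin/bash
# List anything under directory $1 that is not owned by group $2
# or that still grants any permission to "other".
# Symlinks are skipped because their own mode bits are always rwxrwxrwx.
# Assumes GNU find (for the "-perm /MODE" syntax).
find_exposed() {
    find "$1" ! -type l \( ! -group "$2" -o -perm /007 \) -print
}

# Example (run from /shared/ucl/apps/Comsol/); no output means nothing is exposed:
#   find_exposed comsol53a lgcomsol
```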

balston commented 6 years ago

Added user requesting Comsol to the reserved application group.

balston commented 6 years ago

Module file pushed and I've got a test job running on Grace 😸

balston commented 6 years ago

The test job finished. However I didn't have time to check that it worked before going on leave.

I've also emailed the user.

balston commented 6 years ago

User has now requested installation on Myriad as their simulation is running out of memory on Grace nodes!

balston commented 6 years ago

Comsol installer copied to Myriad in:

/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz

balston commented 6 years ago

Installation finished for Comsol 53a plus update 4 on Myriad.

Update will need to be applied to Grace installation.

balston commented 6 years ago

Changed ownership of files and permissions.

Now to try running a test job ...

balston commented 6 years ago

Test job submitted and queuing:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  62329 0.00000 Comsol_par ccaabaa      qw    08/31/2018 15:17:13                                   36
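
For reference, the core of a COMSOL batch job script can be sketched as below. The actual job script is not shown in this thread, so this is an assumption-laden illustration: `comsol_batch_cmd` is a hypothetical helper, the file names are taken from the job logs in this issue, and this shared-memory form (`-np`) is simpler than the distributed MPI launch the real test job used.

```shell
#!/bin/bash
# Print (do not run) a single-node COMSOL batch command line.
# $1 = input model, $2 = output model, $3 = number of cores.
# "comsol batch", -np, -inputfile and -outputfile are standard COMSOL options;
# the cluster (MPI) options needed for multi-node runs are deliberately left out.
comsol_batch_cmd() {
    printf 'comsol batch -np %s -inputfile %s -outputfile %s\n' "$3" "$1" "$2"
}

# Example inside a job script, where the scheduler sets $NSLOTS:
#   echo "Model to run is micromixer_batch.mph"
#   $(comsol_batch_cmd micromixer_batch.mph mymodelresult.mph "$NSLOTS")
```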

balston commented 6 years ago

OK job starts, finishes and produces no output! And no error messages either - need to investigate ...

balston commented 6 years ago

At least my next attempt has produced some errors:

Model to run is micromixer_batch.mph
(node-j00a-001:0)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002ad0f49fcc30, pid=30802, tid=0x00002ad0f35de100
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libpthread.so.0+0x9c30]  pthread_mutex_lock+0x0
#
# Core dump written. Default location: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62360/core or core.30802
#
# An error report file with more information is saved as:
# /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62360/hs_err_pid30802.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30802 RUNNING AT node-j00a-001
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30802 RUNNING AT node-j00a-001
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
Finished.

and:

libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4

balston commented 6 years ago

I think the errors have been caused by wrong Fabric settings. I've changed them but now all Comsol licenses are in use so I can't do any further testing at the moment:

 Model to run is micromixer_batch.mph
(node-i00a-002:0)
Node 0 is running on host: node-i00a-002.myriad.ucl.ac.uk
Node 0 has address: node-i00a-002
*******************************************
***COMSOL 5.3.1.348 progress output file***
*******************************************
Fri Aug 31 16:06:06 BST 2018
COMSOL Multiphysics 5.3a (Build: 348) starting in batch mode
Opening file: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_62363/micromixer_batch.mph

/******************/
/*****Error********/
/******************/
Could not obtain license for COMSOL Multiphysics.
License error: -4.
Licensed number of users already reached.
Feature:       COMSOL
License path:  /lustre/shared/ucl/apps/Comsol/comsol53a/multiphysics/license/license.dat:
FlexNet Licensing error:-4,132
For further information, refer to the FlexNet Licensing documentation,
available at "www.flexerasoftware.com".
Total time: 7 s.
Finished.

balston commented 6 years ago

I've now been able to run my test job successfully:

Model to run is micromixer_batch.mph
(node-j00a-001:0)
Node 0 is running on host: node-j00a-001.myriad.ucl.ac.uk
Node 0 has address: node-j00a-001.myriad.ucl.ac.uk
*******************************************
***COMSOL 5.3.1.348 progress output file***
*******************************************
Mon Sep 03 12:55:26 BST 2018
COMSOL Multiphysics 5.3a (Build: 348) starting in batch mode
Opening file: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_66241/micromixer_batch.mph
Open time: 8 s.
Running: Study 1
Running in distributed mode using 1 nodes.
Running on 2 x Intel(R) Xeon(R) Gold 6140 CPU at 2.30 GHz.
Using 2 sockets with 36 cores in total on node-j00a-001.myriad.ucl.ac.uk.
Available memory: 192.90 GB.
           Current Progress:   0 % - Free Tetrahedral 1
Memory: 874/874 12480/12480
Number of vertex elements: 188
           Current Progress:   0 % - Analyzing domains
Memory: 913/913 12491/12491
           Current Progress:   0 % - Adjusting boundary mesh
Memory: 920/920 12496/12496
Number of edge elements: 1974
Number of boundary elements: 13134
           Current Progress:   1 % - Creating initial tetrahedra
Memory: 922/922 12497/12497
           Current Progress:   1 % - Respecting boundaries
           Current Progress:   1 % - Inserting interior points
Memory: 928/928 12507/12507
           Current Progress:   1 % - Improving element quality
Memory: 937/937 12578/12578
Number of elements: 94439
Free meshing time: 1.84s
Minimum element quality: 0.1826
           Current Progress:   2 % - Finalizing mesh
Memory: 942/942 12583/12583
           Current Progress:   2 % - Boundary Layers 1
           Current Progress:   2 % - Inserting boundary layer elements
Memory: 978/978 12616/12616
           Current Progress:   2 % - Smoothing transition to interior mesh
Memory: 1020/1020 12657/12657
           Current Progress:   3 % - Smoothing transition to interior mesh
Memory: 1021/1021 12658/12658
<---- Compile Equations: Stationary in Study 1/Solution 1 (sol1) ---------------
Started at 3-Sep-2018 12:55:48.
---------- Current Progress: 100 % - 
Solution time: 97 s. (1 minute, 37 seconds)
Physical memory: 7.05 GB
Virtual memory: 16.83 GB
Ended at 3-Sep-2018 13:01:09.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 336 s.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_66241/mymodelresult.mph
Save time: 1 s.
Total time: 344 s.
---------- Current Progress: 100 % - Done
Memory: 6679/12756 16157/23130
Finished.

so I'm going to add the correct fabric setting for Myriad to the module file.
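
A per-cluster fabric override of this kind could look roughly like the following in a job script. The thread does not record which value was used, so `shm:tcp` is purely a placeholder, and `select_fabric` / `apply_fabric` are hypothetical helper names; `I_MPI_FABRICS` itself is the Intel MPI environment variable that selects the fabric.

```shell
#!/bin/bash
# Map a fully qualified host name to an Intel MPI fabric setting.
# ASSUMPTION: "shm:tcp" is an illustrative value, not the one chosen in this issue.
select_fabric() {
    case "$1" in
        *.myriad.ucl.ac.uk) echo "shm:tcp" ;;
        *) ;;  # unknown cluster: print nothing and leave the MPI default alone
    esac
}

# Export I_MPI_FABRICS only when a setting is known for the given host.
apply_fabric() {
    local fabric
    fabric=$(select_fabric "$1")
    if [ -n "$fabric" ]; then
        export I_MPI_FABRICS="$fabric"
    fi
}

# Example: apply_fabric "$(hostname -f)"
```

Putting the setting in the module file instead, as done here, means users do not have to remember it in each job script.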

balston commented 6 years ago

Module file updated. I'm now running my test job again on:

balston commented 6 years ago

Informed user.

balston commented 5 years ago

Now requested to be installed on Legion.

balston commented 5 years ago

Installer copied from Myriad to Legion in:

/shared/ucl/apps/Comsol/Installers/COMSOL53a_lnx.tar.gz

readable only by ccsp group.

balston commented 5 years ago

First attempt to install using ccspap2 fails:

Installing to: /shared/ucl/apps/Comsol/comsol53a/multiphysics:

Downloading  Acoustics Module...
Downloading  Acoustics Module Applications...
Downloading  Acoustics Module Documentation...
Downloading  Batteries & Fuel Cells Module...
Downloading  Batteries & Fuel Cells Module Applications...
Downloading  Batteries & Fuel Cells Module Documentation...
Downloading  CAD Import Module...
Downloading  CAD Import Module Applications...
Downloading  CAD Import Module Documentation...
Downloading  CFD Module...
Downloading  CFD Module Applications...
Downloading  CFD Module Documentation...
Downloading  COMSOL Cluster Components...
Downloading  COMSOL Multiphysics...
Downloading  COMSOL Multiphysics Applications...
Downloading  COMSOL Multiphysics Documentation...
Downloading  Chemical Reaction Engineering Module...
com.comsol.install.FlInstException: com.comsol.install.FlAbortException: Service Unavailable
Removing temporary COMSOL installer components...

balston commented 5 years ago

Not sure what caused this. I've re-run the build script again and it has completed without error.

Will now run some tests.

balston commented 5 years ago

Test job used on Myriad submitted requesting 32 cores.

balston commented 5 years ago

Test job on Legion has not worked. Will need to check on Monday.

balston commented 5 years ago

My first test job failed at the end:

/******************/
/*****Error********/
/******************/
The following feature has encountered a problem:
 - Feature: Stationary Solver 2 (sol1/s2)
Undefined value found.
 - Detail: NaN or Inf found when solving linear system using GMRES.
 - Error on node 1: Undefined value found.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_460545/mymodelresult.mph
Save time: 1 s.
Total time: 1505 s.
Finished.

This morning I've run the test again using 16 cores in a single node and it has completed successfully:

---------- Current Progress: 100 % - 
Solution time: 205 s. (3 minutes, 25 seconds)
Physical memory: 3.98 GB
Virtual memory: 11.1 GB
Ended at 1-Jul-2019 11:44:48.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 575 s.
Saving model: /lustre/scratch/scratch/ccaabaa/COMSOL/Comsol_parallel_1_465676/mymodelresult.mph
Save time: 1 s.
Total time: 591 s.
---------- Current Progress: 100 % - Done
Memory: 3692/8785 10643/16454
Finished.

balston commented 5 years ago

Emailed user to say the Legion install is available for testing.

balston commented 5 years ago

See if it is possible to add a pre-job-start license availability check. May be difficult because of the number of different components licensed as part of COMSOL.
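
A minimal sketch of such a check, for one feature at a time, is below. It assumes the FlexNet `lmutil lmstat` tool is available and prints its usual "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)" summary line; `free_seats` is a hypothetical helper name and `LICENSE_SERVER` a placeholder for the port@host of the Chem Eng server. Checking every licensed component would mean calling it once per feature name.

```shell
#!/bin/bash
# Print the number of free seats for FlexNet feature $1,
# reading `lmutil lmstat` output on stdin.
# Returns non-zero if the feature is not listed at all.
free_seats() {
    awk -v feat="$1" '
        # Fields: Users of FEAT: (Total of N licenses issued; Total of M licenses in use)
        $1 == "Users" && $2 == "of" && $3 == (feat ":") {
            print $6 - $11      # issued minus in use
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}

# Example at the top of a job script (LICENSE_SERVER is a placeholder):
#   seats=$(lmutil lmstat -a -c "$LICENSE_SERVER" | free_seats COMSOL) || exit 1
#   [ "$seats" -gt 0 ] || { echo "No free COMSOL licenses" >&2; exit 1; }
```

Note that a seat that is free at the time of the check can still be taken before COMSOL starts, so this reduces failed jobs rather than preventing them.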

mmistry4 commented 3 years ago

Been poking through all the tickets here, seems like the issues with this one are no longer valid.