Open balston opened 1 year ago
/shared/ucl/apps/comsol/
is the location for EEE COMSOL installations. The COMSOL 6.1 installer is in:
/shared/ucl/apps/pkg-store/COMSOL61_lnx.zip
on both Myriad and Kathleen.
We have a build script for 6.1, but it is for the Chem Eng installation, so it will need to be copied and modified for EEE.
Myriad install in progress, running:
./comsol-6.1_EEE_install 2>&1 | tee ~/Software/COMSOL/comsol-6.1_EEE_install.log
from the ccspapp account. It is quite slow and will be even slower on Kathleen.
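A side note on the command above: piping through tee means the shell reports tee's exit status, not the install script's, so a failed build can look like a success. A minimal sketch of one way round this, using bash's PIPESTATUS array. The function name run_logged and the stand-in command are made up for illustration; this is not part of the actual build scripts.

```shell
#!/usr/bin/env bash
# Run a command, copy stdout and stderr to a log file, and return the
# command's own exit status rather than tee's.
run_logged() {
    local log="$1"; shift
    "$@" 2>&1 | tee "$log"
    return "${PIPESTATUS[0]}"
}

# Demonstration with a stand-in command that prints a line and fails.
log="$(mktemp)"
if run_logged "$log" sh -c 'echo "installing"; exit 3'; then
    echo "install succeeded"
else
    echo "install failed, see $log"
fi
```

For the real install the call would look something like run_logged ~/Software/COMSOL/comsol-6.1_EEE_install.log ./comsol-6.1_EEE_install.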
It took about 30 minutes for the install to complete. Looks OK:
==========================
= Post Build Info =
==========================
Package label: comsol/6.1
Build took place in: /dev/shm/comsol-build.akqQtLR83X
Modules were put in: /shared/ucl/apps/comsol/6.1/.uclrc_modules
Package was installed to: /shared/ucl/apps/comsol/6.1
Package size: 8.1GB
-- First execs (max 10) --
/shared/ucl/apps/comsol/6.1/license/glnxa64/lmcomsol.service
/shared/ucl/apps/comsol/6.1/license/glnxa64/LMCOMSOL
/shared/ucl/apps/comsol/6.1/license/glnxa64/lmutil
/shared/ucl/apps/comsol/6.1/license/glnxa64/lmgrd
/shared/ucl/apps/comsol/6.1/bin/setuplibpath
/shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/Catalina/localhost/ROOT.xml
/shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/Catalina/localhost/webbridge.xml
/shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/server.xml
/shared/ucl/apps/comsol/6.1/bin/servers/webbridge/webapps/ROOT/WEB-INF/web.xml
/shared/ucl/apps/comsol/6.1/bin/servers/webbridge/webapps/ROOT/error.html
==========================
Note that module files will need altering to handle setting I_MPI_FABRICS per-cluster. This is not currently automated.
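One way the per-cluster I_MPI_FABRICS setting could eventually be automated is to look the value up from the cluster name when the module file is generated. The sketch below only illustrates the idea: the function name is made up, and the fabric values shown for each cluster are placeholder assumptions, not the values in the real module files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a cluster name to an I_MPI_FABRICS value.
# The values below are placeholders for illustration only; check the
# existing per-cluster module files for the real settings.
fabrics_for_cluster() {
    local cluster="$1"
    case "$cluster" in
        myriad)   echo "shm:tcp" ;;   # assumed value
        kathleen) echo "shm:ofi" ;;   # assumed value
        *)        echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac
}

# Example: emit the line a Tcl module file would need for a given cluster.
cluster="kathleen"
if fabrics="$(fabrics_for_cluster "$cluster")"; then
    echo "setenv I_MPI_FABRICS $fabrics"
fi
```

Failing on an unknown cluster name, rather than falling back to a default, keeps a wrong fabric setting from being written silently.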
Next task: create a module file. This should be straightforward, as the current 6.1 Chem Eng one can be used as a template.
Module file done. Now need to test if it works against the EEE License Manager.
I can't use my own account for testing as I'm not in the correct group (I'm in the Chem Eng one), so I will use ccspapp.
Test job submitted:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
9999883 0.00000 Comsol_par ccspapp qw 08/22/2023 18:11:35 8
The first test job failed:
- Current Progress: 11 % - Assembling matrices
Memory: 5501/7111 12255/14824
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 61514 RUNNING AT node-b00a-013
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Finished.
This was with 1GB memory per core. I've resubmitted it with 4GB per core and it works:
---------- Current Progress: 100 % -
Solution time: 320 s. (5 minutes, 20 seconds)
Physical memory: 5.08 GB
Virtual memory: 12.62 GB
Ended at Aug 23, 2023, 11:34:57 AM.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 820 s.
Saving model: /lustre/scratch/scratch/ccspapp/COMSOL_Examples/6.1/Comsol_parallel_1_4177/micromixer_batch.mph_batch_output_4177.mph
Save time: 1 s.
Total time: 835 s.
---------- Current Progress: 100 % - Done
Memory: 4738/8580 11515/15963
Finished.
The 1GB job worked with the Chem Eng COMSOL 6.1 version back in May, so it looks like some jobs need more memory to run than they did.
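A rough sanity check of the memory numbers, as a sketch. It assumes the per-core request is multiplied by the slot count (8 slots for the first test job), that the second pair in COMSOL's "Memory:" line is current/peak virtual memory in MB, and that the limit applies to virtual memory. None of these is confirmed here, and the function names are made up.

```shell
#!/usr/bin/env bash
# Total memory available to a job, in MB, assuming the per-core request
# is multiplied by the number of slots (an assumption about the scheduler).
total_mem_mb() {
    local slots="$1" gb_per_core="$2"
    echo $(( slots * gb_per_core * 1024 ))
}

# Succeeds if the observed peak (MB) fits inside the total request.
fits_in_request() {
    local peak_mb="$1" slots="$2" gb_per_core="$3"
    [ "$peak_mb" -le "$(total_mem_mb "$slots" "$gb_per_core")" ]
}

# Peak virtual figure from the failed run's "Memory:" line, 8 slots.
peak_virtual_mb=14824
for gb in 1 4; do
    if fits_in_request "$peak_virtual_mb" 8 "$gb"; then
        echo "${gb}GB per core: peak fits within $(total_mem_mb 8 "$gb") MB"
    else
        echo "${gb}GB per core: peak exceeds $(total_mem_mb 8 "$gb") MB"
    fi
done
```

Under those assumptions the 1GB request (8192 MB in total) is below the 14824 MB peak seen before the kill, while the 4GB request (32768 MB) leaves plenty of room.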
I've asked Electrical and Electronic Eng to check their LM logs to make sure licenses were issued from the correct LM!
Unfortunately I had forgotten to save the updated module file before uploading it to Myriad, so it still had the Chem Eng LM in there. I've now fixed it, so I need to re-submit the test job.
Test job fails with:
Model to run is micromixer_batch.mph
(node-b00a-011:0)
ERROR: A start error occurred on node 0: License_error_-15_Cannot_connect_to_license_server_system
Finished.
Investigating ...
The problem is with the EEE firewall. An email has been sent asking for this to be updated.
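For next time, a quick way to tell a firewall block from a licensing problem is to test from a compute node whether the license server port accepts a TCP connection. A minimal sketch, assuming the server is written in FlexNet's port@host form. The host and port below are placeholders, not the real EEE license server, and the function names are made up. It relies on bash's /dev/tcp and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Split a FlexNet-style "port@host" string into its two parts.
license_port() { echo "${1%%@*}"; }
license_host() { echo "${1#*@}"; }

# Try a TCP connection using bash's /dev/tcp; give up after 5 seconds.
# Returns 0 if the port accepted a connection, non-zero otherwise.
can_connect() {
    local host="$1" port="$2"
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Placeholder value: substitute the real EEE license server setting.
server="1718@license-server.example.invalid"
if can_connect "$(license_host "$server")" "$(license_port "$server")"; then
    echo "port $(license_port "$server") reachable"
else
    echo "cannot reach $(license_host "$server") on port $(license_port "$server")"
fi
```

A FlexNet server listens on two ports (lmgrd and the vendor daemon), so both need checking and both need opening in the firewall.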
EEE have made a firewall change which appears to have worked. Just need to confirm from the EEE LM logs.
Yay we have:
18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "COMSOLUSER" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:47 (LMCOMSOL) OUT: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:04:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:04:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "COMSOLUSER" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:27 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
in the EEE LM logs. It works!
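To check from a log excerpt like the one above that every checked-out feature was also returned, the OUT and IN lines can be tallied per feature. A small awk sketch, assuming the line layout shown here (time, daemon name, OUT: or IN:, quoted feature name, user@host). The function name is made up.

```shell
#!/usr/bin/env bash
# Print "feature outs ins" for each feature seen on stdin,
# for lines of the form: HH:MM:SS (LMCOMSOL) OUT: "FEATURE" user@host
tally_checkouts() {
    awk '$3 == "OUT:" || $3 == "IN:" {
             feature = $4
             gsub(/"/, "", feature)
             if ($3 == "OUT:") outs[feature]++; else ins[feature]++
             seen[feature] = 1
         }
         END {
             for (f in seen) printf "%s %d %d\n", f, outs[f], ins[f]
         }' | sort
}

# Small sample in the same layout as the EEE LM log.
tally_checkouts <<'EOF'
18:03:33 (LMCOMSOL) OUT: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:47 (LMCOMSOL) OUT: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
EOF
```

A feature whose two counts differ is either still checked out or was checked out before the excerpt starts.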
Done on Myriad and requesters informed.
Will install on Kathleen next.
Installing on Kathleen now.
The Kathleen installation is taking a very long time in its final stage. I started it just after lunch.
It's moved from running the chgrp command to the chmod command, which is the last major step in the install script.
The install has finally finished.
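For reference, the general shape of the chgrp and chmod steps mentioned above, as a sketch only. The real install script may use different modes and options, and the function name here is made up. Both commands walk the whole 8GB tree recursively, which is probably why this stage is slow on a shared filesystem.

```shell
#!/usr/bin/env bash
# Give a directory tree to one group and remove access for everyone else.
restrict_to_group() {
    local dir="$1" group="$2"
    chgrp -R "$group" "$dir" || return 1
    # g+rX: group can read; search/execute is added only for directories
    #       and for files that already have an execute bit set.
    # o-rwx: no access for users outside the group.
    chmod -R g+rX,o-rwx "$dir"
}

# Demonstration on a throwaway directory, using the caller's own numeric
# primary group ID as a stand-in for the reserved application group.
demo="$(mktemp -d)"
touch "$demo/example.txt"
restrict_to_group "$demo" "$(id -g)"
ls -l "$demo"
rm -rf "$demo"
```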
Submitting an 80-core test job now.
My test job on Kathleen worked.
User informed.
This is effectively a repeat of:
https://github.com/UCL-RITS/rcps-buildscripts/issues/515
but for Electrical and Electronic Eng, whose reserved application group is legcomsl.
We should still have the installer, but if not, we can get it via the RF ticket.