UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

Install Request: Update Electrical and Electronic Eng COMSOL to V6.1 [IN06125518] [IN06134254] #548

Open balston opened 10 months ago

balston commented 10 months ago

This is effectively a repeat of:

https://github.com/UCL-RITS/rcps-buildscripts/issues/515

but for Electrical and Electronic Eng, whose reserved application group is legcomsl.

We should still have the installer, but if not, we can get it via the RF ticket.

balston commented 10 months ago

/shared/ucl/apps/comsol/

is the location for EEE COMSOL installations. The COMSOL 6.1 installer is in:

/shared/ucl/apps/pkg-store/COMSOL61_lnx.zip

on both Myriad and Kathleen.

balston commented 10 months ago

We have a build script for 6.1, but it is for the Chem Eng installation. It will need to be copied and modified for EEE.

balston commented 10 months ago

The Myriad install is in progress, running:

 ./comsol-6.1_EEE_install 2>&1 | tee ~/Software/COMSOL/comsol-6.1_EEE_install.log

from the ccspapp account. It is quite slow and will be even slower on Kathleen.

balston commented 10 months ago

It took about 30 minutes for the install to complete. It looks OK:

    ==========================
    =    Post Build Info     =
    ==========================

    Package label:            comsol/6.1
    Build took place in:      /dev/shm/comsol-build.akqQtLR83X
    Modules were put in:      /shared/ucl/apps/comsol/6.1/.uclrc_modules
    Package was installed to: /shared/ucl/apps/comsol/6.1
    Package size:             8.1GB

    -- First execs (max 10) --
    /shared/ucl/apps/comsol/6.1/license/glnxa64/lmcomsol.service
    /shared/ucl/apps/comsol/6.1/license/glnxa64/LMCOMSOL
    /shared/ucl/apps/comsol/6.1/license/glnxa64/lmutil
    /shared/ucl/apps/comsol/6.1/license/glnxa64/lmgrd
    /shared/ucl/apps/comsol/6.1/bin/setuplibpath
    /shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/Catalina/localhost/ROOT.xml
    /shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/Catalina/localhost/webbridge.xml
    /shared/ucl/apps/comsol/6.1/bin/servers/webbridge/conf/server.xml
    /shared/ucl/apps/comsol/6.1/bin/servers/webbridge/webapps/ROOT/WEB-INF/web.xml
    /shared/ucl/apps/comsol/6.1/bin/servers/webbridge/webapps/ROOT/error.html

    ==========================
Note that module files will need altering to handle setting I_MPI_FABRICS per-cluster. This is not currently automated.

Next task: create a module file. This should be straightforward, as the current 6.1 Chem Eng one can be used as a template.

balston commented 10 months ago

Module file done. Now need to test if it works against the EEE License Manager.

balston commented 10 months ago

I can't use my account for testing as I'm not in the correct group (I'm in the Chem Eng one), so I will use ccspapp.

balston commented 10 months ago

Test job submitted:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
9999883 0.00000 Comsol_par ccspapp      qw    08/22/2023 18:11:35                                    8

balston commented 10 months ago

The first test job failed:

---------- Current Progress:  11 % - Assembling matrices
Memory: 5501/7111 12255/14824

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 61514 RUNNING AT node-b00a-013
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Finished.

This was with 1GB memory per core. I've resubmitted it with 4GB per core and it works:

---------- Current Progress: 100 % - 
Solution time: 320 s. (5 minutes, 20 seconds)
Physical memory: 5.08 GB
Virtual memory: 12.62 GB
Ended at Aug 23, 2023, 11:34:57 AM.
----- Stationary Solver 2 in Study 1/Solution 1 (sol1) ------------------------>
Run time: 820 s.
Saving model: /lustre/scratch/scratch/ccspapp/COMSOL_Examples/6.1/Comsol_parallel_1_4177/micromixer_batch.mph_batch_output_4177.mph
Save time: 1 s.
Total time: 835 s.
---------- Current Progress: 100 % - Done
Memory: 4738/8580 11515/15963
Finished.

The 1GB job worked with the Chem Eng COMSOL 6.1 version back in May, so it looks like some jobs need more memory to run than they did.
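The two runs can be compared with a little arithmetic. The sketch below is only illustrative and rests on two assumptions not confirmed in this thread: that the `Memory:` line reports physical current/peak followed by virtual current/peak in MB, and that the job's memory limit is the per-core request multiplied by the slot count (8 slots, per the qstat output above). The function names are made up for the example.

```python
import re

def parse_memory_line(line):
    """Return (phys_now, phys_peak, virt_now, virt_peak) in MB from a 'Memory:' line.

    The field meanings are an assumption about the COMSOL batch log format.
    """
    m = re.search(r"Memory:\s*(\d+)/(\d+)\s+(\d+)/(\d+)", line)
    if m is None:
        raise ValueError(f"not a Memory line: {line!r}")
    return tuple(int(g) for g in m.groups())

def job_limit_mb(gb_per_core, slots):
    """Total memory for the job, assuming limit = per-core request x slots."""
    return gb_per_core * 1024 * slots

SLOTS = 8  # from the qstat output for the test job

# Memory lines copied from the failed (1GB per core) and successful (4GB per core) runs.
failed = parse_memory_line("Memory: 5501/7111 12255/14824")
passed = parse_memory_line("Memory: 4738/8580 11515/15963")

for label, gb in (("1GB per core", 1), ("4GB per core", 4)):
    limit = job_limit_mb(gb, SLOTS)
    print(f"{label}: job limit {limit} MB")
    for name, run in (("failed run", failed), ("successful run", passed)):
        _, phys_peak, _, virt_peak = run
        print(f"  {name}: peak physical fits={phys_peak <= limit}, "
              f"peak virtual fits={virt_peak <= limit}")
```

Under these assumptions, 8 slots at 1GB give 8192 MB in total. The failed run's peak virtual figure (14824 MB) and the successful run's peak physical figure (8580 MB) both exceed that, while 4GB per core (32768 MB) covers every figure in both logs.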

balston commented 10 months ago

I've asked Electrical and Electronic Eng to check their LM logs to make sure licenses were issued from the correct LM!

balston commented 10 months ago

Unfortunately I had forgotten to save the updated module file before uploading it to Myriad, so it still had the Chem Eng LM in there. I've now fixed it, so I need to re-submit the test job.

balston commented 10 months ago

Test job fails with:

Model to run is micromixer_batch.mph
(node-b00a-011:0)

ERROR: A start error occurred on node 0: License_error_-15_Cannot_connect_to_license_server_system
Finished.

Investigating ...

balston commented 10 months ago

The problem is with the EEE firewall. An email has been sent asking for this to be updated.
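A quick way to tell a firewall block apart from a licensing problem is to test plain TCP reachability from a compute node. The following is a minimal sketch, not the check that was actually run here; `can_connect` is a hypothetical helper, and the license server's host name and port are not given in this thread, so any values passed to it are placeholders.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A refused, filtered, or unresolvable address yields False rather than
    raising, so the result can be used directly in a pre-flight check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

It might be called as `can_connect("lm.example.org", 1718)`, where both the host and the port stand in for the real license manager address. FlexNet-style license managers generally use a separate port for the vendor daemon as well as one for `lmgrd`, so both would need checking if this approach were used.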

balston commented 10 months ago

EEE have made a firewall change which appears to have worked. Just need to confirm from the EEE LM logs.

balston commented 10 months ago

Yay we have:

18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:33 (LMCOMSOL) OUT: "COMSOLUSER" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:03:47 (LMCOMSOL) OUT: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:04:33 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:04:33 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "COMSOL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) OUT: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "COMSOLUSER" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:26 (LMCOMSOL) IN: "CLUSTERNODE" ccspapp@node-b00a-005.myriad.ucl.ac.uk
18:14:27 (LMCOMSOL) IN: "SERIAL" ccspapp@node-b00a-005.myriad.ucl.ac.uk

in the EEE LM logs. It works!
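For longer logs, the OUT/IN pairing can be checked mechanically. The sketch below replays the excerpt above and reports any licence feature still checked out at the end. It assumes each event line has the shape `time (daemon) OUT:|IN: "FEATURE" user@host`, as in the excerpt; `outstanding_checkouts` is a hypothetical helper written for this illustration, not part of any FlexNet tooling.

```python
import re
from collections import Counter

# One event per line: time, (daemon), OUT:/IN:, "FEATURE", user@host
EVENT = re.compile(r'^(\d{2}:\d{2}:\d{2}) \((\w+)\) (OUT|IN): "(\w+)" (\S+?)@(\S+)')

def outstanding_checkouts(lines):
    """Replay OUT/IN events and return the non-zero balances per (feature, user, host)."""
    held = Counter()
    for line in lines:
        m = EVENT.match(line.strip())
        if m is None:
            continue  # skip anything that is not an OUT/IN event
        _time, _daemon, action, feature, user, host = m.groups()
        held[(feature, user, host)] += 1 if action == "OUT" else -1
    return {key: n for key, n in held.items() if n != 0}

# The events from the EEE LM log excerpt, with the address written once per line.
HOST = "ccspapp@node-b00a-005.myriad.ucl.ac.uk"
EVENTS = [
    ("18:03:33", "OUT", "SERIAL"), ("18:03:33", "IN", "SERIAL"),
    ("18:03:33", "OUT", "SERIAL"), ("18:03:33", "IN", "SERIAL"),
    ("18:03:33", "OUT", "SERIAL"), ("18:03:33", "OUT", "CLUSTERNODE"),
    ("18:03:33", "OUT", "COMSOLUSER"), ("18:03:47", "OUT", "COMSOL"),
    ("18:04:33", "IN", "SERIAL"), ("18:04:33", "OUT", "SERIAL"),
    ("18:14:26", "IN", "COMSOL"), ("18:14:26", "IN", "SERIAL"),
    ("18:14:26", "OUT", "SERIAL"), ("18:14:26", "IN", "COMSOLUSER"),
    ("18:14:26", "IN", "CLUSTERNODE"), ("18:14:27", "IN", "SERIAL"),
]
LOG = [f'{t} (LMCOMSOL) {action}: "{feature}" {HOST}' for t, action, feature in EVENTS]

print(outstanding_checkouts(LOG))  # {} : every checkout was returned
```

An empty result means every feature checked out by the test job (SERIAL, CLUSTERNODE, COMSOLUSER and COMSOL) was checked back in by the time the job ended.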

balston commented 10 months ago

Done on Myriad and requesters informed.

Will install on Kathleen next.

balston commented 9 months ago

Installing on Kathleen now.

balston commented 9 months ago

The Kathleen installation is taking a very long time in its final stage. I started it just after lunch.

balston commented 9 months ago

It's moved from running the chgrp command to the chmod command, which is the last major step in the install script.

balston commented 9 months ago

The install has finally finished.

Submitting an 80-core test job now.

balston commented 9 months ago

My test job on Kathleen worked.

User informed.