dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Increase HTCondor spool ramdisk partition from 8GB to 12GB #12156

Open amaltaro opened 1 month ago

amaltaro commented 1 month ago

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
With the migration to Alma9, we started seeing occasional vm_kill events and condor_schedd restarts. When we discussed these with the SI team, Marco M. suggested increasing the production WMAgent HTCondor spool area, which is currently 8GB in size.

Describe the solution you'd like
Follow up with the VoC and gradually increase the /mnt/ramdisk partition from 8GB to 12GB. Nodes that are not in use can be modified right away, while active nodes will have to wait until we can stop their services.
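
A minimal sketch of how a node could be checked against the 12GB target, assuming the spool ramdisk is mounted at /mnt/ramdisk as mentioned above. The helper names `partition_size_gib` and `needs_resize` and the tolerance value are hypothetical. The script only reads the filesystem size through Python's `shutil.disk_usage`; it does not resize anything.

```python
import os
import shutil

GIB = 1024 ** 3  # bytes per GiB


def partition_size_gib(path):
    """Return the total size, in GiB, of the filesystem that holds `path`."""
    return shutil.disk_usage(path).total / GIB


def needs_resize(path, target_gib=12.0, tolerance_gib=0.1):
    """Return True if the filesystem at `path` is smaller than the target.

    The tolerance (an assumption) absorbs small rounding differences in the
    reported size.
    """
    return partition_size_gib(path) < target_gib - tolerance_gib


if __name__ == "__main__":
    ramdisk = "/mnt/ramdisk"  # mount point taken from this issue
    if not os.path.ismount(ramdisk):
        print(f"{ramdisk} is not a mount point on this node")
    elif needs_resize(ramdisk):
        print(f"{ramdisk} is {partition_size_gib(ramdisk):.1f} GiB, below the 12 GiB target")
    else:
        print(f"{ramdisk} already meets the 12 GiB target")
```

Running it on each agent node would show which ones still need the change once they are drained.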

Describe alternatives you've considered
None

Additional context
The latest condor_schedd restart and vm_kill date from Oct/22/2024, on vocms0282.

amaltaro commented 1 month ago

Relevant JIRA ticket: https://its.cern.ch/jira/browse/CMSVOC-598

amaltaro commented 2 days ago

Just a quick update: 6 out of 8 nodes now have the ramdisk set to 12GB. The other 2 nodes are currently in use, and we cannot make this change until we can drain those agents/nodes. Further details are in the ticket above.