LANL systems face significant security threats, both insider and outsider. I'm hoping to state the basic considerations for a shared HPC resource and allow others to revise and extend.
Management/Compute isolation
Compute nodes, ideally, will never need to initiate communication with the management plane (comprising the image servers, service nodes, fabric managers, orchestration, etc that allow the HPC system to work as an ensemble). If such communication is unavoidable, it must be strictly separated from any endpoint that could allow an attacker to escalate privileges and act as a domain administrator.
Rapid Updates
Updating regularly for security hygiene or rapidly when vulnerabilities are detected are central to good security in general, and to LANL operations in particular. Any aspect of the HPC environment that prevents rapid testing and rollout of patches to packages or kernel components should be minimized, noted, and eliminated as soon as possible.
Verification
Red-team testing of specialized (small-community) management software, networks, filesystem projection software, and other components that don't get attention from large vendors or communities should be regularly red-teamed by knowledgable internal developers with an assumption of root access on user-facing nodes.
LANL HPC Security Pillars
LANL systems face significant security threats, both insider and outsider. I'm hoping to state the basic considerations for a shared HPC resource and allow others to revise and extend.
Management/Compute isolation
Compute nodes, ideally, will never need to initiate communication with the management plane (comprising the image servers, service nodes, fabric managers, orchestration, etc that allow the HPC system to work as an ensemble). If such communication is unavoidable, it must be strictly separated from any endpoint that could allow an attacker to escalate privileges and act as a domain administrator.
Rapid Updates
Updating regularly for security hygiene or rapidly when vulnerabilities are detected are central to good security in general, and to LANL operations in particular. Any aspect of the HPC environment that prevents rapid testing and rollout of patches to packages or kernel components should be minimized, noted, and eliminated as soon as possible.
Verification
Red-team testing of specialized (small-community) management software, networks, filesystem projection software, and other components that don't get attention from large vendors or communities should be regularly red-teamed by knowledgable internal developers with an assumption of root access on user-facing nodes.
What else do we need in here?