kevinxufs commented 4 years ago

:scroll: Description

This issue will cover all issues related to auditing, monitoring and protecting. Currently in the NHS Cloud security principles there are nine rows that cover this sort of issue:

[ ] Implement a GPG13 compliant Protective Monitoring solution.
[ ] Maintain an accurate inventory of the assets which make up the service, along with their configurations and dependencies.
[ ] Ensure changes to the service are assessed for potential security impact, and the implementation of changes are managed and tracked through to completion.
[ ] Undertake patching or vulnerability management for the guest operating system and application components, within the NCSC best practice timescales.
[ ] Put in place appropriate monitoring solutions to identify attacks against their applications or software.
[ ] Have an incident management process to rapidly respond to attacks.
[ ] Regularly review the access attempts to identify unusual behaviour.
[ ] Use the audit data as part of an effective pro-active monitoring regime.
[ ] Utilise integrated security monitoring and policy management facilities to help detect threats and weaknesses, due to poor design or mis-configuration.

:strawberry: Desired behaviour

We should be looking into the following:

Monitoring for Azure Log Analytics
Azure Sentinel
Azure security

Some of the above issues are non-technical policy changes (e.g. 2, 7, 8) to do with our own due diligence. We should probably expect these to require deciding on a policy and then writing it up as evidence.

The others will require an exploration of the above security and monitoring options, and we should engage with Ian in this discussion.

kevinxufs commented 4 years ago

@jemrobinson @JimMadge @martintoreilly (for reference)

Following our discussions today and last week, here's my assessment of this issue:

We originally started with around nine points based on the NHS cloud security principles, which were then converted into a number of issues. Having gone through all of them, we can now see that there is a lot of work to be done.

Part of this work is implementation: for example, we may just need to implement something like Azure Sentinel or an equivalent solution. Other parts are more difficult, such as working out how we should use audit / monitoring data to inform our security decisions. With logs of user access, for instance, it is not clear how we would use the information, since both successful and unsuccessful access attempts could be good or bad from our perspective.

Given the large amount of work involved here (and the long expected timelines for achieving accreditation) we discussed the possibility of first making larger infrastructure changes. Investing in these infrastructure changes would make further development easier, including logging / monitoring changes. In particular, without changing our current infrastructure to something like Ansible, it would be very difficult to have any kind of automated inventory management system (one of the NHS requirements).

Here is an overview of our planned approaches for our various monitoring / logging issues, and how this may be affected by our architectural changes (ARCH):

GPG13 (see #781 for full list of issues)

Network time protocol
Boundary and Firewall logs
User session logging
log data backups

Implement inventory management.

Without some kind of infrastructure as code, this will be extremely difficult as we would have to develop our own kind of inventory management system.

ARCH: This would be trivial to do if we switch our underlying architecture to use something like Terraform or Ansible
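To illustrate why a declarative description of the infrastructure makes the inventory almost free, here is a minimal Python sketch. It is not Terraform or Ansible; the `Asset` class, the `build_inventory` helper and the example asset names are all hypothetical. The point is only that once assets, their configurations and their dependencies are declared as data, the inventory is a report over that data rather than a separate system to maintain.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Asset:
    """One declared asset (hypothetical schema, not a real tool's format)."""
    name: str
    kind: str
    config: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)


def build_inventory(assets):
    """Check that every dependency is declared, then return rows sorted by name."""
    names = {asset.name for asset in assets}
    for asset in assets:
        missing = [dep for dep in asset.depends_on if dep not in names]
        if missing:
            raise ValueError(f"{asset.name} depends on undeclared assets: {missing}")
    return sorted((asdict(asset) for asset in assets), key=lambda row: row["name"])


if __name__ == "__main__":
    # Example declarations; names and settings are made up for illustration.
    declared = [
        Asset("research-vm", "linux-vm", {"image": "Ubuntu 18.04"}, ["data-container"]),
        Asset("data-container", "storage"),
    ]
    print(json.dumps(build_inventory(declared), indent=2))
```

The dependency check is what would be hard to keep accurate by hand: an asset that refers to something undeclared fails loudly instead of silently drifting out of the inventory.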

Patching and vulnerability management.

This can be fixed for our Windows VMs by simply enabling Windows updates. We currently have a group policy that seems to be blocking this, but that can be changed.

This will be more difficult with our Linux VMs. We could enable automatic updates for Ubuntu 18.04, but there is a bit of nuance here, as it is difficult to separate security updates from regular updates.

ARCH: Switching architecture to something like Ansible / Terraform has a minor benefit in that it will be much easier to implement manual interventions on Linux machines.

Incident management process

We should check with IT to see what they are doing.

Use audit data as part of protective monitoring

In general we should be using some kind of logging system to generate our audit data. We can leave it open at this point as to what this logging system may be. This approach is primarily targeted at user sessions.

The main challenge is determining what to do with our logging data. In particular, it is not clear what kind of logs are 'good' and 'bad'. For example, it isn't clear whether a successful login is a good log. Normally it would be - but if someone managed to hack into our systems then a successful login would presumably be bad for us. We want to be informed when something bad happens, but it's hard to describe rules for what is bad.

Our initial thoughts for some suspicious things are:

A big read from the data container
Repeated (failed) login attempts from the same user
Significantly different locations for the same user
Login from a user name that does not exist
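As a rough sketch of how these four ideas could be written down as rules, here is a small Python example. Everything in it is an assumption made for illustration: the `LogEvent` fields, the `find_suspicious` function, the thresholds, and the simplification that any previously unseen location label counts as "significantly different". It is independent of whichever logging system we end up choosing.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    """One audit record; all field names are hypothetical."""
    user: str
    action: str                      # "login" or "read"
    success: bool = True
    location: Optional[str] = None   # coarse label, e.g. a city or country
    bytes_read: int = 0


def find_suspicious(events, known_users, max_failures=3, max_read_bytes=10**9):
    """Return (rule, user) pairs in the order the rules were triggered."""
    alerts = []
    failures = Counter()   # failed logins per user
    locations = {}         # locations already seen per user
    for event in events:
        if event.action == "read" and event.bytes_read > max_read_bytes:
            alerts.append(("big-read", event.user))
        if event.action != "login":
            continue
        if event.user not in known_users:
            alerts.append(("unknown-user", event.user))
        if not event.success:
            failures[event.user] += 1
            # Alert once, when the threshold is first reached.
            if failures[event.user] == max_failures:
                alerts.append(("repeated-failures", event.user))
        elif event.location is not None:
            seen = locations.setdefault(event.user, set())
            if seen and event.location not in seen:
                alerts.append(("new-location", event.user))
            seen.add(event.location)
    return alerts


if __name__ == "__main__":
    sample = [
        LogEvent("alice", "login", location="London"),
        LogEvent("alice", "login", location="Sydney"),
        LogEvent("mallory", "login", success=False),
    ]
    for rule, user in find_suspicious(sample, known_users={"alice"}):
        print(rule, user)
```

Note that a successful login only triggers a rule here when it comes from a new location, which reflects the point above: success or failure alone does not tell us whether a log is good or bad.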

Next Steps

I think there are two key things.

Change infrastructure

First we need to establish whether we are going to invest development time now in making these architectural changes. Doing so would provide us with great benefits in the long term, and is pretty much a necessity for doing inventory management effectively. On the other hand, it would take a while to get it running, and in the meantime we would not really be responding to any of these issues.

My suggestion (given that we expect security accreditation to take a while, and that we are not necessarily the critical path) is that this is worth doing.

If we do do this, then the next step I think is for us to draw out what an underlying architecture for this would look like and how it might benefit us when we do more development.

Prioritise issues

We now have a huge number of issues tagged as monitoring and / or nhs cloud security. Once we've committed to either changing infrastructure or not, we should then establish the dependencies between the different issues we have (e.g. we should probably establish how to generate logs before we think about a log life cycle policy), and then prioritise them accordingly. Many of the issues are related.

In this prioritisation process we should also think a bit about the stage at which we need to decide on an exact solution / implementation.
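One way to turn the dependencies between issues into a work order is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The issue labels are hypothetical placeholders rather than our real issue numbers; only the "generate logs before log life cycle policy" edge comes from the example above.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of issues that must be done before it.
# Labels are hypothetical placeholders, not real issue numbers.
dependencies = {
    "generate-logs": set(),
    "log-life-cycle-policy": {"generate-logs"},
    "suspicious-activity-rules": {"generate-logs"},
}


def work_order(deps):
    """Return the issues in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for position, issue in enumerate(work_order(dependencies), start=1):
        print(position, issue)
```

`TopologicalSorter` raises `graphlib.CycleError` if two issues depend on each other, which would itself be a useful signal that they need to be merged or redefined.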

kevinxufs commented 4 years ago

[Diagram: monitoring flow]

See diagram for dependencies.

Note #808 is independent and is sorting out machine times.

#819 is the first thing we would need to do as otherwise there would not be anywhere to send the logs to.

rwinstanley1 commented 3 years ago

@JimMadge I know you are looking into the auditing and monitoring solutions for the DSH. Is this issue something that you would like to be open?

It was assigned to me as part of DSPT. I think the work you are currently doing covers us under DSPT, but it's a question of whether any of these are things we want to consider for wider functionality.

JimMadge commented 3 years ago

@rwinstanley1 Do you think DSPT supersedes this? If so I'm happy to close this and the related issues.

I think the DSPT issues on monitoring/logging have good detail of our plans going forward.

rwinstanley1 commented 3 years ago

@JimMadge yes, my instinct is that the DSPT monitoring / logging work supersedes this and further work would be dependent on the outcomes from that.

I'll close this and related issues if you're happy with that!

alan-turing-institute / data-safe-haven

Investigate auditing, monitoring and protecting solutions for Safe Haven #790
