P10: Data Discovery, Location and Context

GuillaumeRoss commented 1 year ago

I feel like this could potentially be merged with just "Discovery" to simplify things, at least in its current state.

See doc for comment placement

stods21 commented 1 year ago

@GuillaumeRoss @merkletrie @ESPLouis: I propose that we keep two sections but signpost these a lot better.

For me, Data Discovery needs to be "Data Discovery: Data Types", and "Data Discovery, Location and Context" is much more about the flow of information and obtaining context based on data movement, interviews with staff, etc.

So -- the first section covers the "yeah, we have PAN in this repo" case, whereas the second covers "we stored Python code in this GH repo and it is used to build a critically important data ingestion function".

We could collapse these into one, but I think the metrics for success are different. This will need some discussion at the working group, but I think two categories work with some better signposting.

ESPLouis commented 1 year ago

This is an interesting one. We are moving to a place of everything as code: Security, Infrastructure, Business Process, etc. Do people think that for classification we should be proposing a set of base labels to describe context, or am I just muddying the waters here?

There is a bit around this on the Microsoft assurance pages: https://learn.microsoft.com/en-us/compliance/assurance/assurance-data-classification-and-labels

Location could be dropped into discovery, but I think classification should stand on its own. You discover locations where your data might reside (that could be proxy logs, net flow, etc.), and next you classify what the data is.

The location plays an important role, as critical data stored in a non-resilient and insecure location would be high risk.

stods21 commented 1 year ago

I think labels are certainly part of the DSMM model and I also support the everything as code paradigm shift suggestion. I mean, that's why we're creating DSMM to focus on data-first, right? 😄

Where people know what the information is and its sensitivity, then labels/tagging is a solid approach. Challenges come when companies don't know what the data flow is (haven't defined specific repositories or fully threat modelled all the egress points).

I think classification needs to stay as is. Often, the stakeholders involved are non-security folk whereas the tools and methods for Discovery and Location would be those on the tech side.

So, as of now, we have:

Data Discovery

Finding data:

- All data covered by the Security Program
- Progress based on expansion of coverage

L1 - Identify data that exists most of the time: high risk, known repos, structured. Tagged.
L2 - Aligned with goals at a business department level. Less structured.
L3 - No need to tag. DDR-like, etc.

Data Location Discovery and Context

Where is data located:

- Where is all data located?
- How did it get there?
- Data traversal

L1: Manual processes -- surveys, discussions
L2: Automated discovery -- DLP, Shadow IT tools
L3: Data tracing
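
As a rough illustration only, the two outlines above could be captured as data, for example to track progress per category. This is a minimal sketch: the class names, field names and `describe` helper are hypothetical and not part of DSMM, and the level summaries are paraphrased from the lists above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MaturityLevel:
    level: int      # 1, 2 or 3, matching L1-L3 in the outline
    summary: str    # paraphrased from the thread, not official wording


@dataclass(frozen=True)
class Category:
    name: str
    question: str   # the question the category is meant to answer
    levels: Tuple[MaturityLevel, ...]

    def describe(self, level: int) -> str:
        """Return a one-line description of the given maturity level."""
        for lvl in self.levels:
            if lvl.level == level:
                return f"{self.name} L{level}: {lvl.summary}"
        raise ValueError(f"{self.name} has no level {level}")


# Hypothetical encoding of the two categories as drafted in this thread.
DATA_DISCOVERY = Category(
    name="Data Discovery",
    question="What data do we have?",
    levels=(
        MaturityLevel(1, "High-risk, structured data in known repos; tagged"),
        MaturityLevel(2, "Aligned with business department goals; less structured"),
        MaturityLevel(3, "No need to tag; DDR-like"),
    ),
)

DATA_LOCATION = Category(
    name="Data Location Discovery and Context",
    question="Where is data located and how did it get there?",
    levels=(
        MaturityLevel(1, "Manual processes (surveys, discussions)"),
        MaturityLevel(2, "Automated discovery (DLP, Shadow IT tools)"),
        MaturityLevel(3, "Data tracing"),
    ),
)

if __name__ == "__main__":
    for category in (DATA_DISCOVERY, DATA_LOCATION):
        print(category.question)
        for lvl in category.levels:
            print("  " + category.describe(lvl.level))
```

Keeping the two categories as separate objects mirrors the point that their success metrics differ, even though both use the same L1 to L3 scale.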

cadderly1 commented 1 year ago

I do think these are two separate areas, but in re-reading everything I wonder if the name "Data Discovery, Location and Context" is too vague. I would support "Data Storage Location and Movement Discovery" in order to separate the question "What are our data?" more clearly from "Where are our data?"

stods21 commented 1 year ago

@cadderly1 -- pull the latest from Main. I wrote pretty much exactly what you are saying ❤️ 👍

C3WG / DSMM

P10: Data Discovery, Location and Context #19