andrewstucki commented 4 years ago

Problem

PID re-use is common for both Windows and POSIX systems across:

Long periods of time
OS restarts
Lots of quickly spawned processes

Identifying a process uniquely across time, by generating an identifier with more specificity than the standard system-level PID, is fairly common in auditing and security analysis. For example, Sysmon generates unique GUIDs for processes by roughly incorporating:

the machine ID of the OS installation
the process start time
the access token corresponding to the logged in user's session

Clearly some of this is pretty specific to Windows internals and can't necessarily be re-used across POSIX systems.

The thought is to:

Introduce the concept of a unique_pid into ECS, and
Potentially introduce something like network "community id" except for processes as a well-known standard for generating this value

If we decide that this is a useful construct to introduce into ECS, we have three potential options I can think of:

Introducing the concept of unique_pid and leaving it implementation defined (skipping 2 above).
Standardizing on the generation mechanism and leaning towards a "globally unique" identifier that incorporates information about the host the process is running on. This would allow us to uniquely identify a process in a bucket of time series data that includes information from multiple hosts.
Standardizing on the generation mechanism and leaning towards a "locally unique" identifier based solely off of the process metadata itself. This would allow us to uniquely identify a process in a bucket of time series data corresponding to a single host.

Here are some thoughts on each.

Implementation Defined

We could punt on the idea of specifying a generation algorithm and let an ECS field be implementation defined. In this case, we'd allow for each data source (beats, endpoint, etc.) to ensure its own generation method is unique enough for its own use cases. Main problem with this is that by punting on making the generation mechanism standard, we lose the ability for applications to correlate events from multiple data sources.

Globally Unique

While this potentially buys you the ability to easily correlate processes from disparate sources, the difficult issue is being able to incorporate a standard mechanism for truly uniquely identifying a host where a process is running. For example, in the case of the Sysmon guid generation, the host identifying material--the machine ID--is tied explicitly to the installation of the OS. This potentially becomes problematic when you're using any sort of VMs/VDIs/Containerized systems as you could have multiple VMs with the same installation information contained within them.

Locally Unique

Ideally you'd be able to generate something like this solely from information about the process itself, without using material that attempts to uniquely identify the host. This would allow someone trying to correlate processes across data sources to construct a tuple of unique_pid and some other material (host.id, host.ip, etc.) in their own context to uniquely identify a process amidst events from multiple sources.

Proposal

My initial preference is to try and go the route of:

Specifying a simple generation algorithm and
Making the pid globally unique with the caveat that if you want to be robust against some sort of re-use across baked OS images, you might need to take care to include additional identification material as some sort of tuple.

I'm going to shamelessly pitch a version of what @rw-access did in an initial internal discussion we had around unique pid generation with the following rough idea around what goes into forming a unique pid:

hash(machine identifier, system pid, process start time)

With the definition of how we get machine identifier needing to be standardized -- something like /etc/machine-id from Linux, HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid for Windows, and whatever equivalent for Mac/any other system?
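A minimal sketch of the hash idea above, in Python. SHA-256, the `|` separator, the base64 encoding, the nanosecond start time and the name `unique_pid` are all assumptions made for illustration; nothing in this proposal fixes any of them yet.

```python
import base64
import hashlib


def unique_pid(machine_id: str, pid: int, start_time_ns: int) -> str:
    """Hash (machine identifier, system pid, process start time) into one ID.

    Hypothetical helper: the hash function, field separator and encoding
    are illustrative choices, not part of any agreed spec.
    """
    material = f"{machine_id}|{pid}|{start_time_ns}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


# The same PID reused later on the same machine yields a different identifier,
# because the start time is part of the hashed material.
first = unique_pid("example-machine-id", 4242, 1_580_000_000_000_000_000)
reused = unique_pid("example-machine-id", 4242, 1_580_000_360_000_000_000)
```

Any two producers that agree on the three inputs and on these encoding details would derive the same value, which is what would make cross-source correlation possible.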

In ECS we could then create a new field (or two) corresponding to unique_pid (and unique_ppid) under the process namespace.

Thoughts?

ferullo commented 4 years ago

Implementation Defined

I like this option. The other two are nice but they come with a fair amount of complexity. I'd rather go this route at first if not forever. Could we include an enum next to the ID describing which algorithm was followed? At first only "product-unique" would be supported. "ecs-local-unique" could be added later, for example.

Globally Unique

I don't like this option. It sounds nice but it is an awful lot to expect. It requires some sort of central management of the globally unique data source for many products to share. An individual product could implement this, but that would devolve into a clever version of "Implementation Defined".

Locally Unique

I think we'd have to define a series of schemes to use for the system data for each OS (in case a first attempt at gathering a machine identifier, e.g. /etc/machine-id doesn't exist on an endpoint). This is do-able, but I think it would be overshooting.

Realistically, if process start time (down to nanoseconds) and pid are included in events from two different products, the match-up can be done with pretty high confidence without a shared unique pid. And if ECS consumers implement it that way, cross-product match-ups would work even if the products didn't "try" to join this scheme.
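As a rough sketch of that kind of consumer-side match-up, in Python: events from two products are paired on the tuple (pid, start time). The field names `pid` and `start_ns` are hypothetical, and the sketch assumes all events come from the same host and report start time at the same precision.

```python
from typing import Dict, List, Tuple

Event = Dict[str, object]


def correlation_key(event: Event) -> Tuple[int, int]:
    """Key a process event by (pid, start time in nanoseconds)."""
    return (int(event["pid"]), int(event["start_ns"]))


def match_events(source_a: List[Event], source_b: List[Event]) -> List[Tuple[Event, Event]]:
    """Pair up events from two products that describe the same process."""
    index = {correlation_key(e): e for e in source_a}
    return [
        (index[correlation_key(e)], e)
        for e in source_b
        if correlation_key(e) in index
    ]


# Hypothetical events from two different products on one host.
product_a = [{"pid": 4242, "start_ns": 1_580_000_000_123_456_789, "name": "sshd"}]
product_b = [
    {"pid": 4242, "start_ns": 1_580_000_000_123_456_789, "exe": "/usr/sbin/sshd"},
    {"pid": 4242, "start_ns": 1_580_000_999_000_000_000, "exe": "/bin/ls"},  # reused pid
]
matches = match_events(product_a, product_b)
```

Only the first event from the second product pairs up; the later process that reused pid 4242 has a different start time and is left unmatched.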

andrewstucki commented 4 years ago

Just to clarify these two bits:

It requires some sort of central management of the globally unique data source for many products to share.

in case a first attempt at gathering a machine identifier, e.g. /etc/machine-id doesn't exist on an endpoint

So my distinction between "globally" unique and "locally" unique is to distinguish between generating identifiers that:

globally: attempt to incorporate unique host identification material (like something from /etc/machine-id for example) or
locally: don't do the above but solely use process-level information without trying to incorporate some sort of fingerprinting of the host

not intending us to have some sort of centralized data source.

@dferullo-elastic : It sounds like with that clarification you're advocating for the "implementation defined" idea with maybe some additional metadata to allow for what I'm calling "globally unique" in the future?

rw-access commented 4 years ago

Ideally for me, this field is implementation defined. Anything else seems impossible.

I think that this generation will have to happen on the endpoint, and it would be hard to tell every endpoint solution that has its data ingested into ES to "align" with our approach. It seems unreasonable to me for an end-user of ECS to expect correlation across different data sources when each has a different approach.

andrewstucki commented 4 years ago

it would be hard to tell every endpoint solution that has its data ingested into ES to "align" with our approach

I guess my thought is that it all depends on how we define this field. For example, here are some open source projects that define fingerprinting mechanisms for trying to uniquely identify TLS handshakes, SSH connections, and network flows:

https://github.com/salesforce/ja3 https://github.com/salesforce/hassh https://github.com/corelight/community-id-spec

ECS already has a field included in it for specifying values that correspond to community-id network flows: network.community_id, so if we want to, why not make something similar for processes that could be used across multiple sources?

With regard to incorporating an implementation defined unique_pid, my main thought is that it gets you far less utility than having something well-known, and, if you really want to tie a field to a particular implementation, then you could always include a custom field namespaced to your implementation (i.e. mysource.process.unique_id). But I also don't completely write-off the utility of having a place in core ECS just to drop some sort of ad-hoc fingerprint of your choice.

rw-access commented 4 years ago

Community id isn't actually unique though. That's the point. It's a deterministic function f(source.address, source.port, destination.address, destination.port), that doesn't intend to solve the reuse problem that we have with PIDs.
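To make the distinction concrete, here is a toy deterministic flow hash in Python. It is deliberately not the real Community ID algorithm, just a stand-in function of the four fields named above, and the name `flow_id` is made up for this sketch.

```python
import hashlib


def flow_id(src_addr: str, src_port: int, dst_addr: str, dst_port: int) -> str:
    """Toy deterministic flow hash; NOT the real Community ID algorithm."""
    material = f"{src_addr}:{src_port}>{dst_addr}:{dst_port}".encode("utf-8")
    return hashlib.sha1(material).hexdigest()


# Two unrelated connections, hours apart, that happen to reuse the same
# 4-tuple get the same ID, because no time component is hashed.
morning = flow_id("10.0.0.5", 49152, "10.0.0.9", 443)
evening = flow_id("10.0.0.5", 49152, "10.0.0.9", 443)
other = flow_id("10.0.0.5", 49153, "10.0.0.9", 443)
```

That behavior is fine for grouping flow records, but it is exactly the reuse problem a unique pid is supposed to avoid.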

With unique_pid, if we did have a function, then we have to derive the pid at process creation time, because all of the fields necessary aren't guaranteed to exist in subsequent events. We either have to do this within the sensor, or statefully track it at ingest, which isn't really an option.

A community shared hashing algorithm would be awesome to start and brainstorm, but I think we need room in ECS for implementation-specific globally unique, non-reusable pids. For now at least, I see that as the most accommodating of different solutions.

jrmolin commented 4 years ago

if we punt on the implementation, we should at least agree on a format that the id will take, and perhaps discuss characteristics of the id.

does any of the following make sense to everyone?

sortable (some fields can be random, but others define an ordering by pid/start time/ppid)
have some ancestry baked in (thinking just ppid)
is a hyphen-separated string of hex characters
or is a hyphen-separated string of base64-encoded characters
or has no hyphens at all

i have looked at options, and i can write up some pseudocode (or actual code) and some collision metrics for the top three (unless the actual collisions i come across are just terrible).

ferullo commented 4 years ago

I agree with @rw-access that we should support implementation-specific. If we want to also define what the "ECS standard" is I think that's ok, but the schema should support either mode.

Thanks for correcting my understanding @andrewstucki , I now think "globally unique" is do-able for products and should remove the need for "locally unique".

Regarding @jrmolin's point, I agree, it would be nice to come to a decision on the format of an ID. I'll add to his list that if it doesn't have to look like a UUID is there an expected max-length for the string and/or permissible characters?

For "sortable", I'd vote no. The only way we could achieve that is by mandating the ID format, which is at odds with allowing an implementation specific option. And for "ancestry" I also vote no for the same reason, as well as because ppid can change over time for a process (e.g. a POSIX process that is daemonized and re-parents to init).

cwurm commented 4 years ago

hash(machine identifier, system pid, process start time)

This is exactly what the Auditbeat system/process dataset does: https://github.com/elastic/beats/blob/dd99d7e1401ce959ac5e7d0035145b625713a669/x-pack/auditbeat/module/system/process/process.go#L120-L127

It uses it to fill a process.entity_id field, e.g. https://github.com/elastic/beats/blob/dd99d7e1401ce959ac5e7d0035145b625713a669/x-pack/auditbeat/module/system/process/_meta/data.json#L14

With the definition of how we get machine identifier needing to be standardized -- something like /etc/machine-id from Linux, HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid for Windows, and whatever equivalent for Mac/any other system?

Beats fills host.id for each platform like this:

Windows: HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid (machineid_windows.go)
Linux: /etc/machine-id, /var/lib/dbus/machine-id, or /var/db/dbus/machine-id (machineid.go)
macOS: gethostuuid() (machineid_darwin_amd64.go)
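For illustration only, a Python sketch of the Linux fallback chain listed above. This is not the Beats code; the function name `read_machine_id` and the try-in-listed-order behavior are assumptions, and the Windows registry and macOS `gethostuuid()` lookups are left out.

```python
from pathlib import Path
from typing import Iterable, Optional

# Linux locations listed above, assumed to be tried in this order.
LINUX_MACHINE_ID_PATHS = (
    "/etc/machine-id",
    "/var/lib/dbus/machine-id",
    "/var/db/dbus/machine-id",
)


def read_machine_id(paths: Iterable[str] = LINUX_MACHINE_ID_PATHS) -> Optional[str]:
    """Return the first non-empty machine ID found, or None if none exists."""
    for candidate in paths:
        try:
            value = Path(candidate).read_text(encoding="utf-8").strip()
        except OSError:
            continue  # missing or unreadable: fall through to the next path
        if value:
            return value
    return None
```

Returning `None` rather than raising leaves it to the caller to decide what to do on an endpoint where no machine identifier can be found.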

andrewstucki commented 4 years ago

@cwurm nice, good to know we're not treading new ground then. What are your thoughts around the idea of formalizing a spec for something like entity_id (if that's what we want to call it?) that, since it would be formalized, would allow for alignment/pivoting across data sources?

Or, do you think that, alternatively, creating an implementation defined field for unique identification of processes would be worthwhile?

cwurm commented 4 years ago

@andrewstucki We could, no reason to do incompatible things in our own products if we don't have to.

I'm not sure how often several tools would be observing the same entity, like a process. Beats/Endpoint are a special situation, and even they, I suspect, would most often not run at the same time on the same machine. Network community ID works because the information for it is present everywhere on hosts or in the network. In contrast, the host.id is usually kept private on a machine.

ferullo commented 4 years ago

The scheme Beats uses looks good and using it for all Elastic products is a nice idea. How that is laid out in ECS and how data from products that don't use that scheme is laid out is outside my expertise.

One caveat for the Beats scheme is that it does not include hashing an id unique to each Beats install. A nice aspect of this is that it would make it easy for other products (Elastic produced or otherwise) to generate the same unique process id for cross-product correlation. A downside is that it adds the risk of id collision across VDI or VDI-like scenarios, when two VM instances might share the same hardware GUID. This is a trade-off I'm happy with if others are.

andrewstucki commented 4 years ago

So, I took a second look at the entity_id stuff that auditbeat is doing, and I noticed that it's currently truncating its hash calculation to 12 bytes here--base 64 encoded, that gives you a 16 character string.

Not to completely derail this with particularities of implementation details, but I just wrote some quick calculations for hash collisions (assuming my shallow understanding of birthday attack collisions is right: @jrmolin ) based off of the fairly small ids that auditbeat is churning out.

Here's the little calculator https://jsfiddle.net/zbvnd7q5/.

Basically this assumes you have 100k hosts, each host creates roughly 1k processes per second on average, and in a 3 year time period, you want to have only a 0.1% chance of a hash collision (a kind of realistic worst-case scenario). Under those circumstances, if we actually want to uniquely identify something, we'd have to choose at least a 20 character string; with only a 16 character string you'd risk a collision within a few days at this scale.
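The same estimate can be reproduced with a few lines of Python. This is a sketch using the standard birthday approximation, not the code behind the linked calculator; it assumes hash outputs are uniformly distributed and that each base64 character carries 6 bits.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def bits_needed(total_ids: float, collision_probability: float) -> float:
    """Bits of ID space needed to keep the birthday-collision chance below p.

    Uses the approximation p ~= 1 - exp(-n^2 / (2N)) for n IDs drawn
    uniformly from a space of N values.
    """
    space = total_ids ** 2 / (2 * -math.log1p(-collision_probability))
    return math.log2(space)


def ids_until_probability(bits: int, collision_probability: float) -> float:
    """Number of IDs after which the collision chance reaches p."""
    return math.sqrt(2 * (2.0 ** bits) * -math.log1p(-collision_probability))


hosts = 100_000
processes_per_second = 1_000
rate = hosts * processes_per_second            # IDs generated per second
total = rate * 3 * SECONDS_PER_YEAR            # IDs generated in 3 years

required_bits = bits_needed(total, 0.001)
required_chars = math.ceil(required_bits / 6)  # base64 carries 6 bits per character

# A 12-byte (96-bit) truncated hash is 16 base64 characters.
days_for_96_bits = ids_until_probability(96, 0.001) / rate / 86_400
```

With these inputs the requirement comes out at roughly 115 bits, i.e. 20 base64 characters, and a 96-bit ID reaches the 0.1% collision chance after between one and two days.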

So, looks like, internally, for the purposes of standardizing on something, we can't use the beats implementation and expect not to have collisions at scale. We'd have to keep more of the ID around. Chatting with @ferullo some more, if we don't mind long IDs, we could likely come up with an alternative encoding scheme that doesn't require hashing, which is really only there for getting a uniform distribution for a fixed ID length anyway. I can open up an issue internally to maybe discuss this between both beats and endpoint implementation so we can standardize.

Thoughts about opening up that discussion internally @cwurm , @ferullo , @jrmolin ?

All this said, I still think there is quite a bit of value in creating an open standard for what we want this value to be--it gives us the best compatibility out of the box with multiple data sources for things like correlating data in visualizations in Kibana, which is really one of the main drivers for ECS.

rw-access commented 4 years ago

This might sound a little silly, but there's a downside with having multiple agents populate this field the same way. It's no longer unique.

What would the impact be for rules or other analysis if there are two process creation events for the same pid with the same unique pid? That complicates things and could break things. It could have some advantages, but I think this would require more brainstorming.

I'm okay with documenting how we generate a UUID (my preference, but I understand if we want a b64 encoding instead of hex because it's a smaller representation). But I'm still resistant to the idea of multiple solutions having the same methodology for generating a pid.

There is a semi-pressing need for having a globally unique pid with the Endpoint (former Endgame) data. I would rather prioritize figuring that out first, and then later coming back to the approach of a shared algorithm for generating a pid.

webmat commented 4 years ago

Thanks @andrewstucki for opening this well-written issue. And thanks everyone for chiming in!

I'd like to explore the requirement a little more. I feel like it will help steer us in one of the proposed directions, and help us accept / document the associated downsides.

The one requirement that's very straightforward and not under debate is having a more unique PID that doesn't get reused as fast as the 65k numeric one.

But I'd like to clarify some other potential requirements:

Should this ID be opaque, or can it be a straightforward concatenation?
- hashing it makes it opaque and uniform in length. A straightforward concatenation of a machine ID, a PID and a nanoseconds-precision unix timestamp is gonna yield looong IDs :-)
- not hashing it can open possibilities such as sortability. I'm not convinced we need sortability for this, I'd like to hear the case for it.
- transparency can offer some resiliency against timestamps of various precisions. With a bit of work, one could correlate a unique PID from two sources, one that uses nanoseconds precision and one that truncates at millis (as an example), if the format is transparent like [machine id]-[pid]-[start time, unix ts]
Is uniqueness across multiple hosts necessary, or just a nice to have?
Implementation simplicity
- Ideally I'd like something that can be calculated from a pipeline, just as much as from an endpoint.
- As long as the agent sends along the host.id as metadata on process-related events, I think we can calculate even the globally unique PID from a pipeline.
If we decide to hash / transform, which algo should we use?
- If we only need uniform length out of hashing, MD5 is likely our best bet
- If we want reversibility (but not uniform length), base64
- If we need something more solid than MD5, then it depends on which characteristic we're looking for.
Do we need this better PID to be the same across products?
- I think so. It's common for big environments to have a heterogeneous set of sources. A better PID that can be applied to them all would make it easier to leverage.
What are the consequences of collisions across hosts?
- Any combination of timestamp, PID and potentially host ID gives us something dramatically better than existing PIDs already. But what's the actual impact of having the same unique PID popping up from two hosts, or within a few days?
Stability of the calculation
- Machine ID can change when someone decides to do it, but typically this only happens upon initial configuration of the machine or VM. So I'm not too concerned about it. Someone who deliberately resets a machine ID without restarting the host will have a lot of different problems.
- Considering the parent PID in the calculation would lead to problems, as @ferullo points out.
- The description of the Windows process GUID intrigues me: it seems to require a user to be logged in, which makes sense for processes started by the user. But what about system processes? Does it use the system user's ID for the calculation?
Future growth
- I would not add two fields for now, I would only add one for the unique PID.
- I like Community ID's version (1:) before the hash. This leaves room to improve the algorithm in the future, and isn't onerous in the present. We can simply ignore it until it becomes an issue.
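To illustrate the transparency point in the list above, here is a Python sketch that compares two IDs of the form [machine id]-[pid]-[start time] at a coarser precision. It assumes both sources express the start time in nanoseconds (the millisecond source simply has zeros in the low digits) and that lower-precision sources truncate rather than round; the function names are hypothetical.

```python
from typing import NamedTuple


class ProcessId(NamedTuple):
    machine_id: str
    pid: int
    start_ns: int


def parse_transparent_id(value: str) -> ProcessId:
    """Parse '[machine id]-[pid]-[start time]' with the start time in ns.

    The machine ID may itself contain hyphens (e.g. a GUID), so split
    from the right.
    """
    machine_id, pid, start_ns = value.rsplit("-", 2)
    return ProcessId(machine_id, int(pid), int(start_ns))


def same_process(a: str, b: str, precision_ns: int = 1_000_000) -> bool:
    """Compare two IDs after truncating both start times to precision_ns."""
    left, right = parse_transparent_id(a), parse_transparent_id(b)
    return (
        left.machine_id == right.machine_id
        and left.pid == right.pid
        and left.start_ns // precision_ns == right.start_ns // precision_ns
    )


# Made-up IDs: one source keeps nanoseconds, the other truncates at millis.
nanos = "8f1c2d3e-aaaa-bbbb-cccc-0123456789ab-4242-1580000000123456789"
millis = "8f1c2d3e-aaaa-bbbb-cccc-0123456789ab-4242-1580000000123000000"
other = "8f1c2d3e-aaaa-bbbb-cccc-0123456789ab-4243-1580000000123000000"
```

A hashed ID gives no such option: once the inputs differ in the low digits of the timestamp, the two hashes have nothing in common.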

All of the above only adds questions rather than resolving them, I know :-)

webmat commented 4 years ago

I'm curious if we can uniformly get nanosecond timestamps on process start times, across OSes and agents. Should we go with milliseconds to make this easier to adopt in various situations?

rw-access commented 4 years ago

Windows gives us .1 microseconds. I would prefer to not lose any granularity if possible and if ES has that degree of precision. We can always zero out the LSBs
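A small Python sketch of what that could look like, assuming the Windows value arrives as a FILETIME-style count of 100 ns ticks since 1601-01-01. The helper names are made up for illustration.

```python
# 100 ns ticks between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
FILETIME_UNIX_EPOCH_OFFSET = 116_444_736_000_000_000


def filetime_to_unix_ns(filetime: int) -> int:
    """Convert a Windows FILETIME (100 ns ticks) to Unix nanoseconds."""
    return (filetime - FILETIME_UNIX_EPOCH_OFFSET) * 100


def truncate_ns(timestamp_ns: int, precision_ns: int) -> int:
    """Zero out the low-order part of a timestamp below precision_ns."""
    return timestamp_ns - (timestamp_ns % precision_ns)


# One second after the Unix epoch, expressed as a FILETIME.
one_second = filetime_to_unix_ns(FILETIME_UNIX_EPOCH_OFFSET + 10_000_000)

# Reduce a full-precision timestamp to millisecond granularity.
millis_only = truncate_ns(1_580_000_000_123_456_700, 1_000_000)
```

Keeping the full value at the source and truncating only when comparing means no granularity is lost for consumers that can use it.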

webmat commented 4 years ago

At this time, Logstash offers the fingerprint plugin, which can hash using many algorithms, as well as base64 encode.

As of 7.6, neither Beats nor Elasticsearch have processors that allow for performing arbitrary digests. Neither's scripting engine seems to allow using libraries to perform digests, either.

This is not a blocker for this discussion, but I just wanted to point out this fact. We'd likely want this added to at least one of them, ideally both.

andrewkroh commented 4 years ago

The beats processor for fingerprinting is new in 7.6. See https://www.elastic.co/guide/en/beats/filebeat/7.6/fingerprint.html.

ferullo commented 4 years ago

I like base64([endpoint id]-[pid]-[start time seconds.nanoseconds]). I don't think we should shoot for uniformity across all products, at least to start. As long as uniform length is not needed, giving the unique pid reversibility by the user would let them correlate across products on their own. If that ends up being important to users we can always change the algorithm in future releases in one/all products.

Note:

I think endpoint id is better than machine id because VDI hosts can share a machine id. If we want global uniqueness it's important to consider that -- esp for VMs that are spun up in an already running state before installing endpoint. If we don't need global uniqueness there's no reason to include an endpoint/machine id at all. Notably though, using endpoint id doesn't jibe with all products sharing a unique pid formula.
Including pid is obvious.
If we're going for reversibility, unix seconds.nanoseconds seems a little clearer than using Microsoft time. It's my understanding that doesn't lose any information in the unique pid, though really the time in the unique pid should only be used to de-conflict pid re-use. Other data in the event should be the canonical place the process start time is placed.
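A sketch of that reversible scheme in Python, under a few assumptions: the nanosecond part is zero-padded to 9 digits, the endpoint id is an opaque string that may contain hyphens, and the function names are invented for this example.

```python
import base64
from typing import Tuple


def encode_entity_id(endpoint_id: str, pid: int, start_s: int, start_ns: int) -> str:
    """base64('[endpoint id]-[pid]-[seconds.nanoseconds]')."""
    raw = f"{endpoint_id}-{pid}-{start_s}.{start_ns:09d}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def decode_entity_id(value: str) -> Tuple[str, int, int, int]:
    """Reverse encode_entity_id back into its components."""
    raw = base64.b64decode(value).decode("utf-8")
    # Split from the right so hyphens inside the endpoint id are preserved.
    endpoint_id, pid, start = raw.rsplit("-", 2)
    seconds, nanos = start.split(".")
    return endpoint_id, int(pid), int(seconds), int(nanos)


# Hypothetical endpoint id and process values.
encoded = encode_entity_id("endpoint-01", 4242, 1_580_000_000, 123)
decoded = decode_entity_id(encoded)
```

Because nothing is hashed, a user can decode IDs from two products and compare the pid and start time directly, at the cost of a longer, variable-length string.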

andrewstucki commented 4 years ago

So just to circle back around on what we decided yesterday in our ECS meeting, and what I'll try and PR today, we're just going to start with the "implementation defined" option for now. This will keep us from:

having to settle the question of what goes into these ids, and
having to make any breaking changes in beats in order to support our needs on the endpoint side from a hash collision perspective

I'd love to continue the discussion to see if we can move towards uniformity, but for now I'm going to PR the addition of a new field: process.entity_id. This will be a user-defined unique identifier for a process and can be anything from a sysmon process GUID to a beats-style entity id, to the endpoint implementation. Even with leaving it up to the user to define, it'd still be worthwhile to introduce.

CC: @ferullo

webmat commented 4 years ago

Thanks @andrewstucki, looking forward to your PR :-)

I understand the potential pitfalls of using a machine ID given their reuse in virtualized environments. Personally, I've always considered it bad practice not to fix the machine ID when spinning up VMs.

SIEM decided a while ago to use host.name instead of host.id as the unique identifier. But catering to this common misconfiguration of VMs has led to the following situation: When deploying Endpoint across a fleet of Macs, the SIEM is now dealing with tons of hosts named "macbook-pro" 😄 This is actually totally normal, and would not require any workarounds if we used the machine ID 🙂

In my mind the workaround for dealing with duplicated IDs is to prod the users to do it right. A bit of docs giving pointers for the most common platforms will not only help them with their Elastic deployment, but likely also with other products they're using.

To be clear, using a deployment ID is a good pragmatic approach that works for one product. But if we're going to try to come up with a shared algorithm a la community_id, a tool's deployment ID can't be used.

elastic / ecs

Unique PID Fields and Generation #672
