Ingester: are PIDs unique to subjects?

GaloisInc / adapt

ADAPT software for Transparent Computing

BSD 3-Clause "New" or "Revised" License

Ingester: are PIDs unique to subjects? #50

Closed jamescheney closed 8 years ago

jamescheney commented 8 years ago

It is currently not guaranteed that each Subject node has a unique PID (e.g. there is no index / uniqueness constraint on the PID property).

Segmentation appears to find multiple nodes with the same PID; e.g., Dave Archer reported seeing this recently (not sure what the inputs were):

******************************
segment s25493504 pid value 19254
Number of segment elements centered around node with id 25493504 and with pid 19254: 0
******************************
******************************
segment s26939640 pid value 19254
Number of segment elements centered around node with id 26939640 and with pid 19254: 3
{'properties': ['{euid=1000, egid=1000}'], 'id': 53141624, 'gid': ['1000'], 'ident': ['msx9708j+GqjYQEqPCc9cw=='], 'label': 'Agent', 'source': [0], 'userID': ['1000'], 'agentType': [0]}

{'properties': ['{uid=1000, name=firefox, gid=1000, cwd=/home/kexin/Research/BEEP/firefox/firefox-42}'], 'ppid': [19235], 'id': 26939640, 'unitid': [0], 'pid': [19254], 'ident': ['at04ym+cIOGNruxgM3QYlw=='], 'source': [0], 'label': 'Subject', 'subjectType': [0], 'commandLine': ['/home/kexin/Research/BEEP/firefox/firefox-42/firefox-build/dist/bin/firefox -no-remote -profile /home/kexin/Research/BEEP/firefox/firefox-42/firefox-build/tmp/scratch_user']}

{'id': 26419200, 'ident': ['yQUeBmcz6mjFl6TWfSe+zg=='], 'label': 'EDGE_SUBJECT_HASLOCALPRINCIPAL'}

******************************
******************************
segment s26571000 pid value 19254
...

This issue asks to document whether Subject/pid values are (intended to be) unique.

If so, it is suggested to impose this constraint so that it gets checked on ingestion and later phases can assume it (and incidentally this will cause indexing by PID, which seems like a good idea anyway.) This may mean that we have to rename the property for use on the segment node (since reusing this name will violate uniqueness).

If this is not a constraint that the ingester can guarantee (based on incoming CDM data not satisfying a similar constraint) then it would be helpful to document this as well so that we do not rely on it (or get surprised when it is not the case). The segmenter currently does not assume this, and this means that there can be multiple segment nodes with the same associated PID if there are multiple Subject nodes in the raw graph with the same PID.
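One way to check the uniqueness question before imposing a constraint is to count Subject nodes per PID. This is only a minimal Python sketch, not the ingester's or segmenter's code: the `subjects` list and the helper name `duplicate_pids` are hypothetical, and the sample reuses the three node ids and the PID from the segmentation output above.

```python
from collections import Counter

def duplicate_pids(subjects):
    """Return {pid: count} for every PID carried by more than one Subject."""
    counts = Counter(s["pid"] for s in subjects if s.get("pid") is not None)
    return {pid: n for pid, n in counts.items() if n > 1}

# Node ids and PID taken from the segmentation output above.
subjects = [
    {"id": 25493504, "pid": 19254},
    {"id": 26939640, "pid": 19254},
    {"id": 26571000, "pid": 19254},
]

print(duplicate_pids(subjects))
```

An empty result would mean a uniqueness constraint could be imposed on that data set; any entry is a PID that the constraint would reject.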

aperez commented 8 years ago

I remember @jhanley634 mentioning this issue a couple of weeks back. The fact is that the OS can re-issue PIDs. But I don't think two processes with the same PID can be alive at the same time, so the tuple <PID, Time> should be unique.

Given a single host, that is.

aperez commented 8 years ago

Here's an example of that happening: https://github.com/GaloisInc/adapt/blob/master/classifier/phase3/pid_demo.py

Actually, browsing language.md, I see that in CDM a subject can be:

enum SubjectType {
    SUBJECT_PROCESS,
    SUBJECT_THREAD,
    SUBJECT_UNIT,
    SUBJECT_BLOCK,
    SUBJECT_EVENT
}

If I read this correctly, we can have two threads (or two basic blocks) pointing to the same process. In that case, even the tuple I mentioned earlier is not unique. Maybe the uid (?) has to be tracked as well.

Unfortunately at this moment I do not have an example where this happens, though.

berradag commented 8 years ago

I ran a couple of queries on the ta5attack1_units.avro file straight after ingestion, and here is what I got:

gremlin> g.V().has('pid').values('pid').groupCount()
==>[2594:1, 2597:1, 2598:1, 2599:1, 2601:1, 2610:1, 2611:1, 2617:1, 2618:1, 2622:2, 2623:3, 2626:1, 2627:1, 2629:1, 2630:2, 2631:2, 2632:1, 2633:1, 2634:1, 2635:2, 2636:2, 2637:1, 2638:2, 2639:1, 2640:2, 2641:1, 2642:2, 2643:2, 2644:1, 2645:2, 2646:1, 2647:2, 2648:1, 2649:2, 2650:1, 2651:1, 2652:2, 2653:1, 2654:2, 2655:1, 2656:2, 2657:1, 2658:2, 2659:2, 2660:1, 2661:2, 2662:1, 2663:2, 2664:1, 2665:2, 2666:1, 2667:1, 2668:2, 2669:1, 2670:1, 2671:1, 2672:2, 2673:2, 2674:1, 2675:1, 2676:2, 2677:1, 2678:1, 2679:1, 2680:1, 2681:1, 2682:1, 2683:2, 2684:2, 2685:2, 2686:2, 2687:2]

gremlin> g.V().has('pid',2623).valueMap(true)
==>[label:Subject, id:250024, startedAtTime:[1463797860539000], ident:[NrjZbieJxoKO7lEbMiHfpA==], unitid:[0], pid:[2623], source:[0], subjectType:[0], properties:[{uid=0, name=sudo, gid=1000}], ppid:[2622]]
==>[label:Subject, id:262312, ident:[us3/hhW31moDTAyKuejKqA==], unitid:[0], pid:[2623], source:[0], commandLine:[auditctl -D], subjectType:[0], properties:[{uid=0, name=auditctl, gid=0, cwd=/home/kexin/Desktop}], ppid:[2622]]
==>[label:Subject, id:299128, ident:[gKknlhPnJ4Nv1KK8bIf5RA==], unitid:[0], pid:[2623], source:[0], commandLine:[env PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games LD_LIBRARY_PATH= auditctl -D], subjectType:[0], properties:[{uid=0, name=env, gid=0, cwd=/home/kexin/Desktop}], ppid:[2622]]

gremlin> g.V().has('pid',2622).valueMap(true)
==>[label:Subject, id:53376, ident:[TrHVh6rQrPmF4iTGq0zhdw==], unitid:[0], pid:[2622], source:[0], commandLine:[sudo env PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games LD_LIBRARY_PATH= auditctl -D], subjectType:[0], properties:[{uid=1000, name=sudo, gid=1000, cwd=/home/kexin/Desktop}], ppid:[2611]]
==>[label:Subject, id:24624, startedAtTime:[1463797860527000], ident:[gd+4gsC6mxASmh4Fh9XNaA==], unitid:[0], pid:[2622], source:[0], subjectType:[0], properties:[{uid=1000, name=audit.sh, gid=1000}], ppid:[2611]]

It would seem that the PID property is indeed not unique (which also means that a ppid does not identify a unique parent node; e.g., in the example above, any of the three nodes with PID value 2623 may have been generated by either of the two nodes with PID value 2622).

We may be able to differentiate nodes by using multiple segmentation criteria, e.g., PID+Time, but in that case we may need values for the segmentation criteria to be available for all nodes of interest. It would seem, at this point, that the Time property is not available for all Subject nodes.
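To make the availability problem concrete, here is a small Python sketch that splits nodes by whether a (pid, startedAtTime) key could be built for them. The helper name `split_by_time_availability` is hypothetical; the ids and values are copied from the valueMap output above, with the other properties omitted.

```python
def split_by_time_availability(nodes):
    """Partition nodes into those with and without a startedAtTime value."""
    with_time, without_time = [], []
    for node in nodes:
        if node.get("startedAtTime") is not None:
            with_time.append(node)
        else:
            without_time.append(node)
    return with_time, without_time

# Graph ids and values copied from the valueMap output above.
nodes = [
    {"id": 250024, "pid": 2623, "startedAtTime": 1463797860539000},
    {"id": 262312, "pid": 2623},
    {"id": 299128, "pid": 2623},
    {"id": 53376, "pid": 2622},
    {"id": 24624, "pid": 2622, "startedAtTime": 1463797860527000},
]

with_time, without_time = split_by_time_availability(nodes)
print(len(with_time), "nodes can be keyed by (pid, startedAtTime)")
print(len(without_time), "nodes cannot")
```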

jamescheney commented 8 years ago

Sorry, yes, I forgot about the PID recycling issue. There's no reason why PID + PPID (or + UNITID) would be unique either. I think there is a UUID field that isn't currently being populated, at least in the above example. In the above examples, it looks like the multiple processes sharing PIDs are actually different records corresponding to the same process. For example, 2622 has one node with startedAtTime but no commandLine, and another with commandLine but no startedAtTime; 2623 is similar but also has a second node with a different commandLine (but identical suffix), so maybe we (or the ingester) should be fusing the three instead of generating multiple records. It would be good to see what the underlying avro records generating this data are and understand what the ingester is doing with them, so I'm assigning to Tom for comment.

I will try to find someone to ask about this at the PI meeting, but it seems that we may have to live with no uniqueness guarantees in any case.

TomMD commented 8 years ago

@jamescheney Am I understanding correctly that you would like to see the Avro version of nodes with the PID 2622 from ta5attack1_units?

jamescheney commented 8 years ago

That would be helpful, if it is not a lot of work. Any insights you have into the other comments above would also be helpful, but I know you have other things to do.

jhanley634 commented 8 years ago

As a matter of personal preference, I liked the 2015 SPADE records; they recorded raw syscalls, decorated with process environment attributes like uid, cwd, etc. The CDM13 trace makes you work harder to learn the same details.

The Monitored Host guarantees PID uniqueness at any instant, but not over the course of an hour. As a practical matter, a Linux Monitored Host won't be able to burn through 33,000 fork events in less than two minutes. The current trace has only a small number of fork events, so no rollover occurred. (A FreeBSD trace can recycle a PID over a shorter interval since it uses random bits for each new PID, then probes the process table for uniqueness.)

Look for eventType. One of the 2622 vertices you found is used to report fork (10) and exec (9) events:

gremlin> g.V().has('ident', 'gd+4gsC6mxASmh4Fh9XNaA==').in().valueMap(true)
==>[id:80412776, label:EDGE_EVENT_AFFECTS_SUBJECT, ident:[mN1NZzn1f4bW+7U8oekLEg==]]
==>[id:80527544, label:EDGE_EVENT_ISGENERATEDBY_SUBJECT, ident:[QGaTov3emeLM9l1W+8zb/w==]]

gremlin> g.V().has('ident', 'gd+4gsC6mxASmh4Fh9XNaA==').in().in().valueMap(true)
==>[sequence:[38098], id:80077040, label:Subject, startedAtTime:[1463797860527000], ident:[0pZAPfzfYRhInWlNSmIkaw==], source:[0], eventType:[10], subjectType:[4], properties:[{event id=38098}]]
==>[sequence:[38100], id:38961216, label:Subject, startedAtTime:[1463797860527000], ident:[pQDuGF9O7cXWSPuPlT3yog==], source:[0], eventType:[9], subjectType:[4], properties:[{event id=38100}]]

The other 2622 vertex is used to report a setreuid event, which I imagine CDM13 could most closely model with KERNEL_UNKNOWN (26), or better with Agent as below:

gremlin> g.V().has('ident', 'TrHVh6rQrPmF4iTGq0zhdw==').out().valueMap(true)
==>[id:40075344, label:EDGE_SUBJECT_HASLOCALPRINCIPAL, ident:[OKNmO93M1O7WugT3ndojOg==]]

gremlin> g.V().has('ident', 'TrHVh6rQrPmF4iTGq0zhdw==').out().out().valueMap(true)
==>[agentType:[0], gid:[1000], id:80416872, label:Agent, ident:[1uJUd0iy7Ltm4lhC36KgUQ==], source:[0], userID:[1000], properties:[{euid=0, egid=1000}]]

 Cheers,
 JH

jamescheney commented 8 years ago

Thanks. That makes sense; I guess that the ingester is creating multiple nodes corresponding to multiple records mentioning pid 2622. I need to understand CDM better to determine whether we can unambiguously determine whether such references co-refer; if so, then maybe that is something the ingester could/should be doing; if not, then the segmenter or later phases need some kind of duplicate elimination. This seems suboptimal though, since unambiguous information must be available to the TA1 provider at the time of recording. I will try to find time to do a trawl through the ingester code so that I can ask Tom/Erin more intelligent stupid questions.

TomMD commented 8 years ago

I am not optimistic that it's valuable to build what is basically avroknife as a one-off tool for internal use. We can either learn avroknife, or perhaps we can make the ingester (via an additional flag) include the binary or JSON version of the original CDM as an attribute on each node. Thoughts?

e.g. 2622 has one node with startedAtTime but no commandLine, and another with commandLine but no startedAtTime; 2623 is similar but also has a second node with a different commandLine (but identical suffix), so maybe we (or the ingester) should be fusing the three instead of generating multiple records.

Ingestd is just a stream of CDM to the database with schema normalization. The above request requires ingestd to either buffer data unbound or perform a bit of a dance on insertion - lookup, resolution, and mutation to insert each subject node. If we want to combine nodes to improve the quality of our data we should revive the pattern extractor - which would traverse the graph edges instead of looking up each PID - discovering related subject nodes in a robust manner. Having found subjects which are related more strongly than the unreliable PID, PE could then make a higher level 'Process' node that includes the union of the data from all the Subject nodes.

More generally, I hope we can view the Adapt pipeline as continually writing layers of abstraction on top of the prior data. Ingest is just providing raw data from TA1 in an expected form. PE could provide a higher level, and TA1 independent, abstraction of this data.
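As a rough illustration of the 'Process' node idea, the Python sketch below unions the fields of Subject nodes that are assumed to have already been found related by traversing graph edges. It is not the pattern extractor: the function name `make_process_node`, the `members` field, and the flat dict layout are all hypothetical. The sample values come from the two pid-2622 nodes shown earlier, with most properties omitted.

```python
def make_process_node(subjects):
    """Union the fields of related Subject nodes into one 'Process' record.

    Conflicting values are kept side by side rather than overwritten.
    """
    process = {"label": "Process", "members": [s["ident"] for s in subjects]}
    for subject in subjects:
        for key, value in subject.items():
            # Skip bookkeeping fields and missing values.
            if key in ("ident", "label") or value is None:
                continue
            seen = process.setdefault(key, [])
            if value not in seen:
                seen.append(value)
    return process

# The two pid-2622 Subject nodes from the queries above, abbreviated.
related = [
    {"ident": "gd+4gsC6mxASmh4Fh9XNaA==", "label": "Subject", "pid": 2622,
     "ppid": 2611, "name": "audit.sh", "startedAtTime": 1463797860527000},
    {"ident": "TrHVh6rQrPmF4iTGq0zhdw==", "label": "Subject", "pid": 2622,
     "ppid": 2611, "name": "sudo", "cwd": "/home/kexin/Desktop"},
]

process = make_process_node(related)
print(process)
```

Keeping every distinct value, instead of picking one, leaves the decision about conflicting names or command lines to later phases.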

TomMD commented 8 years ago

@jamescheney For example, placing the .avro in a tmp folder and executing:

avroknife tojson local:///home/vagrant/tmp/ | egrep 'pid": 2622' > /tmp/foo

Yields a foo file of (after pretty printing via jq) :

[
  {
    "CDMVersion": "13",
    "datum": {
      "properties": {
        "name": "audit.sh",
        "uid": "1000",
        "gid": "1000"
      },
      "pInfo": null,
      "exportedLibraries": null,
      "importedLibraries": null,
      "cmdLine": null,
      "uuid": "grjfgRCbusAFHpoSaM3Vhw==",
      "type": "SUBJECT_PROCESS",
      "pid": 2622,
      "ppid": 2611,
      "source": "SOURCE_LINUX_AUDIT_TRACE",
      "startTimestampMicros": 1463797860527000,
      "unitId": 0,
      "endTimestampMicros": null
    }
  },
  {
    "CDMVersion": "13",
    "datum": {
      "properties": {
        "name": "sudo",
        "uid": "1000",
        "gid": "1000",
        "cwd": "/home/kexin/Desktop"
      },
      "pInfo": null,
      "exportedLibraries": null,
      "importedLibraries": null,
      "cmdLine": "sudo env PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games LD_LIBRARY_PATH= auditctl -D",
      "uuid": "h9WxTvms0KrGJOKFd+FMqw==",
      "type": "SUBJECT_PROCESS",
      "pid": 2622,
      "ppid": 2611,
      "source": "SOURCE_LINUX_AUDIT_TRACE",
      "startTimestampMicros": null,
      "unitId": 0,
      "endTimestampMicros": null
    }
  },
  {
    "CDMVersion": "13",
    "datum": {
      "properties": {
        "name": "sudo",
        "uid": "0",
        "gid": "1000"
      },
      "pInfo": null,
      "exportedLibraries": null,
      "importedLibraries": null,
      "cmdLine": null,
      "uuid": "btm4NoLGiScbUe6OpN8hMg==",
      "type": "SUBJECT_PROCESS",
      "pid": 2623,
      "ppid": 2622,
      "source": "SOURCE_LINUX_AUDIT_TRACE",
      "startTimestampMicros": 1463797860539000,
      "unitId": 0,
      "endTimestampMicros": null
    }
  },
  {
    "CDMVersion": "13",
    "datum": {
      "properties": {
        "name": "env",
        "uid": "0",
        "gid": "0",
        "cwd": "/home/kexin/Desktop"
      },
      "pInfo": null,
      "exportedLibraries": null,
      "importedLibraries": null,
      "cmdLine": "env PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games LD_LIBRARY_PATH= auditctl -D",
      "uuid": "liepgIMn5xO8otRvRPmHbA==",
      "type": "SUBJECT_PROCESS",
      "pid": 2623,
      "ppid": 2622,
      "source": "SOURCE_LINUX_AUDIT_TRACE",
      "startTimestampMicros": null,
      "unitId": 0,
      "endTimestampMicros": null
    }
  },
  {
    "CDMVersion": "13",
    "datum": {
      "properties": {
        "name": "auditctl",
        "uid": "0",
        "gid": "0",
        "cwd": "/home/kexin/Desktop"
      },
      "pInfo": null,
      "exportedLibraries": null,
      "importedLibraries": null,
      "cmdLine": "auditctl -D",
      "uuid": "hv/NumrWtxWKDEwDqMrouQ==",
      "type": "SUBJECT_PROCESS",
      "pid": 2623,
      "ppid": 2622,
      "source": "SOURCE_LINUX_AUDIT_TRACE",
      "startTimestampMicros": null,
      "unitId": 0,
      "endTimestampMicros": null
    }
  }
]
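Note that the pattern 'pid": 2622' also matches the text '"ppid": 2622', which appears to be why the three pid-2623 records are included. If only records whose own pid matches are wanted, a small Python filter can do it. This is a sketch that assumes the JSON dump contains one record per line; the helper name `subjects_with_pid` is hypothetical, and the two sample records are abbreviated from the output above.

```python
import json

def subjects_with_pid(lines, pid):
    """Yield the datum of each JSON record whose own pid equals `pid`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Treat a missing or null datum as an empty record.
        datum = json.loads(line).get("datum") or {}
        if datum.get("pid") == pid:
            yield datum

# Two records abbreviated from the output above, one JSON object per line.
sample = [
    '{"CDMVersion": "13", "datum": {"uuid": "grjfgRCbusAFHpoSaM3Vhw==", "pid": 2622, "ppid": 2611}}',
    '{"CDMVersion": "13", "datum": {"uuid": "btm4NoLGiScbUe6OpN8hMg==", "pid": 2623, "ppid": 2622}}',
]

for datum in subjects_with_pid(sample, 2622):
    print(datum["uuid"], datum["pid"], datum["ppid"])
```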

jamescheney commented 8 years ago

Ingestd is just a stream of CDM to the database with schema normalization. The above request requires ingestd to either buffer data unbound or perform a bit of a dance on insertion - lookup, resolution, and mutation to insert each subject node. If we want to combine nodes to improve the quality of our data we should revive the pattern extractor - which would traverse the graph edges instead of looking up each PID - discovering related subject nodes in a robust manner. Having found subjects which are related more strongly than the unreliable PID, PE could then make a higher level 'Process' node that includes the union of the data from all the Subject nodes.

More generally, I hope we can view the Adapt pipeline as continually writing layers of abstraction on top of the prior data. Ingest is just providing raw data from TA1 in an expected form. PE could provide a higher level, and TA1 independent, abstraction of this data.

Thanks, that explanation is (at a high level) all I was looking for: the underlying CDM data defines two process nodes with (as far as I can see) no obvious way to tell that they are really "the same" without heuristics that might be OS-specific.

It would be helpful to have a human-readable explanation of the mapping, specifically how typical avro data are "serialized" to nodes, edges and properties (i.e. not Haskell code since not everyone is fluent in that). But this is probably something I can work out for myself.

I was not suggesting we need a new tool to serialize the avro data; wheel-reinvention is deprecated. Having some way to link graph data back to the source avro (or reconstruct it) might be helpful but the avroknife + grep combo you suggest above is plenty for now.

Unassigning Tom but leaving issue open in case there is any more discussion.

jamescheney commented 8 years ago

We had a conversation with Ashish about this today.

For TRACE / SPADE data, some processes (and presumably other entities) are added to the trace because their existence is inferred (typically because the process started before tracing started), others because they are directly observed.

If inferred, there is no startedAtTime (i.e. it is null) and the PID (and perhaps other fields) are "inferred" (i.e. guessed from context). The processes with pid=2622 and 2623 and startedAtTime=null in the above trace are examples of this.

If both PID and startedAtTime are present then they should uniquely identify the process (in the absence of units; if those are present then they need to be part of the unique ID too).
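The rule above could be written down as a key function that refuses to produce a key when either field is missing. This is a Python sketch of that reading of the conversation only; the name `process_key` is hypothetical, and the property names follow the graph output earlier in this thread.

```python
def process_key(subject):
    """Return (pid, startedAtTime, unitid), or None when no unique key exists.

    A missing pid or startedAtTime means the record cannot be identified
    this way, so callers must not merge on PID alone.
    """
    pid = subject.get("pid")
    started = subject.get("startedAtTime")
    if pid is None or started is None:
        return None
    # unitid is included because units, when present, are part of the identity.
    return (pid, started, subject.get("unitid"))

observed = {"pid": 2622, "startedAtTime": 1463797860527000, "unitid": 0}
no_start = {"pid": 2622, "unitid": 0}

print(process_key(observed))
print(process_key(no_start))
```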

Ashish also seemed open to the idea of somehow marking records / fields that contain "inferred"/"uncertain" data. In my view, they should instead be marked as null or "unknown", but perhaps CDM requires a nonnull PID value.

For now, the current behavior seems sensible: we don't merge the segments of different processes with the same PID and we probably shouldn't. We do need to keep in mind that some of the PIDs are inaccurate. So, closing, but I also added a comment to AD-107 to remind us to keep "null" in mind when we work on segmenting by multiple attributes.

jamescheney commented 8 years ago

Here is an update from Ashish, which indicates that the actual reason for this is not inference of PIDs, but "execve" reusing PIDs:

When we spoke in person about these examples, I was led astray by the ‘null’ starting time of the second (and third) subject vertices with the same process identifier. I suggested that these vertices may be “inferred”, rather than “observed”. In fact, we have previously eliminated reporting any “inferred” elements. (So you do not need to perform any processing to filter them out.)

In earlier data (of which this must be an example), the starting time was not propagated to an invoked process. This has subsequently been addressed.

These examples appear to derive from the fact that the Linux execve() call results in the new process having the same process identifier as that of the one in which the call occurred. So in the first example, the ‘audit.sh’ script was running with process identifier 2622 and it would have invoked ‘sudo’. Similarly, in the second example, ‘sudo’ ran with process identifier 2623, invoking ‘env’, from which ‘auditctl’ was run.

I am not sure if the different Subject nodes with the same PID are linked by appropriate edges / events explaining the execve behavior. If so, then this should be visible in segmentation (if the radius is large enough). I'm not sure what we should do if not, because then there may be no way to disambiguate between duplicate PID nodes with no start time resulting from execve or from PID recycling. Hopefully this will not be a major issue for the first engagement, anyway.

davearcher commented 8 years ago

I’m going to add this issue to the list to be sent to all TA1s from TA2s. Specifically, seems like we should ask that subject nodes of the same type, and with the same PID, should always be connected by something like:

Subject1 — Edge_Event_GeneratedBy_Subject — execve event — Edge_Event_Affects_Subject — Subject2

where Subject1 is the parent and Subject2 is the child spawned by execute.

Make sense?

David W. Archer, PhD Principal Investigator Galois, Inc. 421 SW 6th Avenue, Suite 300 Portland, OR 97204 email: dwa@galois.com mobile: 503-701-0235

TomMD commented 8 years ago

On Thu, Aug 4, 2016 at 12:44 PM, Dave Archer notifications@github.com wrote:

I’m going to add this issue to the list to be sent to all TA1s from TA2s. Specifically, seems like we should ask that subject nodes of the same type, and with the same PID, should always be connected

I know what you mean, but the actual ask is more nuanced - we don't want unrelated processes that happen to share the PID to be connected. Worth mentioning so we don't get caught in an unproductive e-mail loop with TA1 groups.

-Thomas

jamescheney commented 8 years ago

Yes, I would suggest something like "if processes share a PID due to immediate reuse such as Linux's execve, please provide an explicit link so that we can distinguish this from PID recycling". I haven't had a chance to look at actual data to see whether the execve event is already there in the TRACE data (been traveling). It may well already be there; I'll write to Ashish to ask whether he thinks we should be seeing it.

In any case it would be good to be clear whether we can expect this to be reflected in the data or not.

davearcher commented 8 years ago

Done. I’ll send them this note right now. /d

jamescheney commented 8 years ago

Ashish's response:

When an execve() is encountered in the Linux Audit log, an EVENT_EXECUTE vertex (and appropriate edges from / to relevant Subject vertices) will be emitted in the CDM data.

So we should be following the EVENT_EXECUTE edges. PIDs may not be unique but duplicates resulting from execve should be linked to the generating process.
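A sketch of what following those links could look like, in Python. It assumes the EVENT_EXECUTE vertices and their edges have already been reduced to (source ident, target ident) pairs, which is a simplification of the CDM graph; the function name `group_by_execve`, the two links, and the fourth "recycled" subject are hypothetical. The first three idents are the pid-2623 nodes (sudo, env, auditctl) from the queries above.

```python
from collections import defaultdict

def group_by_execve(subjects, execute_links):
    """Group Subject idents that share a PID and are joined by execute links."""
    parent = {s["ident"]: s["ident"] for s in subjects}
    pid_of = {s["ident"]: s["pid"] for s in subjects}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in execute_links:
        # Only merge known subjects that share a PID, so unrelated processes
        # that merely reuse a PID stay separate.
        if src in parent and dst in parent and pid_of[src] == pid_of[dst]:
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for s in subjects:
        groups[find(s["ident"])].append(s["ident"])
    return sorted(groups.values())

subjects = [
    {"ident": "NrjZbieJxoKO7lEbMiHfpA==", "pid": 2623},  # sudo
    {"ident": "gKknlhPnJ4Nv1KK8bIf5RA==", "pid": 2623},  # env
    {"ident": "us3/hhW31moDTAyKuejKqA==", "pid": 2623},  # auditctl
    {"ident": "hypothetical-recycled", "pid": 2623},     # made up for illustration
]
# Hypothetical links: sudo executes env, env executes auditctl.
execute_links = [
    ("NrjZbieJxoKO7lEbMiHfpA==", "gKknlhPnJ4Nv1KK8bIf5RA=="),
    ("gKknlhPnJ4Nv1KK8bIf5RA==", "us3/hhW31moDTAyKuejKqA=="),
]

groups = group_by_execve(subjects, execute_links)
print(groups)
```

Subjects that share a PID but end up in different groups are then candidates for PID recycling (or for missing execute events), rather than being merged automatically.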