arctir / proctor

A CLI and libraries acting as a toolkit for introspecting software from source to runtime.
Apache License 2.0

Provide diffs of different process snapshots #13

Open joshrosso opened 1 year ago

joshrosso commented 1 year ago

Proposal

proctor ps should be able to diff 2 different process snapshots. An example of this command's usage could be:

proctor ps diff snap1 snap2

The output data-structure could be the following in its JSON representation (note, there are many ways we could present this).

{
  "4659": {
    "snap1": [
      {
        "CommandPath": "/usr/bin/gopls",
        "ID": 4659,
        "BinarySHA": "323b27bbb7f932d7506292478a452fe8f8b946332a842a713d9961c9bb86f058",
        "CommandName": "gopls",
        "FlagsAndArgs": ""
      }
    ],
    "snap2": [
      {
        "CommandPath": "/usr/bin/gopls",
        "ID": 4659,
        "BinarySHA": "1111111ddsb55551112492478a452fe8f8b946332a842a713d9961c9bb86f058",
        "CommandName": "gopls",
        "FlagsAndArgs": "-debug=debug.log"
      }
    ]
  }
}
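
As a rough sketch of how that output shape could be produced, assuming a snapshot is simply a map from PID to a process record. The Process, Snapshot, and Diff types and the DiffSnapshots function below are hypothetical names for illustration, not plib's actual API, and the SHA values are shortened placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// Process mirrors the fields in the proposed JSON. The real plib type may
// differ; this struct is an assumption for illustration.
type Process struct {
	CommandPath  string
	ID           int
	BinarySHA    string
	CommandName  string
	FlagsAndArgs string
}

// Snapshot maps a PID to the process recorded for it (hypothetical type).
type Snapshot map[int]Process

// Diff maps PID (as a string) -> snapshot name -> processes recorded there.
type Diff map[string]map[string][]Process

// DiffSnapshots returns an entry for every PID whose record differs between
// the two snapshots, including PIDs that appear in only one of them.
func DiffSnapshots(nameA string, a Snapshot, nameB string, b Snapshot) Diff {
	out := Diff{}
	for pid, pa := range a {
		pb, ok := b[pid]
		if ok && pa == pb {
			continue // identical in both snapshots, nothing to report
		}
		entry := map[string][]Process{nameA: {pa}, nameB: {}}
		if ok {
			entry[nameB] = []Process{pb}
		}
		out[strconv.Itoa(pid)] = entry
	}
	for pid, pb := range b {
		if _, ok := a[pid]; !ok {
			out[strconv.Itoa(pid)] = map[string][]Process{nameA: {}, nameB: {pb}}
		}
	}
	return out
}

func main() {
	snap1 := Snapshot{4659: {CommandPath: "/usr/bin/gopls", ID: 4659, BinarySHA: "aaa", CommandName: "gopls"}}
	snap2 := Snapshot{4659: {CommandPath: "/usr/bin/gopls", ID: 4659, BinarySHA: "bbb", CommandName: "gopls", FlagsAndArgs: "-debug=debug.log"}}

	out, err := json.MarshalIndent(DiffSnapshots("snap1", snap1, "snap2", snap2), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keying the outer map by PID keeps the structure close to the proposal; unchanged processes are left out entirely so the output only contains the delta.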

Details

Using plib, proctor process is able to load all processes, and thus creates a cache on the filesystem for all processes. Today this cache is thought of as a singular instance of process information, which is cleared whenever processes are reloaded without the cache. For example, at the time of writing, this is done by using the --refresh-cache flag on proctor commands like proctor ps ls --reset-cache.

However, a cache could also be thought of as a snapshot, which we could persist in some indexable way such that it could be looked up and operated on at a later time. Snapshots could be searched and cleared appropriately. Perhaps by introducing a snap(shot) command?

proctor ps snapshot {create || list || delete}
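
One way the create/list/delete operations could sit on top of the filesystem is sketched below. It assumes each snapshot is stored as a JSON file named after the snapshot in a single directory; the Store type, its methods, and the file layout are all hypothetical and not how proctor's cache works today.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// Process is a trimmed-down, hypothetical process record.
type Process struct {
	ID          int
	CommandName string
	BinarySHA   string
}

// Store persists snapshots as <Dir>/<name>.json (assumed layout).
type Store struct{ Dir string }

func (s Store) path(name string) string { return filepath.Join(s.Dir, name+".json") }

// Create writes the given processes as a named snapshot.
func (s Store) Create(name string, procs []Process) error {
	data, err := json.Marshal(procs)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(s.Dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(s.path(name), data, 0o644)
}

// List returns the names of all stored snapshots, sorted alphabetically.
func (s Store) List() ([]string, error) {
	entries, err := os.ReadDir(s.Dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".json") {
			names = append(names, strings.TrimSuffix(e.Name(), ".json"))
		}
	}
	sort.Strings(names)
	return names, nil
}

// Load reads a named snapshot back so it can be diffed later.
func (s Store) Load(name string) ([]Process, error) {
	data, err := os.ReadFile(s.path(name))
	if err != nil {
		return nil, err
	}
	var procs []Process
	err = json.Unmarshal(data, &procs)
	return procs, err
}

// Delete removes a named snapshot.
func (s Store) Delete(name string) error { return os.Remove(s.path(name)) }

func main() {
	dir, err := os.MkdirTemp("", "proctor-snapshots")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	s := Store{Dir: dir}
	if err := s.Create("snap1", []Process{{ID: 4659, CommandName: "gopls"}}); err != nil {
		panic(err)
	}
	names, err := s.List()
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Using the snapshot name as the file name gives the "indexable" lookup for free, at the cost of having to validate names before using them in a path.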

Questions and food for thought

  1. Is the PID what identifies the object(s) that need diffing?
    1. For example, consider PID 7, which could be 2 completely different processes over time, since processes can be removed and created re-using a PID.
    2. To me, PID still feels like the accurate identifier, as trying to tie other metadata together as an identifier is error prone, and we shouldn't find Linux hosts reusing PIDs too frequently. E.g., until the number in /proc/sys/kernel/pid_max is reached, the PIDs won't be recycled.
  2. How should diff ignore fields work?
    1. I think initially we should just use sensible defaults. Things like memory addresses and the CPU that a process is scheduled on should not show up in the diff.

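
To make the "ignore fields" idea concrete, here is a minimal sketch that compares two process records field by field and skips a default set of volatile fields. The Process struct, the CPU and MemoryAddress field names, and DefaultIgnored are assumptions for illustration; the real record and its defaults may look different.

```go
package main

import (
	"fmt"
	"reflect"
)

// Process is a hypothetical record. CPU and MemoryAddress stand in for the
// kind of volatile data that should not show up in a diff.
type Process struct {
	ID            int
	CommandName   string
	BinarySHA     string
	CPU           int
	MemoryAddress uint64
}

// DefaultIgnored is an assumed "sensible default" set of fields to skip.
var DefaultIgnored = map[string]bool{"CPU": true, "MemoryAddress": true}

// ChangedFields returns the names of fields that differ between a and b,
// in declaration order, skipping any field listed in ignored.
func ChangedFields(a, b Process, ignored map[string]bool) []string {
	va, vb := reflect.ValueOf(a), reflect.ValueOf(b)
	t := va.Type()
	var changed []string
	for i := 0; i < t.NumField(); i++ {
		name := t.Field(i).Name
		if ignored[name] {
			continue
		}
		if !reflect.DeepEqual(va.Field(i).Interface(), vb.Field(i).Interface()) {
			changed = append(changed, name)
		}
	}
	return changed
}

func main() {
	a := Process{ID: 7, CommandName: "gopls", BinarySHA: "aaa", CPU: 1, MemoryAddress: 0x1000}
	b := Process{ID: 7, CommandName: "gopls", BinarySHA: "bbb", CPU: 3, MemoryAddress: 0x2000}
	fmt.Println(ChangedFields(a, b, DefaultIgnored))
}
```

Passing the ignore set as a parameter leaves room for a user-supplied override later without changing the comparison itself.
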
snowandcaffeine commented 1 year ago

Will likely want a schedule for rotation (retain 10) and frequency of snap. Maybe just document examples using cron?

WRT the specific questions, I would imagine scenarios where SHA and Name are also useful. We build the snap based on running state, and someone can diff to see the delta between snap 1 and snap 7, but what specifically changed (SHA, Name, Path, PID) needs to be highlighted. Implications here could be a process restart (PID change), an update/change to the binary (SHA), or a path/name change.

Where this gets interesting is at the cross-system level, where the platform itself could perform this type of operation.

joshrosso commented 1 year ago

Will likely want a schedule for rotation (retain 10) and frequency of snap. Maybe just document examples using cron?

Examples using cron would probably be best. Since proctor is our CLI tool, not a long-running daemon, it'd require something else to do the actual scheduling. On our internal systems/agents for our platform, this is a concern we can handle separately.
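
Whatever ends up doing the scheduling, the "retain 10" part could be a small pruning step run after each snapshot. A minimal sketch, assuming each snapshot carries a name and a creation timestamp; SnapshotInfo and Expired are hypothetical names, not existing proctor code.

```go
package main

import (
	"fmt"
	"sort"
)

// SnapshotInfo is a hypothetical index entry for a stored snapshot.
type SnapshotInfo struct {
	Name        string
	CreatedUnix int64 // creation time in seconds since the Unix epoch
}

// Expired returns the names of the snapshots that fall outside the newest
// `retain` entries, newest-first among the expired ones. The input slice is
// not modified.
func Expired(snaps []SnapshotInfo, retain int) []string {
	if retain < 0 {
		retain = 0
	}
	sorted := append([]SnapshotInfo(nil), snaps...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreatedUnix > sorted[j].CreatedUnix
	})
	if len(sorted) <= retain {
		return nil
	}
	var names []string
	for _, s := range sorted[retain:] {
		names = append(names, s.Name)
	}
	return names
}

func main() {
	var snaps []SnapshotInfo
	for i := 1; i <= 12; i++ {
		snaps = append(snaps, SnapshotInfo{Name: fmt.Sprintf("snap%d", i), CreatedUnix: int64(i * 3600)})
	}
	// With 12 hourly snapshots and retain=10, the two oldest are expired.
	fmt.Println(Expired(snaps, 10))
}
```

A cron entry would then only need to create a snapshot and delete whatever this step reports as expired.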

WRT the specific questions, I would imagine scenarios where SHA and Name are also useful. We build the snap based on running state, and someone can diff to see the delta between snap 1 and snap 7, but what specifically changed (SHA, Name, Path, PID) needs to be highlighted. Implications here could be a process restart (PID change), an update/change to the binary (SHA), or a path/name change.

I thought about this too, but I'm not entirely convinced. The issue is that an executable can be, and often is, used to create multiple processes. This means you'll end up with many "keys" that are the same when doing the diff. Take for example my computer, where I'm typing this message and want to see the processes associated with chromium:

$ proctor ps get --name chromium
+------+----------+----------------------------+------------------------------------------------------------------+
| PID  |   NAME   |          LOCATION          |                               SHA                                |
+------+----------+----------------------------+------------------------------------------------------------------+
| 2589 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2629 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2581 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 4189 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2660 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2860 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2590 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2912 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 4230 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2611 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
| 2592 | chromium | /usr/lib/chromium/chromium | 9a69e492c0927fd48d8ed2bd9d315890324c5570ec3d61540664348a090885e0 |
+------+----------+----------------------------+------------------------------------------------------------------+

Thus, I believe when doing a diff, the PID is actually the correct identifier to compare against. Because, I believe, the question you're trying to answer is "how have processes changed over time?" And in the case of a new instance of a process coming up, that should show up in the diff as "new", not "modified", as correlating it to an old PID may be nearly impossible and very fraught with error.
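
To illustrate what keying on PID would mean in practice, here is a minimal sketch that labels each PID as "new", "removed", or "modified" between two snapshots. The types and the Classify function are hypothetical, and the SHA values are shortened placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// Process is a trimmed-down, hypothetical process record.
type Process struct {
	ID           int
	CommandName  string
	BinarySHA    string
	FlagsAndArgs string
}

// Snapshot maps a PID to the process recorded for it (hypothetical type).
type Snapshot map[int]Process

// Change describes what happened to one PID between two snapshots.
type Change struct {
	PID    int
	Status string // "new", "removed", or "modified"
}

// Classify compares snapshots by PID only. A PID present only in cur is
// "new", one present only in old is "removed", and one present in both with
// different data is "modified". Results are sorted by PID.
func Classify(old, cur Snapshot) []Change {
	var changes []Change
	for pid, p := range cur {
		prev, ok := old[pid]
		switch {
		case !ok:
			changes = append(changes, Change{PID: pid, Status: "new"})
		case prev != p:
			changes = append(changes, Change{PID: pid, Status: "modified"})
		}
	}
	for pid := range old {
		if _, ok := cur[pid]; !ok {
			changes = append(changes, Change{PID: pid, Status: "removed"})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].PID < changes[j].PID })
	return changes
}

func main() {
	old := Snapshot{
		2589: {ID: 2589, CommandName: "chromium", BinarySHA: "9a69"},
		2629: {ID: 2629, CommandName: "chromium", BinarySHA: "9a69"},
	}
	cur := Snapshot{
		2589: {ID: 2589, CommandName: "chromium", BinarySHA: "9a69"},
		4230: {ID: 4230, CommandName: "chromium", BinarySHA: "9a69"},
	}
	for _, c := range Classify(old, cur) {
		fmt.Println(c.PID, c.Status)
	}
}
```

Note that a restarted chromium process shows up as one "removed" and one "new" entry here, with no attempt to correlate the two, which is the behavior argued for above.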

However, perhaps I'm wrong in this thinking and if so, would like to hear different perspective(s).

Where this gets interesting is at the cross-system level, where the platform itself could perform this type of operation.

Do you just mean in our platform? Because yes, this would happen, just at a different level, since we're persisting data over time to a datastore, so the equivalent of "snapshots" is there, just implemented differently.

snowandcaffeine commented 1 year ago

Thus, I believe when doing a diff, the PID is actually the correct identifier to compare against. Because, I believe, the question you're trying to answer is "how have processes changed over time?"

Sure, but PID by its lonesome is somewhat arbitrary, as it doesn't show 'did the binary get updated', 'did the path change', or more generically what exactly changed. Imagine a world where some process is restarted nightly or a node reboots. Between those restarts someone makes a change, but the running PID doesn't pick it up. I might be overcomplicating this, but to me we want to at least get someone who is debugging pointed in the right direction. I'm not seeing how PID does that earnestly.

Do you just mean in our platform? Because yes, this would happen, just at a different level, since we're persisting data over time to a datastore, so the equivalent of "snapshots" is there, just implemented differently.

Yes, specifically we would want to be able to validate the SHA across nodes/clusters to see if anything changed and when.

joshrosso commented 1 year ago

Sure, but PID by its lonesome is somewhat arbitrary, as it doesn't show 'did the binary get updated', 'did the path change', or more generically what exactly changed. Imagine a world where some process is restarted nightly or a node reboots.

Perhaps the question we're dancing around is: what do we want to identify the differences of?

If you're trying to see how processes change over time, the PID would be the best identifier, as it will stay with the process from start to finish. But if we're not diffing the processes themselves, what would we like to diff?

In other words, could you answer for me:

As a user, I'd like diff to help me identify ...

snowandcaffeine commented 1 year ago

  1. Drift over 1-N instances of a process across my environment. Is every copy of mysql the same binary, the same version, and deployed in the same manner?

  2. Ability to track change on a single instance of a process: what changed, when, and ideally who/why (the last bit is a stretch).

I assume 1 = platform; 2 = proctor. However, I'm open to ideas on the right angle of approach.
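
For the first use case (drift across instances), a rough sketch of the check could group process records by name and flag any name that maps to more than one binary SHA. The Process struct with a Host field and the Drift function are hypothetical; nothing here reflects how the platform stores its data.

```go
package main

import (
	"fmt"
	"sort"
)

// Process is a hypothetical record collected from some host.
type Process struct {
	Host        string
	CommandName string
	BinarySHA   string
}

// Drift returns, for every command name seen with more than one distinct
// binary SHA, the sorted list of those SHAs. Names with a single SHA across
// all records are omitted.
func Drift(procs []Process) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, p := range procs {
		if seen[p.CommandName] == nil {
			seen[p.CommandName] = map[string]bool{}
		}
		seen[p.CommandName][p.BinarySHA] = true
	}
	out := map[string][]string{}
	for name, shas := range seen {
		if len(shas) < 2 {
			continue
		}
		for sha := range shas {
			out[name] = append(out[name], sha)
		}
		sort.Strings(out[name])
	}
	return out
}

func main() {
	procs := []Process{
		{Host: "node-a", CommandName: "mysql", BinarySHA: "aaa"},
		{Host: "node-b", CommandName: "mysql", BinarySHA: "bbb"},
		{Host: "node-c", CommandName: "mysql", BinarySHA: "aaa"},
		{Host: "node-a", CommandName: "nginx", BinarySHA: "ccc"},
		{Host: "node-b", CommandName: "nginx", BinarySHA: "ccc"},
	}
	fmt.Println(Drift(procs))
}
```

This keys on name and SHA rather than PID, which is why it fits the cross-node question better than the single-host snapshot diff.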

joshrosso commented 1 year ago

I agree on 1. However, the scope of this issue is to understand introducing this feature to proctor, so I don't believe it's relevant for this convo.

Regarding 2, I can't understand then why PID is inadequate.

I'll set up a meeting so we can discuss deeper and log the output here.