Velocidex / velociraptor

Feature request: import offline collector results into existing server datastore #718

chris-counteractive closed this issue 3 years ago

chris-counteractive commented 3 years ago

Currently, results created using offline (or "stand-alone") collectors cannot be integrated with results in an existing server datastore. For certain use-cases it would be very useful to treat these results as "first-class" within the GUI, opening them up to analysis (e.g., with notebooks), reporting, and export, the same as their online peers.

Put another way: treat offline-collected clients the same as the rest, just with a very high-latency connection.

notional story

There are design decisions involved, and edge cases to be solved, but this might be within reach after a brief discussion on Discord. Consider the following possible workflow as a start to the conversation:

  1. user creates a collector using the "build offline collector" wizard
  2. in addition to whatever artifacts the user chooses, the collector always collects a variant of Generic.Client.Info with an additional optional parameter, ClientID
    1. if ClientID is unspecified (the default), a new, unique Client ID will be generated as if this were a "normal" online client, and will be used on import to create the requisite metadata to display it in the GUI as if it were an online client.
    2. if ClientID is specified, the results will be merged with the existing client on import.
    3. Bonus: UI that allows the user to search and select ClientID from among existing clients (e.g., by hostname).
    4. Bonus: alternate path to the collector wizard with this pre-populated, via a button from an existing client's information page or search listing, "create collector for this client"
  3. user runs the collector on a system. The results will be stored in any supported location (zip, s3, etc.)
  4. on the server, user imports the data, either by:
    1. running a new CLI command like velociraptor import --datastore /path/to/datastore --collector-results /path/to/results.zip (edit: or just specify a server config rather than datastore location, whatever is most consistent) ... with corresponding options for specifying cloud collector locations (see the sketch after this list), or
    2. navigating to a new GUI page using a new button (perhaps right next to the collector creation button), where they can select a results zip file or s3 bucket or other supported location
  5. the client shows up as offline in the GUI, but otherwise is identical to any other system in the datastore
  6. import time could be tracked as "last connected" time, or similar, treating the offline collection process as a very high-latency network connection ...
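
For concreteness, step 4.1 might look something like the sketch below - every flag here comes from the proposal above, not an implemented interface, and the s3:// form is a guess:

# hypothetical CLI from step 4.1 (proposed flags, not an implemented interface)
velociraptor import --datastore /path/to/datastore --collector-results /path/to/results.zip

# or pointing at a cloud location instead of a local zip (s3:// form is a guess)
velociraptor import --datastore /path/to/datastore --collector-results s3://bucket/results.zip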

edge cases

  1. user runs an offline collector on a system where a collector client is already running. Repurpose the "running system" ID? Do nothing outside the normal workflow ("it's just another process ...")?
  2. user wants to install online agent after having run one or more offline collections, and wants to merge them.
  3. multiple collections are run and imported as if on different endpoints, and later a user finds out they're the same endpoint. Merge them? (This could be nice for online situations too, when for various reasons a system ends up with multiple client IDs from multiple velociraptor installs or runs.) Do nothing?

Thanks for the discussion!

scudette commented 3 years ago

Just pushed #761, which implements a new command-line option, "import". It must be run on the server at the moment:

export VELOCIRAPTOR_CONFIG=....
velociraptor import --hostname XXX Collection-DESKTOP-25CK4TB-2020-11-20_00_34_23_-0800_PST.zip --create

This will create a new host with hostname XXX (or use the first client with that hostname) and upload the offline collection to it. You should be able to see the collection with a new flow id in the normal flows menu.

Please give it a try and see if it works for your use case.

chris-counteractive commented 3 years ago

Works like a charm, sir - thank you very much, it gets the job done! I like the option to specify either a client ID or a hostname, and I like that it works with "legacy" collections (i.e., it doesn't require a particular artifact to have been in the bundle). This definitely meets our basic need: if we're working a response or hunt and there are systems that have been pulled offline for containment, say, we can collect offline and roll them into the overall data without separate procedures for the collected zips. Marvelous.

Some food for thought as you decide how/whether to keep enhancing this:

  1. in our use-case we often don't know offline hostnames in advance. We might say "run this on all the boxes you already disconnected," and get a bunch of zips back. In the current configuration the hostname is embedded in the zip file name (.\Collection-myhostname-2020-11-20_10_15_18_-0600_CST.zip), so it's straightforward to get it that way, but that feels a bit brittle - the user might rename the file, etc. Not sure it's worth the extra effort (see above - this works fine :)), but you could imagine pulling this from the collection itself if available (e.g., "if you see Generic.Client.Info, use the data from that") or even pulling it from the filename (e.g., "if the filename matches this regex, treat this capture group as the hostname" - see the sketch after this list), if the --hostname flag isn't present. (edit: this also gives some protection against fat-finger typos - which are sometimes just annoying, but can sometimes collide with another existing host, e.g., when folks use hostnames with serial numbers or incrementing IDs)
  2. When you import this way, the "Last Active" metadata isn't set, or at least isn't displayed: [screenshot] Again, not a biggie, but especially for artifacts collected offline it'd be nice to know when they were collected vs. just when they were imported (which I think is captured under the Created timestamp). It also aligns with the idea of making these collections as indistinguishable from "normal" collections as possible.
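
A minimal sketch of the filename idea in item 1, assuming the default Collection-<hostname>-<timestamp>.zip naming scheme (the regex and the fallback behavior are illustrative, not an implemented feature):

# sketch: derive the hostname when --hostname is omitted (hypothetical behavior);
# the regex assumes the default collector naming scheme shown above
f="Collection-myhostname-2020-11-20_10_15_18_-0600_CST.zip"
host=$(echo "$f" | sed -E 's/^Collection-(.+)-[0-9]{4}-[0-9]{2}-[0-9]{2}_.*/\1/')
echo "$host"   # -> myhostname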

Thanks again, I'm always knocked out by your responsiveness.

clayscode commented 3 years ago

The offline collections don't seem to be viewable at the moment for me. When I click on the flow id, it just comes up blank: [screenshot]

scudette commented 3 years ago

@clayscode it seems that it failed to recognize any artifacts in your zip file (the artifacts column is empty). Did the zip file contain custom artifacts? Was the importing binary run with the server config (so it can find the custom artifact definitions)?

scudette commented 3 years ago

@chris-counteractive Velociraptor usually treats the hostname as an identifier that is indexed so we can search on it easily to find the client id - so it doesn't necessarily have to be an actual DNS name. You can always specify --hostname "Bobs Machine" just as long as you can search for it later. The client id just groups related collections under what we believe is a unique machine - but it may in fact be that two client ids are really the same box, and we have no way to know that.

It depends how you want to manage the collection - but Velociraptor does not really make any assumptions that collections are related to each other - that is open for interpretation. So it might not make a lot of difference if the real client id is the same as the offline client id. I think the main advantage of this feature is being able to use the notebook to post-process collected results, so we don't have to resort to external JSON manipulation tools like jq or miller. For example, we can collect the MFT, post-process it, filter it, remove some columns, and re-export a smaller CSV of relevant data simply by importing the offline collection, accessing the flow notebook (which is created automatically for every flow collection), adding some WHERE filters and column specifications (or using the GUI to show/hide columns), and then clicking the export-to-CSV button.
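
For contrast, a sketch of the external post-processing the flow notebook makes unnecessary (the result-set file and column names here are illustrative; Velociraptor result sets are JSON-lines, which jq reads natively):

# filter rows, keep two columns, and emit CSV with jq (illustrative names)
jq -r 'select(.FullPath | test("Users")) | [.FullPath, .Size] | @csv' Windows.NTFS.MFT.json > filtered.csv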

Perhaps a related but different feature is to merge two client ids together - so if it becomes apparent later that the clients are somehow the same, we can just merge them into the same client id to avoid confusion.

As for the last active metadata - it is not filled because we have no idea when the collection was actually taken. I didn't want to add a hard requirement for a metadata file to be added to the collection, but maybe make that optional? That way the high-level metadata can be added at collection time and just extracted at import time if possible.

What information do you think we would like to add? I was thinking of:

  1. Collection time
  2. Hostname
  3. Username that ran the collection (but this is usually administrator so maybe not interesting?)
  4. Some AD info like domain name, hostname from AD etc.

From a design POV we can simply create an additional artifact that will be collected automatically with every offline collection - then the importer can just look for that artifact and use it to populate the flow data.
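
As a sketch of what such an artifact could gather (the artifact itself is hypothetical; info() and timestamp() are standard VQL, shown here via the query subcommand):

# sketch: the kind of metadata an auto-collected artifact might record
velociraptor query "SELECT timestamp(epoch=now()) AS CollectionTime, Hostname, Fqdn FROM info()"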

clayscode commented 3 years ago

Does the server need to know about the custom artifact definitions? I assumed it just grabbed the filename or something and used that as the artifact name. The collection is a nested artifact, with the top-level artifact as the folder name and the sub-artifacts as the JSON files (e.g. Custom.Example.Artifact/Custom.SampleArtifact.json). If I need to rename things/put artifacts in their own folders I can do that.

scudette commented 3 years ago

Currently there is no structure in the offline collector zip - we just have a bunch of files; some can be artifact result sets and some can be uploaded files. The importer uses the name to figure out if a file is an upload or an artifact result - this is why it needs to recognize the names.

We generally want to have a link between the result-set JSON and the artifact that generated it - the GUI can use it for annotating column types (like timestamps etc.). So we probably don't want to blindly load result sets without knowing the artifact that generated them.

Is the issue that you are trying to import a collection produced by a different installation of Velociraptor without the custom artifacts? Or do you have trouble importing custom artifacts that should be recognized?

clayscode commented 3 years ago

So I've added my custom artifacts to the velociraptor server but it's still not recognizing the offline collection. I imagine the issue is that my artifact looks like this:

sources:
  - name: Example
    queries:
      - SELECT *, uuid() AS UUID FROM Artifact.Custom.Example()

scudette commented 3 years ago

That's interesting - the artifact you describe has a named source, which changes the way it is written to the zip. I tested by creating a collector collecting Generic.Client.Info, which also has named sources. The produced zip file contains the following:

$ unzip -l /shared/Collection-DESKTOP-25CK4TB.localdomain-2020-11-23_06_58_13_-0800_PST.zip
Archive:  /shared/Collection-DESKTOP-25CK4TB.localdomain-2020-11-23_06_58_13_-0800_PST.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      313  1980-00-00 00:00   Generic.Client.Info/BasicInformation.json
     1054  1980-00-00 00:00   Generic.Client.Info/Users.json
---------                     -------
     1367                     2 files

and importing it works correctly.

I then customized the artifact to create Custom.Generic.Client.Info, repeated the process, and it worked as well: [screenshot]

I imported the collection using

$ ./output/velociraptor-v0.5.2-linux-amd64 --config ~/server.config.yaml import --hostname XXX /shared/Collection-DESKTOP-25CK4TB.localdomain-2020-11-23_07_04_42_-0800_PST.zip
Importing zip file /shared/Collection-DESKTOP-25CK4TB.localdomain-2020-11-23_07_04_42_-0800_PST.zip into client id C.9ca8a7c498264e89
Filename Custom.Generic.Client.Info/BasicInformation.json
Filename Custom.Generic.Client.Info/Users.json

Can you please attach the output of the import command?

chris-counteractive commented 3 years ago

@scudette thanks for the follow-up! I totally understand velociraptor's not making any guarantees with respect to hostname (it just treats it as a label) or even client ID, and I'm completely content with the functionality as it stands - using it for unified notebook analysis is certainly a key driver. It's also nice in our use-case to be able to use the velociraptor server as the central store for collected "raw" artifacts (e.g., from KapeTriage), and this new import feature will keep that much more consistent and organized.

It perhaps wasn't clear in my edited comment, but the "automated pulling of hostname" idea was less about changing the semantics or guarantees of hostnames in VR, more about avoiding inadvertent collisions when doing manual entry. We've had cases where systems are named with the last few digits of their service tag number, for example, and if there's both a financehost0105 and a financehost0IO5, pulling it from the filename or an optional collected artifact would help avoid accidentally associating one host's offline collector results with the other.

Merging does seem to be a feature that would render most of these decisions lower-impact: if you can ex post facto decide to arbitrarily combine various collections, that empowers the user to solve a lot of the edge cases. Probably a more significant undertaking though - not sure the juice is worth the squeeze, but it would be nice.

I didn't want to add a hard requirement for a metadata file to be added to the collection, but maybe make that optional? That way the high-level metadata can be added at collection time and just extracted at import time if possible.

Totally sensible, I like keeping it backwards compatible with previous offline collections. But if the data's there, it'd be nice to use it 😃

From a design POV we can simply create an additional artifact that will be collected automatically with every offline collection - then the importer can just look for that artifact and use it to populate the flow data.

Absolutely, yes sir - that's what I had in mind in the notional story at the top of this thread. I think Generic.Client.Info already gets you most of the way there - it has most of the items you suggest, though not the collection time metadata.

Speaking of collection time metadata, I was reviewing some previous test imports and I noticed some wacky times in the logs (note the date, 52858-08-08):

[screenshot: import logs with dates in the year 52858]
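
(A hedged guess, not stated in the thread: a date near the year 52858 is what a 2020-era millisecond epoch looks like when parsed as seconds - roughly 1.6e12 seconds is about 50,900 years after 1970.)

# illustrative check with GNU date: 1605870000000 ms is 2020-11-20, but read
# as seconds it lands near the year 52858 (needs 64-bit time_t)
date -ud @1605870000000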

I can open a separate issue for that if you like. Thanks again!

clayscode commented 3 years ago

Hmm no dice. Will try the latest CI build.

velociraptor --config /etc/velociraptor/server.config.yaml import --hostname TestHost Collection.zip 
Importing zip file Collection.zip into client id C.a7297848191d149b
Filename Custom.Test/ClientInfoA.json
Copying file Custom.Test/ClientInfoA.json -> /clients/C.a7297848191d149b/collections/F.BUUI1M6H0HGEM/uploads/file/Custom.Test/ClientInfoA.json
Filename Custom.Test/ClientInfoB.json
Copying file Custom.Test/ClientInfoB.json -> /clients/C.a7297848191d149b/collections/F.BUUI1M6H0HGEM/uploads/file/Custom.Test/ClientInfoB.json

clayscode commented 3 years ago

Ah, it's working now after I created the artifact on the server. My other collection refuses to import though, even though I imported my artifacts to the server... [screenshot]

Interesting - after recreating the artifacts in the GUI instead of just loading them with /usr/bin/velociraptor --config /etc/velociraptor/server.config.yaml frontend -v --definitions=/artifacts, it recognizes my artifacts now.

scudette commented 3 years ago

Ah that makes sense - if you keep your artifact definitions in another directory, you will need to load them when importing the zip file (it is the importing process that needs to learn about all the definitions). In that case, just specify the --definitions flag to the import command as well.
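
For example, something like the following (a sketch mirroring the frontend invocation above; the paths and hostname are illustrative):

# sketch: point the import command itself at the custom artifact directory
velociraptor --config /etc/velociraptor/server.config.yaml import --definitions=/artifacts --hostname TestHost Collection.zip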

clayscode commented 3 years ago

Hmm, that still doesn't seem to be working. Same issue with it coming up blank in the server even though I'm specifying my definitions folder on import.

scudette commented 3 years ago

Ah, good point - thanks for testing it. It should work in the latest CI build (and will be in 0.5.3).

scudette commented 3 years ago

I am going to close this issue since the basic capability is there - please open a new issue if we need to improve it more.