parse_evtx does not create Message column when parsing Windows PowerShell.evtx

chris-counteractive commented 4 years ago

summary: parse_evtx does not create a Message column when parsing C:\Windows\System32\winevt\Logs\Windows PowerShell.evtx

environment:

Microsoft Windows [Version 10.0.19041.572]
velociraptor v0.5.1-1, latest windows release

steps to reproduce:

run velociraptor gui on a windows machine
create a notebook
run the following query and note it produces three columns (System, EventData, and Message):
```
select * from parse_evtx(filename="C:\\Windows\\System32\\winevt\\Logs\\Security.evtx")
```
this works for built-ins and 3rd-party logs like Sysmon too. 👍

run the following query and note it only produces two columns (System and EventData)

select * from parse_evtx(filename="C:\\Windows\\System32\\winevt\\Logs\\Windows PowerShell.evtx")

impact:

without this field, events from this log are filtered out in the where clause of other artifacts like MessageHunter
it's a barrier to migrating from tools that parse that reliably, like winlogbeat

analysis

Haven't had much time to dive deep or really narrow it down.

Thought maybe it could be something simple like spaces in the filename (not many standard evtx files have them), but other space-y paths are fine (select * from parse_evtx(filename="C:\\Windows\\System32\\winevt\\Logs\\Microsoft-Windows-Windows Firewall With Advanced Security%4Firewall.evtx") works as expected).

That log's key path is also a bit odd in that the channel and provider key names are different ("Windows PowerShell" and "PowerShell"), but no idea if that matters:

I noticed there's only a Security.evtx test case, this might be a good one to add. Could be this happens with other logs too, though this was the only one I've seen in happen with.

Sorry for not having a better root-cause, but you're usually lightning at sorting that out anyway! Happy as always to help debug, or to revise if you can't reproduce it. Thanks!

scudette commented 4 years ago

As always thanks for the detailed analysis and reporting. The issue is that the powershell event log provider does not have a GUID for this provider, just a name. Normally providers have both but in this case they didn't register a GUID. The code tries to get the GUID but then erroneously quits when there is no GUID as you can see here:

https://github.com/Velocidex/velociraptor/blob/44748662a7583710fa48131e9c97754fe86cc76a/vql/parsers/event_logs/evtx_windows.go#L32-L35

But this is still ok because there is a provider name to go with. Removing that error handling fixes the issue because we can still resolve the message dll with the name alone.

I will submit a fix shortly

scudette commented 4 years ago

The other issue you touched on is the MessageHunter artifact (@mgreen27 can help with that) but I think scanning for keywords in the message is not enough. The message sometimes does not interpolate ALL the data in the UserData area.

Case in point is Event 104 - log file cleared which does not include any of the user data in the message

I think we need to change that artifact to look for keywords anywhere in the event not just the message (which would hit on the user data fields even if they are not interpolated to the message).

mgreen27 commented 4 years ago

@scudette didnt we find a similar issue previously in the Powershell event log?

with regards to MessageHunter -> I think we can just serialise either the whole event log entry or data fields and run the regex filters over that. That do you think @chris-counteractive @scudette ? Maybe we then need to change the name to EvtxHunter in that case.

scudette commented 4 years ago

Yeah I thought the point of this artifact was as a more refined version of https://gist.github.com/scudette/b58a2ac2b4890bd18eedfcd900c244a7 - changing the name will good as it is more descriptive.

chris-counteractive commented 4 years ago

@scudette makes perfect sense, glad there was a quick fix - your debugging prowess as usual on full display! glad to have helped in some small way.

I don't have strong feelings about MessageHunter, @mgreen27 - the idea of serializing and searching across all fields (maybe even field names too) seems like a good call for inclusivity, as is the naming change. I certainly see the limitations of looking through the Message value alone.

In our workflow (3rd-party DFIR/hunt, often at clients without SIEMs or aggregators) we've been post-processing evtx files with winlogbeat and shipping them to elastic to take advantage of sigma rules, dashboards, saved queries, ECS normalization, etc. We'd like to reduce our tool count and customization by shipping the parsed logs directly from velociraptor, but for now there's a few limitations that still require some straightforward extra steps:

no built-in artifact for parsing all event logs. this is an easy fix, we just roll a simple Custom.Windows.EventLogs.All (which helped us find this bug actually). this might be useful for others, @scudette - is there a process for "vetting" a contribution of a new artifact like that? This is for the snapshot pull use-case, not event streaming. Similar to pulling the full MFT.
the elastic_upload function and related artifacts don't support convenient enrichment tools like filebeat processors or vector transforms. I'm sure that's all possible with VQL kung fu, and probably out of scope for velociraptor, so we just use filebeat. Taking winlogbeat out is nice because it needs a windows box, better for velociraptor to do the parsing work on the hosts themselves (with the side benefit of ensuring any lookups are local to the original environment).

thanks again for the quick diagnosis!

scudette commented 4 years ago

We accept pull requests! So just send your artifact suggestions via pull request to be included.

Re elastic, we do support pushing data directly to logstash so we can just be a beats replacement - you can continue doing all the usual enrichment with your current logstash configs.

https://www.velocidex.com/blog/medium/2019-12-08-velociraptor-to-elasticsearch-3a9fc02c6568/#slice--dice-with-logstash

mgreen27 commented 4 years ago

I just added a PR for this. It should make the query much more effective.

chris-counteractive commented 4 years ago

@scudette thanks! in polishing up our custom artifact i filed another issue (#717) regarding the GUI and artifact parameter defaults for timestamps. i'll get it submitted soon.

and logstash is a great option too, thanks for the reminder!

weslambert commented 4 years ago

@chris-counteractive I'd love a custom WEL artifact! I can certainly help test, if needed.

You mentioned picking up logs with Filebeat -- I am doing the same at the moment (as well as sending direct to ES, when necessary). I've considered re-writing artifacts so that they are ECS-compliant OOB, however, sometimes there can be complications with this, so I've begun mapping artifacts to ECS and developing the associated ingest pipeline config (can share more on that if you are interested). Have you considered mapping the fields defined here to an ingest pipeline? I'm thinking about doing something similar myself for Windows artifacts/logs, and would be interested if you see this as something that would be useful.

chris-counteractive commented 4 years ago

@weslambert totally useful, always interested in how others are tackling this, making evtx cool again. whether it's ingest pipelines, logstash filters, beats processors, or custom VQL, it's useful to post-process the evtx to a more normalized format, and ECS helps a lot.

Currently our "all logs" artifact is stupid simple, just select * from parse_evtx() with a glob, start and end timestamps, and a boolean for whether to check VSS. i'll post it as soon as i hear about #717. All the rest is filebeat, and yeah, we use that same field reference - we just try to match it to whatever winlogbeat setup creates in terms of index templates in elastic.

We try not to reinvent the wheel and have found you can get a lot of mileage just mapping EventData to winlog.event_data and using the processor scripts included with winlogbeat 🔥 For the rest you have to implement something akin to the core winlogbeat logic here, all under Apache 2.0, which we do some of in our processors but I know we could be more thorough. Always looking for a smarter way!

predictiple commented 4 years ago

@chris-counteractive Personally I think ECS is bunk and that Elastic have gone down the wrong road with it. Elasticsearch's dynamic schema is one of it's greatest strengths and yet their ECS goes against that by requiring that the schema be defined in advance. And then that schema involves heavily nested fields which leads to search problems like this: https://www.bmc.com/blogs/elasticsearch-nested-searches-embedded-documents/

For a common data type like evtx we certainly know the schema in advance, and the same can be said when using the bundled plugins for other data types, but Velociraptor can return any type of data and that core strength should not be constrained by predefined schemas.

IMHO the OSSEM schema is far better than ECS and well-aligned with the "arbitrary" data that Velociraptor returns, which for the most part is flat tabular data. For formats like evtx and auditd the VR plugins' default schema can easily be flattened in Logstash and mapped to OSSEM. With a flat schema you can leverage Elasticsearch's dynamic type mappings. I map everything into 'keyword' (categorical) type fields including numerics, and then make exceptions for a few well-known numeric fields where arithmetic might actually be required. For example received_bytes might need to be summed but I haven't yet encountered a situation where I'd need to subtract one event_id from another event_id, so 'keyword' type is appropriate for most numeric fields because they are categoricals rather than arithmetic values.

Using an OSSEM-aligned schema then also makes it much easier to take advantage of the Sigma rule base, since you are then essentially using the HELK field mappings. By taking the OSSEM route you forego most of the Elastic SIEM/Security features - because these presume ECS - but it looks like the new Elastic Detection Rules will still work (with just field name remapping). In addition, this guy wrote a Sigma converter for Elastic Detection Rules which means that you can use the Sigma rules with HELK field mappings if your data is OSSEM-aligned.

I've been really disappointed by Elastic Ingest Pipelines not living up to the hype, and the Elastic Beats all totally rely on Ingest pipelines. Logstash is far more capable in terms of transformation capabilities. Fortunately there is a way to "borrow" (I could think of other words) the effort that the Elastic people have put into the Beats Ingest pipelines. I get the impression they're trying to prevent this by making it obscure, so let me make it a bit clearer... ;-)

Elastic Beats contain "modules" which contain their Elastic ingest pipelines specified in YAML format (although they used to be JSON... hmm I wonder why they changed it?)... the latest version of Filebeat contains modules for 58 log types!
Logstash includes an ingest pipeline conversion tool that accepts JSON-format pipelines and produces native-format Logstash conf files.
Convert the YAML to JSON and then use the tool to convert the JSON to Logstash conf. The final output will not be production-ready but it gives you the basis for your own Logstash pipelines for all their log types.

Similarly you can use rsa2elk to "borrow" parsers for 300 data types from RSA NetWitness and use them to build Logstash pipelines for the ones you need.

HTH

weslambert commented 4 years ago

@predictiple good thoughts 👍 I'll agree that Logstash is more capable from a transformation perspective. However, in the past we had issues with Logstash being too heavy, too slow to start (10+ mins in some cases loading all the config/pipelines), and most of what we needed was able to be handled with Elasticsearch and Ingest pipelines -- seems like in newer releases, this load time has improved for Logstash. We now use either just Ingest or Ingest + Logstash to reduce the need for Logstash config, but can still take advantage of it if we want to. We also use Sigma and convert to Elastalert rules with our custom/ECS field mappings. With regard to predefined schemas -- we use this to prevent field explosion. Something with dynamic key/value pairs, could be dynamically parsed into a root field, like you suggested. I'll stop there though, bc I don't want to take this into a whole separate Elastic discussion 😄 .

predictiple commented 4 years ago

@weslambert I also don't want to take this off-topic but it's good to hear how others are thinking about and tackling the problem. There is an open issue https://github.com/Velocidex/velociraptor/issues/263 that addresses the broader topic of whether VR artifacts should somehow be standardised. My Logstash pipeline is such that any data collected by VR is automagically standardised (majority of fields end up matching OSSEM) and can be inserted into ES without any schema planning. Some data types get additional parsing, for example evtx. It works pretty well and so far I've never hit the 10000 field limit that I set on the ES side, despite throwing highly varied data at it and the fact that each field also has a tokenized ("text") sub-field. Also consider that some VR artifacts can generate data records that aren't timestamped (i.e. non-event data) but which are still relevant to an investigation and creating schemas for all of those ad-hoc things would be quite a challenge. I like the idea that people shouldn't have to worry about schemas when writing VR artifacts and I find it strange that anyone would be happy to go search for Windows.Event.System.EventID (ok slight exaggeration there to make the point) instead of event_id.

chris-counteractive commented 4 years ago

@predictiple I appreciate the perspective! Hopefully it's clear I don't think velociraptor should take a position on "officially" normalizing any logs; a clean parse with "original" names and structure lets folks use whatever schema makes sense for them.

I'm certainly not dogmatic about ECS, but have seen some benefits being able to use "out of the box" elastic security content without heavy customization. Apologies for nudging this issue into a tangent, but I'm glad to see others thinking about this stuff too.

For a related discussion, both you and @weslambert might be interested in this thread about ATT&CK data sources - a place to nerd out about data models outside of this bug report 😃 Thanks!

chris-counteractive commented 4 years ago

Confirmed fixed as of acaf17d (probably earlier, just tested latest). Thanks @scudette!

chris-counteractive commented 4 years ago

Also, opened #724 at @scudette's suggestion. Thanks!

scudette commented 4 years ago

Should be fixed in 0.5.2

Velocidex / velociraptor

parse_evtx does not create Message column when parsing Windows PowerShell.evtx #712