coolacid opened this issue 10 years ago
My own internal convention has largely been underscore-based (ie source_ip, etc) - this is mostly because I can add dynamic mappings in Elasticsearch easily, without caring what the full path to the field is or otherwise worrying with configuring ES to find the field properly, so I can ensure that fields named _ip are stored as IPs with a raw field that's an unanalyzed string, while _count fields are longs without a second component, etc.
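For illustration, the kind of dynamic template that makes this work might look like the sketch below (1.x-era index template syntax; the `logstash-*` pattern is just an assumed example):

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "ip_fields": {
            "match": "*_ip",
            "mapping": {
              "type": "ip",
              "fields": {
                "raw": { "type": "string", "index": "not_analyzed" }
              }
            }
          }
        },
        {
          "count_fields": {
            "match": "*_count",
            "mapping": { "type": "long" }
          }
        }
      ]
    }
  }
}
```

Any newly seen `source_ip` or `client_ip` field then gets the `ip` mapping (plus the unanalyzed `raw` sub-field) automatically, and any `*_count` field is mapped as a `long`, with no per-field configuration.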
Ok - so underscores are important (vs other options?) What about abbreviations - ie: src vs source.
What things do we need to consider in the standards?
I generally opt for verbose/explicit where possible, so I'd tend to write "source_ip" (though I contradict myself and use "dest" rather than "destination", so take that with a grain of salt).
Verbose vs. screen real estate -- this is why I opted for src_* instead.
More comments welcomed -- and send to your feeds ;)
`src` and `dest` are pretty standard abbreviations in UNIX. Basically, if it's common in UNIX or Linux, then we might as well follow the convention. I would just avoid new abbreviations: I would prefer `checksum` vs `chksm`, or `rubydebug` vs `rbdbg`.
If we're just talking about logstash conf files, then I would opt for:

- underscores (`_`) over hyphens (`-`)
- 4 space tabs. (Of course, heh)

Regarding fields, I don't have as much of a preference. I don't usually change field names.
@shurane I want to change field names -- when I search for, say, src_ip, I'd like results from all logs for that IP. I wouldn't want to have to build a query with multiple fields to say the same thing.
From a programmer's perspective I love having sub-fields (src.ip, src.port), but like @torrancew mentioned, it's easier to match against fields (_ip, _port). Although with the mappings these days, I would think it should be possible to do .ip/.port?
I like the idea of sub-fields too - especially with the possibility of Kibana supporting a tree of fields (not saying it will/does, but it would be cool) -- need to test .ip/.port somehow ;)
@electrical Agreed. I'm sure there's a way to template sub-fields, but I've found extending the template to be tedious and painful, at best, and haven't had the patience to nail it down for sub-fields.
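For reference, dynamic templates do support `path_match`, which should cover sub-fields; a minimal, untested sketch (the sub-field names are illustrative):

```json
{
  "dynamic_templates": [
    {
      "ip_subfields": {
        "path_match": "*.ip",
        "mapping": { "type": "ip" }
      }
    },
    {
      "port_subfields": {
        "path_match": "*.port",
        "mapping": { "type": "integer" }
      }
    }
  ]
}
```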
@coolacid If you like the idea of subfields, and your intention is to be able to parse/normalize a great deal of different log types, you could think about prefixing network-related stuff with "net.*".
You could generate fields like "net.blocked", "net.src.ip", "net.dst.port", "net.l4proto", for instance. This is how I organize the normalized logs in MLSec Project.
This gives me some taxonomy flexibility when I am enriching stuff with passiveDNS data or other sources.
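As a rough sketch, normalizing into that scheme from a Logstash filter could be as simple as a rename stanza (the source field names here are invented):

```
filter {
  mutate {
    # Map device-specific keys onto the net.* taxonomy.
    rename => {
      "srcip"   => "[net][src][ip]"
      "srcport" => "[net][src][port]"
      "dstip"   => "[net][dst][ip]"
      "dstport" => "[net][dst][port]"
      "proto"   => "[net][l4proto]"
    }
  }
}
```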
Convo with @untergeek on IRC suggests subfields will be fine - so that's what we'll go with.
Comments on the net.* header -- I don't think it's a bad idea; it would allow us to break out other specific items, which I had to do with things like AV engines.
If it helps as a reference here's the json standard we settled on in MozDef: http://mozdef.readthedocs.org/en/latest/usage.html#json-format very similar to this discussion (cept I hate underscores, but I'm getting over it).
The most helpful part was separating tiers of the event into standard and custom/detail fields. Standard ones (category, severity, etc.) are at the top level of the JSON doc, along with a human-readable 'summary' field (think syslog MSG). Details are the things you would parse out of the MSG, or tack on if you have a custom event source (like CloudTrail, auditd, compliance data, vulnerability data, etc.).
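Roughly, an event in that shape looks something like this (values invented; see the MozDef doc linked above for the authoritative field list):

```json
{
  "category": "authentication",
  "severity": "INFO",
  "summary": "user jdoe logged in from 10.0.0.5",
  "details": {
    "username": "jdoe",
    "src_ip": "10.0.0.5"
  }
}
```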
Starting to put some suggestions in here:
https://github.com/coolacid/GettingStartedWithELK/wiki/Field-Standards
Feel free to start adding other ideas.
Going to propose the following (beyond the above-mentioned wiki describing field data).
The Type field should be the type of device sending the data - i.e. Apache, nginx, or whatnot - primarily for filters in the Logstash pipeline.
The DataType field should be the data type - i.e. Firewall, AntiVirus, WebLog - so that Kibana can push a single view for like data.
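In Logstash terms, the proposal might look something like this sketch (paths and values are examples only):

```
input {
  file {
    path => "/var/log/apache2/access.log"
    type => "apache"
  }
}

filter {
  if [type] == "apache" {
    # Type drives the pipeline; DataType groups like data for Kibana views.
    mutate { add_field => { "DataType" => "WebLog" } }
  }
}
```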
@coolacid I like your idea of the DataType field. I'm currently using the tags field for this purpose, but I can definitely see the benefit of creating a separate field for this.
I tend to agree with @torrancew about disliking uppercase field names. Keeping everything lowercase makes it simple, because you don't have to guess which characters are capped.
I also like the idea of subfields, but the structure will need to be thought out very carefully. I feel that currently existing standards should be adhered to as much as possible.
As far as the net.*, you could do it like this:

- net.proto = protocol
- net.proto.flags = protocol flags
- net.src.int = source interface
- net.src.ip = source IP
- net.src.port = source port
- net.dst.int = destination interface
- net.dst.ip = destination IP
- net.dst.port = destination port
But remember, there are protocols that don't have source or destination ports, like ICMP. In ICMP, you have a type and a code. What do we do in a case like this? Also, in the case of ICMP on a Cisco ASA, you're given faddr (foreign address), laddr (local address), and gaddr (global NAT address); it doesn't specify the actual source and destination in some cases. What do we do in that case, where we can't necessarily extract all the information we need out of a log message? This is especially the case with Cisco, where some log messages will provide plenty of data, whereas other log messages will provide a very minimal amount.
NAT will also potentially pose issues to a standard like this. There's going to be a lot to take into account.
Other potential questions: How do you differentiate between a packet and a flow? Where do we put things like bytes, packet count (in the case of a connection or flow), flow duration? How about NAT or a stateful firewall connection ID?
One case from my ongoing project (https://github.com/mepholic/cisco-asa-ls-patterns/): with some of the patterns I currently have defined, I can extract log data on the HTTP and FTP inspection that the firewall performs. Included in this data is source and destination IP and port information, but it also has some extra info like FTP user and file, or HTTP URL. This is still technically a network event, as it was extracted from inspection logs, but it also contains application-layer data. Does anyone have any suggestions for standardizing field names for data like this?
Encoding things in field names is generally a bad idea for obvious reasons, but one convention I am finding useful is to capitalize (upper-case) the field names for data that is pulled directly from a log entry, and lower-case the field names for derived data or metadata. Kibana, for example, then neatly separates the two types.
I am in the middle of a config for Exim email logs that will work in many different ways, e.g. for analytics or diagnostics, and there are a large number of email addresses, some with subtle differences in meaning. Some of those addresses are mentioned directly in the logs, which is great for diagnostics, and some are derived to make things like throughput easy to graph (analytics). Being able to tell the difference at a glance is useful.
@mepholic "How do you differentiate between a packet and a flow?"
You don't: A single packet is the shortest example of a flow!
"This is still technically a network event, as it was extracted from inspection logs, but it also contains application layer data"
You could tag these events by (OSI/ARPA) layer. ARPA is probably best although you have what is generally known as a layer 7 filter 8)
Units: Do you put the units in the field name as a suffix or rely on documentation?
Is it bytes or bits? 1024 or 1000? You can rarely tell from inspection.
My personal preference is generally documentation.
I guess this was done? https://github.com/elastic/ecs
Given this project will work on "drop and go" filters for devices by type (i.e. an input sets type to "ApacheCombined", and our filter is everything that needs to happen in Logstash for that type), we need to come up with a set of standards.
I'd like to discuss those standards here - e.g., traffic sources.
Some things to get started:
The concept is that the filter should modify any fields to the correct "standard". For example, a KV-formatted firewall log should have a stanza that renames fields to the correct "standard", as sketched below.
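A hedged sketch of such a stanza (the incoming key names like `src`/`spt` are hypothetical):

```
filter {
  if [type] == "firewall" {
    # Parse key=value pairs out of the message.
    kv { }
    # Rename vendor-specific keys to the agreed standard.
    mutate {
      rename => {
        "src" => "src_ip"
        "spt" => "src_port"
        "dst" => "dst_ip"
        "dpt" => "dst_port"
      }
    }
  }
}
```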
Thoughts and other things welcomed.