coolacid opened this issue 10 years ago
My own internal convention has largely been underscore-based (ie source_ip, etc) - this is mostly because I can add dynamic mappings in Elasticsearch easily, without caring what the full path to the field is or otherwise worrying with configuring ES to find the field properly, so I can ensure that fields named _ip are stored as IPs with a raw field that's an unanalyzed string, while _count fields are longs without a second component, etc.
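For illustration, the kind of dynamic template that makes this work might look like the sketch below (1.x-era index template syntax; the `logstash-*` pattern is just an assumed example):

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "ip_fields": {
            "match": "*_ip",
            "mapping": {
              "type": "ip",
              "fields": {
                "raw": { "type": "string", "index": "not_analyzed" }
              }
            }
          }
        },
        {
          "count_fields": {
            "match": "*_count",
            "mapping": { "type": "long" }
          }
        }
      ]
    }
  }
}
```

Any newly seen `source_ip` or `client_ip` field then gets the `ip` mapping (plus the unanalyzed `raw` sub-field) automatically, and any `*_count` field is mapped as a `long`, with no per-field configuration.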
Ok - so underscores are important (vs other options?) What about abbreviations - ie: src vs source.
What things do we need to consider in the standards?
I generally opt for verbose/explicit where possible, so I'd tend to write "source_ip" (though I contradict myself and use "dest" rather than "destination", so take that with a grain of salt).
Verbose vs. screen real estate -- this is why I opted for src_* instead.
More comments welcomed -- and send to your feeds ;)
`src` and `dest` are pretty standard abbreviations in UNIX. Basically, if it's common in UNIX or Linux, then we might as well follow the convention. I would just avoid new abbreviations: I would prefer `checksum` vs `chksm`, or `rubydebug` vs `rbdbg`.
If we're just talking about logstash conf files, then I would opt for:

- underscores (`_`) over hyphens (`-`)
- 4 space tabs. (Of course, heh)

Regarding fields, I don't have as much of a preference. I don't usually change field names.
@shurane I want to change field names -- when I search for, say, src_ip, I'd like results from all logs for that IP. I wouldn't want to have to build a query with multiple fields to say the same thing.
From a programmer's perspective I love having sub-fields (src.ip, src.port), but like @torrancew mentioned, it's easier to match against fields (_ip, _port). Although with the mappings these days, I would think it should be possible to do .ip/.port?
I like the idea of sub-fields too - especially with the possibility of Kibana supporting a tree of fields (not saying it will/does, but it would be cool) -- need to test .ip/.port somehow ;)
@electrical Agreed. I'm sure there's a way to template sub-fields, but I've found extending the template to be tedious and painful, at best, and haven't had the patience to nail it down for sub-fields.
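For reference, dynamic templates do support `path_match`, which should cover sub-fields; a minimal, untested sketch (the sub-field names are illustrative):

```json
{
  "dynamic_templates": [
    {
      "ip_subfields": {
        "path_match": "*.ip",
        "mapping": { "type": "ip" }
      }
    },
    {
      "port_subfields": {
        "path_match": "*.port",
        "mapping": { "type": "integer" }
      }
    }
  ]
}
```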
@coolacid If you like the idea of subfields, and your intention is to be able to parse/normalize a great deal of different log types, you could think about prefixing network-related stuff with "net.*".
You could generate fields like "net.blocked", "net.src.ip", "net.dst.port", "net.l4proto", for instance. This is how I organize the normalized logs in MLSec Project.
This gives me some taxonomy flexibility when I am enriching stuff with passiveDNS data or other sources.
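As a rough sketch, normalizing into that scheme from a Logstash filter could be as simple as a rename stanza (the source field names here are invented):

```
filter {
  mutate {
    # Map device-specific keys onto the net.* taxonomy.
    rename => {
      "srcip"   => "[net][src][ip]"
      "srcport" => "[net][src][port]"
      "dstip"   => "[net][dst][ip]"
      "dstport" => "[net][dst][port]"
      "proto"   => "[net][l4proto]"
    }
  }
}
```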
Convo with @untergeek on IRC suggests subfields will be fine - so that's what we'll go with.
Comments on the net.* header -- I don't think it's a bad idea; it would allow us to break out other specific items, which I had to do with things like AV engines.
If it helps as a reference here's the json standard we settled on in MozDef: http://mozdef.readthedocs.org/en/latest/usage.html#json-format very similar to this discussion (cept I hate underscores, but I'm getting over it).
The most helpful part was separating tiers of the event into standard and custom/detail fields. Standard ones (category, severity, etc.) are at the top level of the JSON doc, along with a human-readable 'summary' field (think syslog MSG). Details are the things you would parse out of the MSG, or tack on if you have a custom event source (like CloudTrail, auditd, compliance data, vulnerability data, etc.).
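Roughly, an event in that shape looks something like this (values invented; see the MozDef doc linked above for the authoritative field list):

```json
{
  "category": "authentication",
  "severity": "INFO",
  "summary": "user jdoe logged in from 10.0.0.5",
  "details": {
    "username": "jdoe",
    "src_ip": "10.0.0.5"
  }
}
```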
Starting to put some suggestions in here:
https://github.com/coolacid/GettingStartedWithELK/wiki/Field-Standards
Feel free to start adding other ideas.
Going to propose the following (beyond the above-mentioned wiki describing field data).
The Type field should be the type of device sending the data - i.e. Apache, nginx, or whatnot - primarily for filters in the Logstash pipeline.
The DataType field should be the data type - i.e. Firewall, AntiVirus, WebLog - so that Kibana can push a single view for like data.
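In Logstash terms, the proposal might look something like this sketch (paths and values are examples only):

```
input {
  file {
    path => "/var/log/apache2/access.log"
    type => "apache"
  }
}

filter {
  if [type] == "apache" {
    # Type drives the pipeline; DataType groups like data for Kibana views.
    mutate { add_field => { "DataType" => "WebLog" } }
  }
}
```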
@coolacid I like your idea of the DataType field. I'm currently using the tags field for this purpose, but I can definitely see the benefit of creating a separate field for this.
I tend to agree with @torrancew about disliking uppercase field names. Keeping everything lowercase makes it simple, because you don't have to guess which characters are capped.
I also like the idea of subfields, but the structure will need to be thought out very carefully. I feel that currently existing standards should be adhered to as much as possible.
As far as the net.*, you could do it like this:

- net.proto = protocol
- net.proto.flags = protocol flags
- net.src.int = source interface
- net.src.ip = source IP
- net.src.port = source port
- net.dst.int = destination interface
- net.dst.ip = destination IP
- net.dst.port = destination port
But remember, there are protocols that don't have source or destination ports, like ICMP. In ICMP, you have a type and a code. What do we do in a case like this? Also, in the case of ICMP on a Cisco ASA, you're given faddr (foreign address), laddr (local address), and gaddr (global NAT address); it doesn't specify the actual source and destination in some cases. What do we do in that case, where we can't necessarily extract all the information we need out of a log message? This is especially the case with Cisco, where some log messages will provide plenty of data, whereas other log messages will provide a very minimal amount.
NAT will also potentially pose issues to a standard like this. There's going to be a lot to take into account.
Other potential questions: How do you differentiate between a packet and a flow? Where do we put things like bytes, packet count (in the case of a connection or flow), flow duration? How about NAT or a stateful firewall connection ID?
One case from my ongoing project (https://github.com/mepholic/cisco-asa-ls-patterns/): with some of the patterns I currently have defined, I can extract log data on the HTTP and FTP inspection that the firewall performs. Included in this data is source and destination IP and port information, but it also has some extra info like FTP user and file, or HTTP URL. This is still technically a network event, as it was extracted from inspection logs, but it also contains application-layer data. Does anyone have any suggestions for standardizing field names for data like this?
Encoding things in field names is generally a bad idea for obvious reasons, but one convention I am finding useful is to capitalize (upper-case) the field names for data that is pulled directly from a log entry, and lower-case the field names for derived data or metadata. Kibana, for example, then neatly separates the two types.
I am in the middle of a config for Exim email logs that will work in many different ways, e.g. for analytics or diagnostics, and there are a large number of email addresses, some with subtle differences in meaning. Some of those addresses are mentioned directly in the logs, which is great for diagnostics, and some are derived to make things like throughput easy to graph (analytics). Being able to tell the difference at a glance is useful.
@mepholic "How do you differentiate between a packet and a flow?"
You don't: A single packet is the shortest example of a flow!
"This is still technically a network event, as it was extracted from inspection logs, but it also contains application layer data"
You could tag these events by (OSI/ARPA) layer. ARPA is probably best although you have what is generally known as a layer 7 filter 8)
Units: Do you put the units in the field name as a suffix or rely on documentation?
Is it bytes or bits? 1024 or 1000? You can rarely tell from inspection.
My personal preference is generally documentation.
I guess this was done? https://github.com/elastic/ecs
Given this project will work on "drop and go" filters for devices by type (i.e. an input sets type to "ApacheCombined", and our filter is everything that needs to happen in Logstash for that type), we need to come up with a set of standards.
I'd like to discuss those standards here - e.g., traffic sources.
Some things to get started:
The concept is that the filter should modify any fields to the correct "standard". For example, a KV-formatted firewall log should have a stanza that renames fields to the correct "standard", as sketched below.
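A hedged sketch of such a stanza (the incoming key names like `src`/`spt` are hypothetical):

```
filter {
  if [type] == "firewall" {
    # Parse key=value pairs out of the message.
    kv { }
    # Rename vendor-specific keys to the agreed standard.
    mutate {
      rename => {
        "src" => "src_ip"
        "spt" => "src_port"
        "dst" => "dst_ip"
        "dpt" => "dst_port"
      }
    }
  }
}
```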
Thoughts and other things welcomed.