Discuss strategies to reduce number of fields #2839

Closed: jsoriano closed this issue 8 months ago

jsoriano commented 2 years ago

There are some packages in this repository with too many fields on their data streams. Having too many fields can lead to performance degradation and is usually discouraged (see related ES docs).

In https://github.com/elastic/package-spec/pull/278 we introduced a limit of 1024 fields per data stream, but we had to increase it to 2048 in https://github.com/elastic/package-spec/pull/294 to keep builds green for some packages. We would like to have a lower limit, at least by default.

Some questions to discuss:

The packages with more than 1024 fields are:

cc @ruflin @mtojek

mtojek commented 2 years ago

Let's ping folks behind these packages:

Netflow @andrewkroh @marc-gr
Osquery Manager @melissaburpo @aleksmaus

andrewkroh commented 2 years ago

It would be helpful to break down the contents of the netflow integration's fields, so I've included a table for reference (I also looked at all the data streams: https://gist.github.com/andrewkroh/885e28b1cdafbeacf0fca20b062a6de2).

There are 1322 netflow.* fields and 425 other fields. The netflow fields come from IANA specifications (~500) and various vendor extensions. Each field has a specified data type. netflow.* alone puts us over the 1024 limit.

Looking at this list, I suspect that a small number of the included ECS fields are unused. For example, as.* is normally not used at the root of an event per ECS.

I'm open to ideas, but I don't see a good way to drastically reduce the field count for netflow.

Count Namespace
2 @timestamp
5 agent
2 as
30 client
16 cloud
10 container
3 data_stream
31 destination
18 dns
1 ecs
5 error
22 event
22 file
2 flow
8 geo
3 group
4 hash
38 host
10 http
1 input
1 labels
11 log
1 message
1322 netflow
11 network
23 observer
2 organization
6 os
10 package
16 process
1 related
30 server
7 service
31 source
1 tags
7 threat
1 trace
1 transaction
13 url
9 user
10 user_agent

jsoriano commented 2 years ago

@andrewkroh thanks for the analysis. I agree that it can make sense to have so many fields in some cases, but I wonder if we could do something to avoid them being indexed.

How are these netflow.* fields used? Could it be an option to convert the netflow object to the flattened type? Fields would lose their data types (they would behave like keywords), but maybe this could be addressed with runtime fields if needed. We would need to add support for runtime fields though.
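
As a rough sketch of that option (the template name here is made up), the whole netflow object would be mapped as a single flattened field:

```
PUT _component_template/netflow-flattened-sketch
{
  "template": {
    "mappings": {
      "properties": {
        "netflow": {
          // one mapped field instead of 1322; leaf values are indexed
          // as keywords and remain queryable by dotted path
          "type": "flattened"
        }
      }
    }
  }
}
```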

In any case, if we think that a limit of 2048 fields is not so bad, we can probably go on with this.

I would be cautious about adding ways to circumvent this limit, because it could produce packages with way too many fields once we open development to more teams.

ruflin commented 2 years ago

Do we need to index all these fields? My assumption is that the majority of them will not show up in all events. Would it be possible to index only a small portion and, for the rest, either use runtime fields in the mappings or at query time? Which of these fields do we use for the dashboards? https://github.com/elastic/package-storage/tree/production/packages/netflow/1.4.0/kibana/dashboard

Having the majority as runtime fields would still keep the template itself large, but I assume storage-wise it would be more efficient. @jpountz Is there a limit / recommendation on the max number of runtime fields someone should use for a single data stream / mapping?

If we go down that route, we could have two limits: one for indexed fields and one for non-indexed fields.
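
For illustration, a runtime field declared directly in the mappings might look like this (a minimal sketch; the index and field names are only examples):

```
PUT netflow-sketch/_mapping
{
  "runtime": {
    // evaluated at query time from _source; adds no per-document storage
    "netflow.bgp_next_hop_ipv4_address": {
      "type": "ip"
    }
  }
}
```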

aleksmaus commented 2 years ago

Osquery Manager @melissaburpo @aleksmaus

It's not clear how we can trim the number of mapped fields for osquery. Osquery has 279 tables with dozens of columns that can be queried with any possible query the user can come up with. We don't know in advance what fields users will be searching for. We are open to suggestions about what other options are available, hopefully without degrading the already supported functionality.

ruflin commented 2 years ago

When you mention 279 tables, it sounds like there already exists some grouping. What kind of queries are run? Is there normally one query per table? If yes, should we have one data stream per table (seems extreme too)? What about data retention, is it all the same across all fields / data structures? On ingestion, is one event always going into a single "table", meaning there are 279 different event types?

What about runtime fields, is this an option?

aleksmaus commented 2 years ago

We didn't put any restrictions in place; it can be any kind of query, including subqueries and joins of any complexity. Example: https://fleetdm.com/queries/detect-active-processes-with-log-4-j-running There is no reliable way to detect which column belongs to which table in the result. The mapping is the flat union of all the columns across the tables, mapped to the lowest common/compatible datatype.

What would the runtime fields experience be in Kibana? Can users filter/sort/dissect the data received from live queries right away with the existing Kibana UX?

ruflin commented 2 years ago

I had a conversation with @aleksmaus around how exactly the results data is queried in Elasticsearch itself. Results are grouped by query id, and normally there are a few hundred to at most a few thousand results (rows, ES docs). This means runtime fields could work great in this use case: the data from Elasticsearch is prefiltered on the query id, and runtime fields are then used for running queries on the resulting data. If we use runtime fields as part of the mappings, I would expect users to still have exactly the same experience in Elasticsearch or Kibana (for example Lens), but that's something that should be verified.
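
A sketch of that query pattern (the index name, query-id field, and column are illustrative, not the actual Osquery Manager schema):

```
GET osquery-results-sketch/_search
{
  // prefilter on the query id so the runtime field is only evaluated
  // against the few hundred to few thousand matching docs
  "query": {
    "term": { "query_id": "host-uptime" }
  },
  "runtime_mappings": {
    "osquery.uptime_seconds": { "type": "long" }
  },
  "sort": [
    { "osquery.uptime_seconds": "desc" }
  ]
}
```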

@jsoriano @mtojek Are runtime fields today supported in the package spec for mappings?

andrewkroh commented 2 years ago

Regarding Netflow, while there are many possible fields, in my experience only a small subset is used, depending on the vendor sending data. My Ubiquiti router, for example, populates 26 of the netflow.* fields. Aside from a large mapping, do these extra unused fields create larger indices (I thought they did not)? If they do, we could recommend that users use separate data streams for different vendors (via namespaces) to avoid sparsity issues caused by each vendor using a different subset of netflow fields.

How are these netflow.* fields used? Could it be an option to convert the netflow object to the flattened type?

This data is mostly metrics about network flows. There are numbers, IPs, dates, and keywords. Only about 17% are keywords. I don't think flattened would be a good fit for the metrics data. We could possibly lump all of the keyword fields into a single flattened field.

   4 "boolean"
   8 "float"
  12 "double"
  27 "date"
  66 "ip"
 165 "integer"
 197 "short"
 231 "keyword"
 612 "long"

Do we need to index all these fields?

Probably not, but it's hard to say which metrics a user will rely on heavily in order to choose the fields to index.

jsoriano commented 2 years ago

@jsoriano @mtojek Are runtime fields today supported in the package spec for mappings?

They aren't supported yet; we have this issue: https://github.com/elastic/package-spec/issues/39. We could prioritize it if we find that it could solve this kind of issue.

It'd be good to know, though, whether it is better to have many unused runtime fields than many unused mappings.

If they do, we could recommend that users use separate data streams for different vendors (via namespaces) to avoid sparsity issues caused by each vendor using a different subset of netflow fields.

This could also be a good strategy if having many mappings is not a problem by itself.

jpountz commented 2 years ago

@jpountz Is there a limit / recommendation on the max number of runtime fields someone should use for a single data stream / mapping?

Runtime fields help by not impacting storage with high numbers of fields, but they still put overhead on things like the cluster state. My gut feeling is that they're not the right answer to this problem. Based on the data points in this issue, it looks like some of the fields never get populated. I'd be interested in seeing whether we could get these fields to never be mapped in the first place.

Regarding Netflow, while there are many possible fields, in my experience only a small subset is used, depending on the vendor sending data. My Ubiquiti router, for example, populates 26 of the netflow.* fields. Aside from a large mapping, do these extra unused fields create larger indices (I thought they did not)?

Unused fields do not make indices larger; in fact, Lucene never learns about fields that exist in mappings but not in documents. Only Elasticsearch knows about these fields.

My first intuition is that handling data where the set of fields actually used in practice is not known in advance is a good fit for either flattened, when the set of fields is unbounded, or dynamic mappings, when we know that there can only be so many different fields. In the case of the Netflow integration and its netflow.* fields, could we rely on dynamic mappings (possibly with a few dynamic mapping rules for specific types like IP addresses)?
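
For example, rules along these lines could type the address fields while the remaining strings fall back to keyword (the patterns are illustrative and would need checking against the real field list):

```
PUT _component_template/netflow-dynamic-sketch
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          // fields ending in a full address suffix become ip; a twin
          // rule for *_ipv6_address would look the same
          "netflow_ipv4": {
            "path_match": "netflow.*_ipv4_address",
            "mapping": { "type": "ip" }
          }
        },
        {
          // any other netflow.* string becomes keyword rather than text
          "netflow_strings": {
            "path_match": "netflow.*",
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}
```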

andrewkroh commented 2 years ago

In the case of the Netflow integration and its netflow.* fields, could we rely on dynamic mappings (possibly with a few dynamic mapping rules for specific types like IP addresses)?

That sounds like a good approach.

I would keep the mappings in place for the float/double/date/ip fields and rely on dynamic mappings for all the other netflow.* fields. The reason for keeping float/double is to prevent getting the mapping wrong in case the first value happens to be a 0 or some other non-floating-point value. The reason for date is that date_detection is turned off by Fleet. And ip is because I don't see a reliable path_match rule to apply. If we were willing to cause a breaking change, we could rename some fields to establish a naming convention that makes it trivial to apply path_match rules. (That's something to keep in mind for new integration development.)

  8 "float"
  12 "double"
  27 "date"
+ 66 "ip"
-----------
  113 netflow fields
+ 425 non-netflow fields
-----------
  538 total fields
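
Put together, the mappings would keep explicit definitions for only those ~113 float/double/date/ip fields and leave the rest to dynamic mapping, roughly like this (one explicit field shown; a sketch, not the final template):

```
PUT _component_template/netflow-reduced-sketch
{
  "template": {
    "mappings": {
      // longs, shorts, and keywords are left to dynamic mapping
      "dynamic": true,
      "properties": {
        "netflow": {
          "properties": {
            // kept explicit because dynamic detection could infer the
            // wrong type from the first value seen
            "post_nat_source_ipv4_address": { "type": "ip" }
          }
        }
      }
    }
  }
}
```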

ruflin commented 2 years ago

On the IP side, it would be nice if ES had a feature to "detect" IP addresses based on the pattern of these fields.

@andrewkroh For the naming convention, are you thinking of something like .ip or _ip for matching? You mention 425 non-netflow fields. What are these?

andrewkroh commented 2 years ago

@andrewkroh For the naming convention, are you thinking of something like .ip or _ip for matching?

Yes, I was thinking of a suffix based on data type, like _ip. But I'm not planning any changes now because it would be a breaking change.
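
With such a convention in place, a single dynamic rule would cover every address field (hypothetical, since the rename is off the table for now):

```
PUT _component_template/ip-suffix-sketch
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          // any new field named *_ip gets typed as ip,
          // with no per-field mapping needed
          "ips_by_suffix": {
            "match": "*_ip",
            "mapping": { "type": "ip" }
          }
        }
      ]
    }
  }
}
```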

You mention 425 non-netflow fields. What are these?

These are the fields I mentioned in https://github.com/elastic/integrations/issues/2839#issuecomment-1071132193. They are mostly ECS fields and a few Filebeat fields. And as I mentioned there, I suspect that many of them are unused and that the list of ECS fields could be drastically pruned if we did a thorough analysis.

jpountz commented 2 years ago

On the IP side, it would be nice if ES had a feature to "detect" IP addresses based on the pattern of these fields.

We could do something like that. We'd just need to be careful: IP addresses can be very short strings like ::1, so we should make sure that none of the non-IP fields could ever take a value that looks like an IP address.

Out of curiosity, is the type known on the agent side? If so, could the agent send dynamic mappings as part of the bulk request to help Elasticsearch make the right decision?

Separately, I've been looking at the fields of the Netflow integration, and it looks like we're always using different field names for IPv4 and IPv6 addresses, e.g. netflow.post_nat_source_ipv4_address and netflow.post_nat_source_ipv6_address. It makes it hard to e.g. compute top IP addresses across both fields. I guess we're doing this to reflect the fields that are populated by the Netflow integration, but it likely makes the data harder to analyze than if both IPv4 and IPv6 addresses were stored in the same field.

andrewkroh commented 2 years ago

it looks like we're always using different field names for IPv4 and IPv6 addresses, e.g. netflow.post_nat_source_ipv4_address and netflow.post_nat_source_ipv6_address. [...] I guess we're doing this to reflect fields that are populated by the Netflow integration

Rather than modify the original netflow data, the Filebeat input passes the fields through with minimal changes. This includes keeping the original field names (with a change to snake case).

So post_nat_source_ipv4_address maps to IPFIX postNATSourceIPv4Address. If you want a normalized field to aggregate on, you could use ECS source.nat.ip (I think that's what both post_nat_source_ipv4_address and post_nat_source_ipv6_address map to).

ruflin commented 2 years ago

For the ip fields, could we match on *IPv4*, *IPv6* or similar? Does ES support something like this?

ruflin commented 2 years ago

Linking to https://github.com/elastic/kibana/issues/128152 here as "too many fields" in a data stream also has effects on query.default_field.

jpountz commented 2 years ago

Elasticsearch does support matching field names using wildcards, but it looks like it wouldn't work here, since some fields have ipv4 in their names but are not IP addresses, e.g. netflow.destination_ipv4_prefix_length (a short).
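
To make the failure concrete, a naive rule like this would mis-type that field:

```
PUT _component_template/ipv4-wildcard-sketch
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "ipv4_by_name": {
            // also matches netflow.destination_ipv4_prefix_length,
            // which holds a short, not an IP address
            "match": "*ipv4*",
            "mapping": { "type": "ip" }
          }
        }
      ]
    }
  }
}
```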

pzl commented 2 years ago

Wanted to add some discussion here about what to do in the endpoint package. Most of our data stream field counts are under control:

Data Stream Field count
action_responses 35
actions 37
alerts 1456
collection 20
file 171
library 161
metadata 54
metrics 107
network 153
policy 104
process 345
registry 114
security 107

With alerts being the outsized problem here.

By namespace:

Count Namespace
1 ecs
1 Events
1 message
1 @timestamp
2 dns
2 elastic
3 data_stream
3 registry
5 agent
6 Memory_protection
7 group
10 rule
11 Responses
12 destination
12 source
16 event
17 Endpoint
17 user
30 Ransomware
38 dll
47 host
92 file
309 Target
389 process
424 threat

I'm not sure what a good reduction strategy would be here.

mtojek commented 2 years ago

@pzl Regarding the "alerts" datastream, do you use runtime fields or standard ones?

pzl commented 2 years ago

Only standard, I believe. The fields.yml is here.

mtojek commented 2 years ago

I admit that I don't know the domain and you're the experts there, but it's hard to believe that we need so many fields. What are the typical queries for this data? Similar question as here: https://github.com/elastic/integrations/issues/2839#issuecomment-1072536179

ruflin commented 2 years ago

@pzl Who creates the alerts data? Is this shipped by the endpoint binary?

pzl commented 2 years ago

Yes, this is for data sent by the endpoint binary. I am reaching out to find a person who can speak to the particular uses of the alerts index.

kevinlog commented 2 years ago

@ruflin @mtojek I met with some stakeholders on the Security side regarding our usage of the alerts datastream.

Currently, this index sits at 1456 mapped fields. The alerts datastream is the one that the Endpoint Security binary streams all alert documents to; these represent potential threats on our users' remote machines. For instance, if a malicious file attempts to run on a host, the Endpoint will detect it and then send an alert document to ES to notify the Security user/analyst that a machine on their network was attacked. Alerts are the most important type of data that the Security Endpoint streams to ES and are the heart of most use cases in an Endpoint Detection and Response product.

There are about 5 different alert types that can be streamed to this datastream. Many mapped fields overlap; however, there are also several sets of mapped fields that would only be streamed for a particular alert type. In addition, depending on the details of a potential attack, even alerts of the same type can stream different fields. This is the reason for so many mapped fields. It is also important to note that a single alert document will certainly not contain a majority of these fields at once, but the variability of alerts is the driver for the number of fields. @magermark could potentially give a rough estimate of the number of fields alert docs have on average, if that's helpful.

We're apprehensive to prune too many fields because we don't want to limit users on how they can search, build rules, and visualize their alerts data. There are certainly fields we could remove from mappings, but we're likely to add additional fields in the future, so while we could potentially reduce the number to ~1024, it would be a fairly temporary fix as we add more features to Alerts.

Some options we discussed:

Happy to explore other options. Perhaps we could make more use of dynamic mappings or flattened objects, but I would need to understand more about them before deciding which fields to replace.

cc @magermark @joe-desimone @pzl @ferullo

ruflin commented 2 years ago

Does the endpoint binary know about the type of each field? Or does endpoint rely on Elasticsearch to define the mapping? I'm asking because one option could be that the field type is shipped as part of the event instead of having the mapping set in advance. But this would likely cause "too many permissions" problems.

One of the key points above is that the number of fields is fixed and known in advance. It sounds like the number could still increase, but it is not as if it will double in just a few weeks. Having that many fields is not necessarily a problem, but it often indicates a smell. For the alerts scenario, though, I'm OK if we increase the limit.

My general concern is that if we introduce this flag, other data streams will just enable it without having the detailed discussion we had here.

Is using runtime fields instead of mapped fields an option? What is the maximum number of alerts that could exist in such a data stream and would need to be crawled?

botelastic[bot] commented 1 year ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

zez3 commented 8 months ago

:(