elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.49k stars 8.05k forks source link

@kbn/config-schema is slow #78351

Open kobelb opened 3 years ago

kobelb commented 3 years ago

At the moment, @kbn/config-schema is slow because Joi is admittedly slow, and not focused on increasing performance in the short-term: https://github.com/sideway/joi/issues/2340#issuecomment-608003375.

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

Screen Shot 2020-09-23 at 1 48 39 PM

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

elasticmachine commented 3 years ago

Pinging @elastic/kibana-platform (Team:Platform)

rudolf commented 3 years ago

I didn't investigate this properly, so I would have to verify it again to be sure, but I remember seeing a significant portion of startup CPU time is also spent inside @kbn/config-schema. It would be worth investigating this, if it's significant there would be even more motivation to increase the performance of @kbn/config-schema.

Having said that, I don't think it's worth for Kibana to maintain it's own validation library. Because config validation is part of the Core API we need to have control over that so it makes sense to use @kbn/config-schema, but instead of spending time making that faster and increasing the API surface, I think we should rather limit the time we spend on it and limit the API to only what's necessary for config validation.

pgayvallet commented 3 years ago

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

kobelb commented 3 years ago

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

In the context of route handling, both options are similar in that we don't want to use Joi to do route level validation. However, I still think there are two different options in regard to what we do with @kbn/config-schema as a whole.

  1. Admit @kbn/config-schema is slow, and don't use it where performance matters
  2. Fix @kbn/config-schema so it's now fast, can be used everywhere.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

I don't think we have a consistent way of labeling performance issues, so feel free to use whatever labels you desire :)

pgayvallet commented 3 years ago

Admit @kbn/config-schema is slow, and don't use it where performance matters

(not sure if I should comment here or in https://github.com/elastic/kibana/issues/78353, but) are all route validations concerned by this, or only specific bottlenecks?

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

If all routes validations are considered as a potential performance issue, we need to find a faster alternative to joi , and document it as the preferred validation system for routing. But if we go down that road, 'Fixing' config-schema to change the underlying validation library is probably the way to go, as ideally, we wouldn't be using two distinct libraries.

As a sidenote, if we keep config-schema for route validation, we might want to rename it to something like @kbn/validation-schema or something.

mshustov commented 3 years ago

We will still need validation of some kind for routing

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library. My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Another concern is the tooling fragmentation. It's almost impossible to make sure all the validation library implements the same security model (for example, filtering out the raw values from error messages https://github.com/elastic/kibana/pull/58843).

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

kobelb commented 3 years ago

(not sure if I should comment here or in #78353, but) are all route validations concerned by this, or only specific bottlenecks?

This issue is meant to address all usages of @kbn/config-schema everywhere. Built-in route validation, custom route validation, config, etc.

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

Agreed, this is possible for the custom validation for the routes. This is what we did for Fleet, no longer use @kbn/config-schema or Joi for the route validation.

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library.

If we decide that @kbn/config-schema is slow and we aren't going to invest the effort to make it fast, I think we should stop encouraging plugins to use @kbn/config-schema for their route validation. With the way the API is currently created, it leads plugin developers to default to using @kbn/config-schema

My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

If a usage of @kbn/config-schema isn't in the hot path, it's not as large of a concern. However, I don't think we should be encouraging developers to use something that is slow and won't scale.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created https://github.com/elastic/kibana/issues/78353 to work-around this specific bottleneck.

mshustov commented 3 years ago

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created #78353 to work-around this specific bottleneck.

@kobelb I'm going to create a benchmark (after FF) to compare @kbn/config-schema with joi, io-ts to have particular numbers we can discuss here.

afharo commented 3 years ago

Sharing an existing benchmark I found comparing joi, io-ts and others: https://github.com/gcanti/io-ts-benchmarks

mshustov commented 3 years ago

3 years ago

ok, might require some work to actualize 😅

afharo commented 3 years ago

True 🤣 but we have the code in place to run it :)

afharo commented 3 years ago

Updated it in this PR: https://github.com/gcanti/io-ts-benchmarks/pull/4

The updated results in here: https://github.com/afharo/io-ts-benchmarks/blob/update-deps/README.md#results

wylieconlon commented 3 years ago

We recently chose to remove schema validation for TSVB because of the performance issue: https://github.com/elastic/kibana/issues/97061, but I think we are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

delvedor commented 3 years ago

In Fastify we use ajv, it offers amazing performances, and it also allows you to pre-compile the schema, speeding up even more the start-up phase. For generating the schemas without writing plain js objects, we have developed fluent-json-schema, while if you need a strongly typed solution, there is @sinclair/typebox (take a look here for other solutions).

kobelb commented 3 years ago

When evaluating alternatives, we should keep in mind that we'd like to minimize, if not eliminate, the places of server-side code that do code generation from strings at runtime. Code generation from strings is a common attack vector for turning prototype pollution or other logic errors into remote code execution.

delvedor commented 3 years ago

@kobelb if you don't generate the schemas each time at startup and the definitions don't change much, you can directly commit the generated validator code. This is what we do in Fastify, which is generated here every time we have a config change.

pgayvallet commented 3 years ago

When evaluating alternatives, we should also keep in mind that we'd ideally like feature parity with the parts of joi/config-schema the Kibana codebase is currently using. Else we will not be able to replace config-schema underlying lib and propose an alternative of config-schema instead.

E.g

Now, we would also need to check which features are currently only used for config validation (not necessary for the alternative), and which are used for actual route validation.

wylieconlon commented 3 years ago

I like that you're starting to evaluate alternatives, but maybe my question was lost in the discussion. Asking again:

We are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

What is your advice to current developers, and can this be shared widely among the Kibana contributors?

mshustov commented 3 years ago

Should we stop using them until a fix is made?

@wylieconlon Migrating from config-schema creates fragmentation in our tooling and it will be hard to switch plugins back to config-schema. Let's continue using the common library. When we address the problem, all the endpoints will benefit from it. Is there an example where config-schema validation affects response time? We can use it as a baseline for further refactoring.

Maybe we can improve the situation by updating joi version? According to the @afharo's benchmark, updating joi from v14 to v 17 gives almost x2 improvement for the valid dataset, x3 improvement for the invalid dataset.

Kibana is still on v13, in our case we might gain even more.

wylieconlon commented 3 years ago

@mshustov I'm not sure you've addressed the concerns that I have, and that @kobelb originally raised in this issue. It appears that no validation is a superior option to the current library.

There are two main examples where config-schema is a bottleneck. This issue started as a discussion based on performance bottlenecks in Fleet APIs, and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation, see this comment. You can also see the private issue linking to this one for more background.

mshustov commented 3 years ago

This issue started as a discussion based on performance bottlenecks in Fleet APIs,

IIRC Fleet had quite specific requirements to support up to 50k agents polling the Kibana server every X seconds. Since fleet switched to a custom server, we abandoned the current issue as there were no other examples, where @kbn/config-schema was a bottle-neck. Now, we can consider prioritizing it again.

and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation,

I haven't checked the size of the TSVB payload, let me see how big it is. Anyway, disabling validation is not a viable option, it's better to switch to another library. Do you want Core to provide recommendations on a substitution? It definitely requires additional investigation.

kobelb commented 3 years ago

It appears that no validation is a superior option to the current library.

What? How did you come to this conclusion? Validation is absolutely necessary and should not be removed because it incurs a performance penalty. If there's a situation where performing validation using @kbn/config-schema is too slow, we should use a different approach to validation, not remove it entirely. This is why we support custom validation for HTTP requests. We've had some teams succeed with using io-ts, but I don't have any first hand experience to comment on whether or not we should be recommending it.

Removing the airbags and seatbelts from a car will make it go faster, and also make it more likely for you to be killed/injured. Everything is a trade-off.

wylieconlon commented 3 years ago

You can find context for the decision to remove schema validation in this comment and the whole issue that it's a part of. I've also just linked you to more context in a private issue. We are still using prototype pollution checking on top of the Joi validation: https://github.com/elastic/kibana/pull/85952.

mshustov commented 3 years ago

I haven't checked the size of the TSVB payload, let me see how big it is.

@wylieconlon /api/metrics/vis/data route validation has been removed in https://github.com/elastic/kibana/pull/97091, so I cannot check validation performance on master. I downloaded v7.12 dist, installed the sample data set, and navigated to the [Logs] Web Traffic dashboard. According to my measurements, validation takes from 1ms to 4ms. Is there a way to reproduce the performance problem you mentioned?

JFYI: Some of the frames outlined in https://github.com/elastic/kibana/issues/97061#issue-857368681 belong to internal Core validation.

lizozom commented 1 year ago

@kobelb @lukeelmers what is the status on this issue?

kobelb commented 1 year ago

@lizozom @kbn/config-schema is still slow

pgayvallet commented 1 year ago

@kbn/config-schema is still slow

How dare you.

lukeelmers commented 1 year ago

@lizozom Yeah, it's still slow. In theory we got a 2x-3x performance improvement from the upgrade in https://github.com/elastic/kibana/pull/99899, but otherwise we haven't been actively working on this, and it isn't currently on the near-term roadmap.

If we want to prioritize this, my recommendation would be that we start by gathering a few scenarios where this is becoming a significant bottleneck in Kibana. If we had some numbers to use as a baseline that are from real-world Kibana use cases, and some repeatable benchmarks we could run, it'd give us a framework to understand how critical this is relative to other potential performance enhancements. (This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi)

pgayvallet commented 1 year ago

This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi

Ihmo we unfortunately just won't be able to provide feature parity with joi, at least not if the intent is to effectively have a performance improvement using the replacement. In short, given the features covered by joi (I'm thinking default values, custom validation message per type of error, or 99 other small things), no replacement lib actually does the same as joi faster, plain and simple (at least from the comparisons we did a while back).

However, I agree that identifying bottlenecks would be the key here. If we see that some routes are having their validation as a significant perf bottleneck, we could improve our custom route validation support (which, I'd like everyone to remember, is something that we already support) to have a better type support (in a similar way of what we're doing with config schema -> request body/params type inference/conversion). We could imagine, by improving our route validation types, to have the same kind of inference, for, say, io-ts.

Note that even if we were to do this, the code / route owners would still have to migrate their validation to the new validation lib, and to adapt their code for all the sugar that is not necessarily present in the replacement/alternative validation library (e.g last time I check, there wasn't native / out of the box support for default values for io-ts)

cnasikas commented 1 year ago

I recently came across to zod and I think it is a library worth considering. It is heavily inspired by io-ts. io-ts is an excellent library but I think it has quite the learning curve for developers without functional programming experience.

lizozom commented 1 year ago

I think it would be interesting to benchmark the impact of validation on several APIs, so we can quantify the problem. @kobelb @pgayvallet do you have an intuition as per which APIs might be heavily impacted by validation?

pgayvallet commented 1 year ago

do you have an intuition as per which APIs might be heavily impacted by validation?

By APIs, you mean HTTP APIs, right?

I don't really have any idea honestly. IIRC:

rudolf commented 1 year ago

I don't think doing a bottom-up approach to performance is fruitful. We need to start at the business level and work down to achieve our objectives. I would recommend we start by:

  1. Ask teams to define Service Level Objectives (SLO) for each of their APIs
  2. Track and report on the measured performance vs SLO using cloud proxy logs
  3. Investigate instances where we didn't meet our SLO
  4. Reproduce the problem then use Nodejs profiling to understand what are the CPU bottlenecks and address those

Otherwise we'll be doing premature optimisation without being sure what the business impact would be. The only way we could do a large scale generic performance improvement would be if we could collect CPU profiles from across cloud https://github.com/elastic/cloud/issues/103563 https://github.com/elastic/prodfiler/issues/2508