elastic / package-spec

EPR package specifications

Support ML trained models in integration packages #135

Closed: peteharverson closed this issue 3 years ago

peteharverson commented 3 years ago

The initial focus for adding ML components to integration packages will be including anomaly detection job configurations. However, the second phase will look at adding assets for pre-trained models, such as classification models for detecting DGA (domain generation algorithm) domains in security data.

The assets would include an ingest pipeline, and the pre-trained model, which typically range in size from 10s of MBs up to several GBs. There might also be an associated data stream, for example security data.

Due to its potentially very large size, the pre-trained model should be downloaded and installed on demand, separately from the dependent package. The model could just be downloaded via a link, rather than being part of a package.

The trained model will have a license type, but this is not contained within the model schema. Currently a platinum license is required to use the create trained model API, so the user should not be able to deploy the model if they don't have this license type.

It is likely that models will be updated over time, and the user should be able to upgrade the model. Models will be versioned, and this may be different to the version used for the dependent package. If possible, we should look to fit in with the current package manager upgrade mechanism.

We are looking at a time frame of 8.0 for this second phase of the ML - integration packages work.

mtojek commented 3 years ago

cc @ycombinator @ruflin

My first idea for handling (downloading) large assets would be sharing a CDN link to the asset. They can be lazily downloaded and we don't have to store them in the package-storage. The ML team can prepare and upload the content to some bucket (behind CDN).

stevedodson commented 3 years ago

@ruflin @andresrc @mtojek - the priority of this work has increased as we have several pre-trained models for the Security solution, and there are additional models in development (including some large >50MB models for NLP work in 8.x). Until we have a simple mechanism for users to install these models, it is difficult for users to use this valuable content.

These models manifest themselves as documents in an index, and are generally packaged with ingest pipelines and other assets.

I understand from @peteharverson there are issues packaging these as assets in existing packages, and it would be good to explore concretely what the options are.

For example, if there is a CDN link to the asset, how is this managed and who is responsible for the CDN, security and the asset?

Also, how is the asset then installed? This should be as seamless as possible for a user, and so needs to work inside Kibana - ideally from 'Integrations'.

It doesn't seem sensible for the ML team to create a new process or repository for managing these assets.

Let me know the best way to progress this.

ruflin commented 3 years ago

In general I like the idea of maintaining only one delivery mechanism, which today is a package. The package registry itself is mostly a glorified S3 bucket with a bit of search on top. There might be issues with how we release packages on GitHub with branches today once these larger assets are involved, but it is something we need to address at some point anyway, so I'll put this part aside to simplify things.

A parallel discussion is happening with the APM team around sub / reference packages to potentially ship APM Agents (@Mpdreamz will soon open an issue) which sounds similar. The main package with the ingest pipeline and dashboards should be quick to install and the model(s) can be pulled down separately.

Around the installation, one part I'm curious about is how we will handle this with Elasticsearch. Is each model many small ES documents (what is the size of each doc), and should we bulk load these, or do we likely need to stream the model? Today I think we keep a whole package in memory during installation, which might not be feasible here anymore (@jen-huang).

To move this forward, I think we should start to dive deeper into the technical details of how these models are installed, what APIs are used, and how they are upgraded, and then see how they fit best into a package, a reference package, etc.

stevedodson commented 3 years ago

++ on having a single delivery mechanism.

To clarify, our requirement is a package installation mechanism that can install artifacts such as ingest pipelines, scripts, and an ML model as a single package. All nicely wrapped up in Kibana!

In terms of installation, our current models are < 100MB, so we can install them via:

PUT _ml/trained_models/<model_id> where the body is a single JSON object (e.g. ML_DGA_model.json)
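For illustration, a minimal sketch of what such a request could look like (the model ID, input fields, and definition are placeholders, not the actual DGA model; the body would come from the packaged JSON file):

    # hypothetical create trained model request
    PUT _ml/trained_models/dga_model_example
    {
      "description": "Placeholder supervised classification model",
      "input": {
        "field_names": ["domain"]
      },
      "inference_config": {
        "classification": {}
      },
      "compressed_definition": "<base64-encoded compressed model definition>"
    }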

Internally, we then manage how this is represented as documents in ES.

If models are > 100MB we may need to add an API to upload the models in chunks (e.g. PUT _ml/trained_models/<model_id>/0, PUT _ml/trained_models/<model_id>/1), but this design is still TBD (@davidkyle).

mtojek commented 3 years ago

There might be issues with how we release packages on GitHub with branches today once these larger assets are involved, but it is something we need to address at some point anyway, so I'll put this part aside to simplify things.

Please also consider the timeline/roadmap: this might not happen soon (not before the redesign in https://github.com/elastic/package-registry/issues/670). We can't bundle everything into the Docker image at the moment and publish a distribution weighing a few GBs (we need to load the image into memory).

sophiec20 commented 3 years ago

Just to set expectations about the near term, there are currently two supervised models, 3MB and 45MB, that we would like to ship as packages. These are ready to go, but currently rely on a manual out-of-band deployment method, which limits adoption.

If there is a size tolerance we can work within for a "phase 1", that would allow these models to be more easily on-boarded within solutions. I understand the concerns, and can see how growth in the number and size of supervised models would require a redesign.

ruflin commented 3 years ago

@mtojek Would it help if we set a size limit (100MB for example) on assets for now? To be honest, it is something we should likely introduce anyway for all the assets.

@jen-huang @joshdover Are there any limitations in Kibana on the max size of a document that can be pushed to an Elasticsearch or Kibana API?

mtojek commented 3 years ago

@mtojek Would it help if we set a size limit (100MB for example) on assets for now? To be honest, it is something we should likely introduce anyway for all the assets.

I'm not convinced about setting such a limit. With the current solution design, packages should tend to be as small as possible. With 100MB per file we would allow multiple revisions, say 10, each weighing 100MB. We can easily reach a few GBs that way, and the Package Registry would become undeployable.

I don't see any perfect option with the current design, but a CDN appears to be a good workaround.

cc @jsoriano

jsoriano commented 3 years ago

Let's separate this into two problems:

  1. Define the package spec for ML trained models by opening a PR in this repo.
  2. Improve handling of big packages in the registry (as part of https://github.com/elastic/package-registry/issues/670).

For the first point I think we can already start. Thinking about the smaller models mentioned in https://github.com/elastic/package-spec/issues/135#issuecomment-888185701, some tens of MB will be noticed in the current registry but shouldn't be a problem. Once the spec is known, the Fleet team can start working on supporting the installation of these assets.

I would avoid setting size limits for now (in the spec); having big packages looks like a feasible use case. But informally we should keep an eye on the models introduced, given the limitations of the current registry.

I would also avoid having packages with links to external resources if possible; this can complicate other efforts such as the offline registry or package signing.

joshdover commented 3 years ago

Are there any limitations in Kibana on the max size of a document that can be pushed to an Elasticsearch or Kibana API?

Yes, there is:

ruflin commented 3 years ago

@joshdover As package sizes are likely to grow in the future, we should investigate in Fleet whether there are any side effects, but we can discuss this separately.

peteharverson commented 3 years ago

I will work with @alvarezmelissa87 to open a PR in this repo which defines the new ML trained model asset type. We can then work with the Fleet team to add support for installing the asset using the existing create trained model API.
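For illustration only, such an asset could sit next to the other Elasticsearch assets in a package, along the lines of the hypothetical layout below (directory and file names are assumptions, not the final spec):

    example_ml_package/
      manifest.yml
      elasticsearch/
        ml_model/
          dga_model_example.json   <- JSON body passed to PUT _ml/trained_models/<model_id>
      data_stream/
        ...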

andrewkroh commented 3 years ago

What integration packages will use these ML models? Or are they only used within the package that contains them? I assume there will be an ingest pipeline that uses the inference processor with these models, but what data is going through that pipeline?

For example, with the DGA model, I'm wondering about the possibility of using it to enrich several existing data streams like Packetbeat DNS, Zeek DNS, Suricata DNS, and Sysmon DNS.

The reason I ask is that there aren't currently any inter-package dependencies. So if that is a use case, we'll want to plan for how it works.
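As a hedged sketch of the kind of enrichment described above (pipeline name, model ID, and field mapping are hypothetical, not taken from the actual DGA package), an ingest pipeline could run the inference processor over the DNS question name:

    # hypothetical enrichment pipeline using the inference processor
    PUT _ingest/pipeline/dns_dga_enrichment_example
    {
      "description": "Scores DNS queries with a hypothetical DGA model",
      "processors": [
        {
          "inference": {
            "model_id": "dga_model_example",
            "target_field": "ml.inference.dga",
            "field_map": {
              "dns.question.name": "domain"
            }
          }
        }
      ]
    }

The open question is then how the data streams of other integrations would get wired up to call such a pipeline.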

alvarezmelissa87 commented 3 years ago

@andrewkroh - The models will only be used within the package that contains them. Yes, there is an ingest pipeline included. As for what data will be used, that's something we're still working out, but the long-term plan is to include some test data to get the user started.

As for inter-package dependencies, we haven't discussed that yet as far as I know, but once the user has the model and pipelines installed, they should be able to leverage those for enriching other incoming data, though that would take some manual setup on the user's side.

andrewkroh commented 3 years ago

they should be able to leverage those for enriching other incoming data, though that would take some manual setup on the user's side.

We should document how users can leverage these from other integrations because it's an important use case (perhaps the primary one?). AFAIK even with manual setup it's not possible (unless users manually modify pipelines installed by other integrations and those changes would be lost on package upgrades). I would like us to think through this use case.

ruflin commented 3 years ago

++ on thinking through this use case. I would also like to learn more about the ingest pipeline that is shipped with it. Any example?

sophiec20 commented 3 years ago

For reference, a blog describing how to on-board DGA is here: https://www.elastic.co/blog/supervised-and-unsupervised-machine-learning-for-dga-detection and one for ProblemChild is here: https://www.elastic.co/blog/problemchild-generate-alerts-to-detect-living-off-the-land-attacks. Both contain links to the detection-rules releases, which include the pipelines.

alvarezmelissa87 commented 3 years ago

Closing this via https://github.com/elastic/package-spec/pull/204