elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Ingest] Expose domainSplit() in ingest script processor and possibly aggregations #36359

Open LucaWintergerst opened 5 years ago

LucaWintergerst commented 5 years ago

Describe the feature: The domainSplit() Painless method splits domains into their parts (subdomain, TLD, ...). It was first introduced when Machine Learning was integrated into Elasticsearch, and it was exposed in scripted fields so that ML jobs that need that information can work.

However, this functionality is also incredibly useful as part of ingest. No other part of our stack has a substitute for it (apart from Packetbeat, which does something similar by default). There is also no good workaround: the public suffix list is required to do "good" domain splitting, and scripted fields alone cannot be used in many parts of Kibana. Furthermore, there is likely also a small performance hit.
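To illustrate why the public suffix list matters for "good" splitting, here is a minimal Python sketch of a split that simply treats the last two labels as the registered domain. `naive_split` is a hypothetical helper written for this example; it is not how domainSplit() is implemented.

```python
def naive_split(host):
    """Split a host name by assuming the registered domain is the last two labels."""
    labels = host.lower().rstrip(".").split(".")
    registered = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return subdomain, registered


print(naive_split("www.elastic.co"))     # ('www', 'elastic.co')
print(naive_split("www.example.co.uk"))  # ('www.example', 'co.uk'), which is wrong
```

The second result reports the public suffix `co.uk` as the registered domain, because nothing in the host name itself says that `co.uk` is a suffix under which names are registered. That knowledge has to come from the public suffix list.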

@rjernst and @polyfractal discussed this briefly and agreed that it makes sense to have.

One remaining question to work out is whether it also makes sense to have this available in scripted aggregations.

elasticmachine commented 5 years ago

Pinging @elastic/es-core-infra

jakelandis commented 5 years ago

We discussed this today in Fixit Friday and agreed that this would be useful in other parts of Elasticsearch, and that it is something we want to pursue.

We still need to discuss with the @elastic/machine-learning team whether they are agreeable to moving this code from ml to a more common place in the source tree (which would possibly require a re-license). We also need to discuss how to maintain the list of top level domains.

droberts195 commented 5 years ago

> We also need to discuss how to maintain the list of top level domains.

One option would be to work off the public suffix data file instead of the compressed version embedded in the code. We could ship public_suffix_list.dat as a resource file and parse it at startup. Then updating it would simply become a case of updating that file in the source tree. (Or we could ship it as a config file and parse it from the config directory if we wanted end users to be able to update it independent of a new release.)
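As a rough illustration of the "parse it at startup" idea, here is a minimal Python sketch that reads rules in the public_suffix_list.dat format and uses them to split a host name. It is a sketch under stated assumptions, not the Elasticsearch or ML code: `parse_rules`, `split_domain`, and the tiny `SAMPLE` excerpt are hypothetical names invented for this example, and the return shape is not claimed to match domainSplit().

```python
def parse_rules(text):
    """Parse public suffix list text into (rules, exception_rules)."""
    rules, exceptions = set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        rule = line.split()[0].lower()  # a rule ends at the first whitespace
        if rule.startswith("!"):
            exceptions.add(rule[1:])
        else:
            rules.add(rule)
    return rules, exceptions


def split_domain(host, rules, exceptions):
    """Return (subdomain, registered_domain); both empty if host is a public suffix."""
    labels = host.lower().rstrip(".").split(".")
    suffix_len = 1  # default rule "*": the last label alone is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        n = len(labels) - i
        if candidate in exceptions:
            suffix_len = n - 1  # an exception rule wins; drop its leftmost label
            break
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in rules or wildcard in rules:
            suffix_len = max(suffix_len, n)  # the longest matching rule wins
    if len(labels) <= suffix_len:
        return "", ""
    registered = ".".join(labels[-(suffix_len + 1):])
    subdomain = ".".join(labels[:-(suffix_len + 1)])
    return subdomain, registered


# Assumption: a tiny hand-written excerpt in the .dat format, not the real file.
SAMPLE = """
// comment lines start with two slashes
com
co
uk
co.uk
*.ck
!www.ck
"""

rules, exceptions = parse_rules(SAMPLE)
print(split_domain("www.example.co.uk", rules, exceptions))  # ('www', 'example.co.uk')
print(split_domain("a.b.foo.ck", rules, exceptions))         # ('a', 'b.foo.ck')
print(split_domain("www.ck", rules, exceptions))             # ('', 'www.ck')
```

With this shape, updating the list means replacing the text passed to `parse_rules`, whether that text comes from a bundled resource or from a file in a config directory.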

We actually had some C++ code to do this in a previous product - I'll dig it out for you.

elasticmachine commented 5 years ago

Pinging @elastic/es-core-features

mbudge commented 4 years ago

The public suffix file is the best way to get the top level domain, subdomain, registered domain, root domain and last but not least the domain.

elasticsearchmachine commented 3 days ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine commented 3 days ago

Pinging @elastic/es-data-management (Team:Data Management)