Open LucaWintergerst opened 5 years ago
Pinging @elastic/es-core-infra
We discussed this today in Fixit Friday and agreed that this would be useful in other parts of Elasticsearch, and something that we want to pursue.
We still need to discuss with the @elastic/machine-learning team whether they are agreeable to moving this code from ml to a more common place in the source tree (which may also require a re-license). We also need to discuss how to maintain the list of top level domains.
We also need to discuss how to maintain the list of top level domains.
One option would be to work off the public suffix data file instead of the compressed version embedded in the code. We could ship public_suffix_list.dat as a resource file and parse it at startup. Then updating it would simply become a case of updating that file in the source tree. (Or we could ship it as a config file and parse it from the config directory if we wanted end users to be able to update it independent of a new release.)
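To make the "parse it at startup" idea concrete, here is a minimal sketch in Java of reading the public suffix list format. It is not the existing ML code; the class name `PublicSuffixRules` and its fields are hypothetical. It assumes only the published file format: `//` starts a comment, each rule runs up to the first whitespace, `*.` marks a wildcard rule and `!` marks an exception rule.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical holder for the three kinds of rules in public_suffix_list.dat.
final class PublicSuffixRules {
    final Set<String> exact = new HashSet<>();     // e.g. "com", "co.uk"
    final Set<String> wildcard = new HashSet<>();  // "*.ck" is stored as "ck"
    final Set<String> exception = new HashSet<>(); // "!www.ck" is stored as "www.ck"

    // Parses the lines of the data file; how the lines are loaded
    // (classpath resource or config directory) is left to the caller.
    static PublicSuffixRules parse(List<String> lines) {
        PublicSuffixRules rules = new PublicSuffixRules();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("//")) {
                continue; // skip blank lines and comments
            }
            // A rule ends at the first whitespace; anything after it is ignored.
            String rule = line.split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
            if (rule.startsWith("!")) {
                rules.exception.add(rule.substring(1));
            } else if (rule.startsWith("*.")) {
                rules.wildcard.add(rule.substring(2));
            } else {
                rules.exact.add(rule);
            }
        }
        return rules;
    }
}
```

The lines could come from `Files.readAllLines(...)` for a config file or from a classpath resource stream. Either way, updating the list would only mean replacing the data file.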
We actually had some C++ code to do this in a previous product - I'll dig it out for you.
Pinging @elastic/es-core-features
The public suffix file is the best way to get the top level domain, subdomain, registered domain, root domain and last but not least the domain.
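As a rough illustration of how those parts fall out of the suffix list, here is a simplified Java sketch. It is not the existing domainSplit() implementation: the class `DomainSplitSketch` and its `split` method are hypothetical, the suffix set is passed in by the caller, and wildcard and exception rules are ignored for brevity. It picks the longest matching suffix, falls back to the last label when nothing matches, and treats the suffix plus one more label as the registered domain.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; handles exact suffix rules only.
final class DomainSplitSketch {

    // Returns {subdomain, registeredDomain, publicSuffix}.
    static String[] split(String host, Set<String> suffixes) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Default when no rule matches: the last label is the suffix.
        int suffixStart = labels.length - 1;
        // Candidates run from longest to shortest, so the first hit is the longest match.
        for (int i = 0; i < labels.length; i++) {
            if (suffixes.contains(join(labels, i, labels.length))) {
                suffixStart = i;
                break;
            }
        }
        String suffix = join(labels, suffixStart, labels.length);
        if (suffixStart == 0) {
            // The host is itself a suffix, so there is no registered domain.
            return new String[] {"", "", suffix};
        }
        String registered = labels[suffixStart - 1] + "." + suffix;
        String subdomain = join(labels, 0, suffixStart - 1);
        return new String[] {subdomain, registered, suffix};
    }

    private static String join(String[] labels, int from, int to) {
        return String.join(".", Arrays.copyOfRange(labels, from, to));
    }
}
```

For example, with the suffixes `com`, `uk` and `co.uk`, `split("www.example.co.uk", ...)` yields the subdomain `www`, the registered domain `example.co.uk` and the suffix `co.uk`. A naive "last two labels" split would wrongly report `co.uk` as the registered domain, which is why the suffix list is needed.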
Pinging @elastic/es-core-infra (Team:Core/Infra)
Pinging @elastic/es-data-management (Team:Data Management)
Describe the feature: The domainSplit() Painless method splits domains into their parts (subdomain, tld, ...). It was first introduced when Machine Learning was integrated into Elasticsearch, and was exposed in scripted fields so that ML jobs could use that information. However, this functionality is also incredibly useful as part of ingest. No other part of our stack has a substitute for it (apart from Packetbeat, which does something similar by default). There's also no good workaround: the public suffix list is required to do "good" domain splitting, and scripted fields alone cannot be used in many parts of Kibana. Furthermore, there's likely also a small performance hit from using scripted fields.
@rjernst and @polyfractal discussed this briefly and agreed that it makes sense to have.
One remaining question to work out is if it also makes sense to have this available in scripted aggregations.