I can only second this; I was also very surprised by this behavior. Maybe an env variable to disable any outgoing traffic? In any case, this kind of traffic should be opt-in.
BTW, is there any other such traffic from ES? Phoning home, maybe?
Pinging @elastic/es-data-management (Team:Data Management)
Heya, thanks for opening an issue for this. We agree this can be pretty annoying to have to deal with, especially for those who make no use of the geo enrichment features.
I was also very surprised by this behavior. Maybe an env variable to disable any outgoing traffic?
You can disable the geo database download by using the ingest.geoip.downloader.enabled
cluster setting. This should keep it from affecting your download bandwidth any further.
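As a minimal sketch of that: the setting can live in elasticsearch.yml, and it is also a dynamic cluster setting, so it can be flipped at runtime through the cluster settings API.

```yaml
# elasticsearch.yml — opt out of the automatic GeoIP database download
ingest.geoip.downloader.enabled: false
```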
This feature in general seems a bit weird and very surprising. Here are a couple of observations
We have started a discussion around ways to make this easier for users. Some ideas we've had are to avoid downloading the database unless a geo ingest processor is present in the cluster state, or to download it only when the first one is added. Your suggestions here are heard loud and clear, especially the points about /tmp
and the Google URLs that shoulder most of the data transfer. I'm not sure how far we'd be able to take the 4th suggestion, but it's worth thinking about.
Anyone starting a basic Elasticsearch process is now downloading 40MB of a GeoIP database entirely in the background, potentially for a feature (GeoIP processing) they do not use.
In honesty, part of the reason we eagerly download the databases is because many solutions on top of Elastic do make regular use of the geo ingest plugins. From an end user perspective it is much easier to simply have the geo ingest features available at start up than to explicitly go through the steps required to ensure the geo databases are fully installed first on each deployment.
This is sort of a new-ish problem too. Historically this data could be included with a release and be used indefinitely with infrequent updates as needed, but MaxMind has since changed their license agreement to require regular updates. Hence the downloading service. Not trying to throw blame or anything, it's just the reality we live in now.
All that said, I think many would agree that it causes problems with enough deployments that it is worth looking into more quality of life improvements around the whole thing. Want to thank you again for your feedback here!
We had another discussion about this. If we only make the change to download the geoip databases when a pipeline with a geoip processor is created, it increases the risk of the pipeline being executed before the databases have been downloaded (we currently don't check or block on the databases, but since we download them at startup it's a fairly safe bet that they're there). So we talked about two additions to the change described above:
Change the pipeline code to wait up to n seconds (10 maybe?)
I was thinking about this more and I wonder if it would be better to have this wait applied to creating pipelines instead of just the execution. I assume that most integrations aren't sending data before the configuration is deployed. This could help close the gap between a successful response on pipeline creation and the GeoIP data being available at first setup.
would be better to have this wait applied to creating pipelines instead of just the execution
That would be good if the download is almost always fast. It would be bad if it often exceeded the request timeout (60s, I think?). I don't have any information about the download time distribution, though. The advantage of doing it on execution is that it buys us more time for the download without adding confusion over whether the pipeline was created. But obviously there are downsides, too.
I've added the configuration to be able to eagerly download the databases, and I'm now wondering whether that's good enough. What we have now satisfies the use cases discussed above. The only case that's not covered is case #4, and only if the user does not know about this change in behavior.
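For context, I believe the knob being referred to here is the ingest.geoip.downloader.eager.download setting — that exact name is my assumption, so check the docs for your version — and usage would look roughly like:

```yaml
# Assumed setting name; verify against your Elasticsearch version's documentation.
# true  = download the GeoIP databases at startup, as before
# false = defer the download until a geoip processor actually needs them
ingest.geoip.downloader.eager.download: true
```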
Most production infrastructure disables external web access, and applications (Elasticsearch in this case) must use a proxy to reach the internet. I recommend using a simple proxy like Squid with access logging set up (we are using ELK for this), so you can control and audit that access.
Or, much better, provision the database files yourself. With Kubernetes, create one PVC as ReadWriteOnce and a second as ReadOnlyMany with the first PVC as its data source, mount it at mountPath: /usr/share/elasticsearch/config/ingest-geoip, and disable the downloader:
# es-config.yaml
ingest.geoip.downloader.enabled: false
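For what it's worth, a rough sketch of that layout. The claim names (geoip-source, geoip-readonly) are hypothetical, and cloning one PVC from another requires a CSI StorageClass that supports it, so treat this as an illustration of the idea rather than a tested manifest:

```yaml
# Hypothetical claim names and storage class; adjust to your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geoip-source              # written once with the MaxMind .mmdb files
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geoip-readonly            # cloned from the first claim, shared read-only
spec:
  accessModes: ["ReadOnlyMany"]
  dataSource:
    kind: PersistentVolumeClaim
    name: geoip-source
  resources:
    requests:
      storage: 1Gi
# In the Elasticsearch pod spec:
#   volumeMounts:
#     - name: geoip
#       mountPath: /usr/share/elasticsearch/config/ingest-geoip
#       readOnly: true
#   volumes:
#     - name: geoip
#       persistentVolumeClaim:
#         claimName: geoip-readonly
```

The first claim would be populated once (for example by a one-off job that fetches the .mmdb files), and every Elasticsearch pod then mounts the read-only copy instead of reaching out to the internet.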
Description
At Intercom we test our app thousands of times per month, utilising 500 parallel jobs which run a subset of our tests, all of which run the same container. The container is a plain Docker one that we build and push to a private registry. For historical/performance reasons this image is fat, i.e. while its base image is Ruby, we install MySQL, Elasticsearch, Redis etc. on it too. The Elasticsearch we install is the official tar.gz release.
In Elasticsearch 7.14 a feature was added which would automatically download the MaxMind GeoIP database. Specifically, it downloads the database from Google Cloud after fetching this JSON object. This is enabled by default.
We discovered that this feature was responsible for around $20,000 on our AWS bill since we upgraded last March, and based on a rough approximation of the traffic we were charged for, it probably cost Elastic about $5,500 💸. Investigation was slow and required enabling AWS VPC flow logs to assess the NAT gateway traffic source. Because the source was so widespread (i.e. all our CI machines) we had to use tcpdump, which was of limited value due to HTTPS and only provided us with a generic hostname. Eventually some close observation led us to the cause.

This feature in general seems a bit weird and very surprising. Here are a couple of observations:

Relying on /tmp is fundamentally incompatible with Docker and drives up traffic: /tmp is wiped whenever a container starts fresh, so the database has to be downloaded all over again.

Above all, in order to do something like disable the updater, change the storage path for the database, or change the endpoint used to fetch it, I need to be aware this is happening in the first place, and I'm not sure that people who do not use the GeoIP features are aware.
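For readers who do want to keep the feature but control it, these are the downloader settings I'm aware of (setting names as I recall them from the Elasticsearch docs — verify for your version); the mirror URL below is purely illustrative:

```yaml
# elasticsearch.yml — downloader knobs (verify names for your version)
ingest.geoip.downloader.enabled: false   # turn the automatic download off entirely
# ...or keep it enabled but fetch from an endpoint you control, and poll less often:
# ingest.geoip.downloader.endpoint: "https://geoip-mirror.internal"   # illustrative URL
# ingest.geoip.downloader.poll.interval: "7d"
```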