I can only second this; I was also very surprised by this behavior. Maybe an env variable to disable any outgoing traffic? In any case, this kind of traffic should be opt-in.
BTW, is there any other such traffic from ES? Phoning home, maybe?
Pinging @elastic/es-data-management (Team:Data Management)
Heya, thanks for opening an issue for this. We agree this can be pretty annoying to have to deal with, especially for those who make no use of the geo enrichment features.
I was also very surprised by this behavior. Maybe an env variable to disable any outgoing traffic?
You can disable the geo database download by using the ingest.geoip.downloader.enabled
cluster setting. This should keep it from affecting your download bandwidth any further.
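As a minimal sketch of that: the setting can live in elasticsearch.yml, and it is also a dynamic cluster setting, so it can be flipped at runtime through the cluster settings API.

```yaml
# elasticsearch.yml — opt out of the automatic GeoIP database download
ingest.geoip.downloader.enabled: false
```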
This feature in general seems a bit weird and very surprising. Here are a couple of observations
We have started a discussion around ways to make this easier for users. Some ideas we've had are to avoid downloading the database unless a geo ingest processor is present in the cluster state, or to download it only when the first one is added. Your suggestions here are heard loud and clear, especially the points about /tmp
and the Google URLs that shoulder most of the data transfer. I'm not sure how far we'd be able to take the 4th suggestion, but it's worth thinking about.
Anyone starting a basic Elasticsearch process is now downloading 40MB of a GeoIP database entirely in the background, potentially for a feature (GeoIP processing) they do not use.
In honesty, part of the reason we eagerly download the databases is because many solutions on top of Elastic do make regular use of the geo ingest plugins. From an end user perspective it is much easier to simply have the geo ingest features available at start up than to explicitly go through the steps required to ensure the geo databases are fully installed first on each deployment.
This is sort of a new-ish problem too. Historically this data could be included with a release and be used indefinitely with infrequent updates as needed, but MaxMind has since changed their license agreement to require regular updates. Hence the downloading service. Not trying to throw blame or anything, it's just the reality we live in now.
All that said, I think many would agree that it causes problems with enough deployments that it is worth looking into more quality of life improvements around the whole thing. Want to thank you again for your feedback here!
We had another discussion about this. If we only make the change to download the geoip databases when a pipeline with a geoip processor is created, it increases the risk of the pipeline being executed before the databases have been downloaded (we currently don't check or block on the databases, but since we download them at startup it's a fairly safe bet that they're there). So we talked about two additions to the change described above:
Change the pipeline code to wait up to n seconds (10 maybe?)
I was thinking about this more and I wonder if it would be better to have this wait applied to creating pipelines instead of just the execution. I assume that most integrations aren't sending data before the configuration is deployed. This could help close the gap between a successful response on pipeline creation and the GeoIP data being available at first setup.
would be better to have this wait applied to creating pipelines instead of just the execution
That would be good if the download is almost always fast. It would be bad if it often exceeded the request timeout (60s, I think?). I don't have any information about the download time distribution, though. The advantage of doing it on execution is that it buys us more time for the download without adding confusion over whether the pipeline was created. But obviously there are downsides, too.
I've added the configuration to be able to eagerly download the databases, and I'm now wondering whether that's good enough. What we have now satisfies the use cases discussed above. The only case that's not covered is case #4, and only if the user does not know about this change in behavior.
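For context, I believe the knob being referred to here is the ingest.geoip.downloader.eager.download setting — that exact name is my assumption, so check the docs for your version — and usage would look roughly like:

```yaml
# Assumed setting name; verify against your Elasticsearch version's documentation.
# true  = download the GeoIP databases at startup, as before
# false = defer the download until a geoip processor actually needs them
ingest.geoip.downloader.eager.download: true
```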
Most production infrastructure disables external web access, and applications (Elasticsearch in this case) must use a proxy to reach the internet. I recommend using a simple proxy like Squid with access logging set up (we are using ELK for this), so you can control and audit that access.
Or, much better, provision the database files yourself. With Kubernetes, create one PVC as ReadWriteOnce and a second as ReadOnlyMany with the first PVC as its data source, mount it at mountPath: /usr/share/elasticsearch/config/ingest-geoip, and disable the downloader:
# es-config.yaml
ingest.geoip.downloader.enabled: false
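For what it's worth, a rough sketch of that layout. The claim names (geoip-source, geoip-readonly) are hypothetical, and cloning one PVC from another requires a CSI StorageClass that supports it, so treat this as an illustration of the idea rather than a tested manifest:

```yaml
# Hypothetical claim names and storage class; adjust to your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geoip-source              # written once with the MaxMind .mmdb files
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geoip-readonly            # cloned from the first claim, shared read-only
spec:
  accessModes: ["ReadOnlyMany"]
  dataSource:
    kind: PersistentVolumeClaim
    name: geoip-source
  resources:
    requests:
      storage: 1Gi
# In the Elasticsearch pod spec:
#   volumeMounts:
#     - name: geoip
#       mountPath: /usr/share/elasticsearch/config/ingest-geoip
#       readOnly: true
#   volumes:
#     - name: geoip
#       persistentVolumeClaim:
#         claimName: geoip-readonly
```

The first claim would be populated once (for example by a one-off job that fetches the .mmdb files), and every Elasticsearch pod then mounts the read-only copy instead of reaching out to the internet.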
Description
At Intercom we test our app thousands of times per month, utilising 500 parallel jobs which run a subset of our tests, all of which run the same container. The container is a plain Docker one that we build and push to a private registry. For historical/performance reasons this image is fat, i.e. while its base image is Ruby, we install MySQL, Elasticsearch, Redis etc. on it too. The Elasticsearch we install is the official tar.gz release.
In Elasticsearch 7.14 a feature was added which would automatically download the MaxMind GeoIP database. Specifically, it downloads the database from Google Cloud after fetching this JSON object. This is enabled by default.
We discovered that this feature was responsible for around $20,000 on our AWS bill since we upgraded last March, and based on a rough approximation of the traffic we were charged for, it probably cost Elastic about $5,500 💸. Investigation was slow and required enabling AWS VPC flow logs to assess the NAT gateway traffic source. Because the source was so widespread (i.e. all our CI machines) we had to use tcpdump, which was of limited value due to HTTPS and only provided us with a generic hostname. Eventually some close observation led us to the cause.

This feature in general seems a bit weird and very surprising. Here are a couple of observations:

Relying on /tmp is fundamentally incompatible with Docker and drives up traffic: /tmp is wiped whenever a container starts fresh, so the database has to be downloaded all over again.

Above all, in order to do something like disable the updater, change the storage path for the database, or change the endpoint used to fetch it, I need to be aware this is happening in the first place, and I'm not sure that people who do not use the GeoIP features are aware.
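For readers who do want to keep the feature but control it, these are the downloader settings I'm aware of (setting names as I recall them from the Elasticsearch docs — verify for your version); the mirror URL below is purely illustrative:

```yaml
# elasticsearch.yml — downloader knobs (verify names for your version)
ingest.geoip.downloader.enabled: false   # turn the automatic download off entirely
# ...or keep it enabled but fetch from an endpoint you control, and poll less often:
# ingest.geoip.downloader.endpoint: "https://geoip-mirror.internal"   # illustrative URL
# ingest.geoip.downloader.poll.interval: "7d"
```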