crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/database
Apache License 2.0
4.13k stars 568 forks source link

GeoIP Lookup function #12869

Open robd003 opened 2 years ago

robd003 commented 2 years ago

Use case: It would be really useful to be able to have a generated object with geolocation info automatically created when IP addresses are inserted into an IP field.

Feature description: It would be really cool if Crate could integrate the MaxMind GeoIP client with a way to update client databases without having to restart the entire Crate cluster. This is a pretty standard GeoIP database format that multiple vendors use. There are free databases available for users to download.

Example use case:

CREATE TABLE uri_table (
   ip_addr IP,
   geoip_obj OBJECT GENERATED ALWAYS AS geoip_lookup("ip_addr")
);
SELECT geoip_lookup('18.1.127.32') as geo;
+--------------------------------------------------------------------------------+
| geo                                                                            |
+--------------------------------------------------------------------------------+
| {"continent": "North America", "country": "United States", "city": "Cambridge",|
| "state": "Massachusetts", "postal": "02139"}                                   |
+--------------------------------------------------------------------------------+

Maybe it's worth having options for which fields you want included in the response. (Example: include or exclude lat / lon co-ordinates, ISP name, Organization name, etc)

seut commented 2 years ago

Doing a network lookup to the GeoIP2 (or GeoLite2) on every single written row could dramatically decrease performance. Also possible errors like timeout's, disconnects etc. must be handle safely.

A more performant and error-safe way would be to import the desired GeoIP database as described here and using a JOIN to resolve the ip_addr field. In contrast to the schema described in this guide, the network column should be of type text in order to use our << operator, see also our documentation of the IP data type https://crate.io/docs/crate/reference/en/5.0/general/ddl/data-types.html#ip-addresses.

Example query:

SELECT * FROM uri_table, geoip2_network WHERE uri_table.ip  << geoip2_network.network;

Can you please add a reasoning why a dynamic IP resolving during writes would be a better option for your use-case?

robd003 commented 2 years ago

Doing a network lookup to the GeoIP2 (or GeoLite2) on every single written row could dramatically decrease performance. Also possible errors like timeout's, disconnects etc. must be handle safely.

The idea is to not use the webservice. MaxMind has a downloadable database that you can perform lookups with. Multiple geolocation companies use this database format.

https://github.com/maxmind/GeoIP2-java#database-usage

seut commented 2 years ago

The idea is to not use the webservice. MaxMind has a downloadable database that you can perform lookups with. Multiple geolocation companies use this database format.

https://github.com/maxmind/GeoIP2-java#database-usage

Thanks for this clarification. Even that some of their databases are free to use by end-users, their usage is licensed under a proprietary license which, as far as I know, does not allow redistribution. For redistribution one would have to obtain a commercial redistribution license.

I could imagine to provide a geoip_lookup scalar by implementing a dedicated, possible 3rd party plugin for CrateDB, which doesn't include the related databases but adds support for using a custom downloaded one. CrateDB includes a ExtraFunctions extension/plugin one can use as a blueprint to implement a scalar plugin.

I'm still interested in the benefit over importing the related database into CrateDB, would be great if you could describe it.