Update IP address database for geocoding

javitonino commented 7 years ago

From https://github.com/CartoDB/bigmetadata/issues/359

As far as I can see (i.e., onprem), it looks our IP address database dates back to 2014. As per our current contract and licensing model, we should be able to update and get the newest one.

We don't have any specific requests to do this, I just think 3 years is a lot of time ;-)

cc @danicarrion @hannahblue

jaakla commented 6 years ago

Currently (with 8-Mar-18 GeoLite2 ) source data has about 4M rows, our DB has 3M, so 25% is 'missing'. Considerations:

It seems to be be quite easy to create new version to S3 and data-services database, old scripts are here: https://github.com/CartoDB/data-services/tree/master/geocoder/ip-addresses
Some table structure changes may have happened, some minor (?) tweaks may be needed
Maxmind is deprecating their open source table, they stop their updates by 1-Apr-2018, see announcement.
We should check their commercial offering (should have also better accuracy), and some competitors also (e.g. IP2Location™, https://www.digitalelement.com/ etc).
Current usage (kibana&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-',key:program,negate:!f,value:dataservices_api),query:(match:(program:(query:dataservices_api,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-',key:function_name,negate:!f,value:cdb_geocode_ipaddress_point),query:(match:(function_name:(query:cdb_geocode_ipaddress_point,type:phrase))))),index:'logstash-',interval:d,query:(query_string:(analyze_wildcard:!t,query:'')),sort:!('@timestamp',desc)))): last 30 days ~126K geocodes, very uneven, many days 0, max day 40K

juanignaciosl commented 6 years ago

@jaakla GeoLite2-City-Blocks-IPv4.csv has 3060528 rows (~3M), not ~4M. I downloaded it here as explained in the instructions. Am I looking at the wrong file?

jaakla commented 6 years ago

It may be I summed ipv4 and ipv6 tables, don't recall the detail :(. Strangely the 1 month old version and latest one have quite different numbers for both csv files:

// older:
:GeoLite2-City-CSV_20180306 jaak$ wc -l GeoLite2-City-Blocks-IPv4.csv
 2662733 GeoLite2-City-Blocks-IPv4.csv
:GeoLite2-City-CSV_20180306 jaak$ wc -l GeoLite2-City-Blocks-IPv6.csv
 2039865 GeoLite2-City-Blocks-IPv6.csv
// updated:
:GeoLite2-City-CSV_20180403 jaak$ wc -l GeoLite2-City-Blocks-IPv4.csv
 3060529 GeoLite2-City-Blocks-IPv4.csv
:GeoLite2-City-CSV_20180403 jaak$ wc -l GeoLite2-City-Blocks-IPv6.csv
  275449 GeoLite2-City-Blocks-IPv6.csv

juanignaciosl commented 6 years ago

Ok, I'll update the table with the new one.

juanignaciosl commented 6 years ago

Ok, I'm back into this after some weeks away, sorry about that.

The scripts and information at the repository seem completely outdated. For example, Creation steps lead to an IP4 table. Nevertheless, production geocoder database is IP6-only:

cdb_geocoder_server=# select family(network_start_ip) ff, count(1) from ip_address_locations group by ff;
 ff |  count
----+---------
  6 | 2963525
(1 row)

It works for IP4 addresses because geocode_ip function transforms incoming addresses:

https://github.com/CartoDB/data-services/blob/481dbdfe3ed9e2f1622f5855e0b326fb6108ddb2/geocoder/extension/sql/40_ipaddr.sql#L16-L22

In addition, instructions assume that geocoder DB is a CARTO user database. That's no longer true, as we have several clouds and the database is replicated.

I'll dig deeper and come up with a slightly more reproduceable way.

juanignaciosl commented 6 years ago

Ok, I've written down the updated process (see #240).

Some raw numbers:

-- New data
dataservices_db=# select count(1) from ip_address_locations;
  count
---------
 3335976
(1 row)

-- Old data
dataservices_db=# select count(1) from ip_address_locations_backup;
  count
---------
 2963525
(1 row)

-- IPs in both tables
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations intersect select network_start_ip from ip_address_locations_backup) x;
  count
---------
 1487680
(1 row)

-- New IPs
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations except select network_start_ip from ip_address_locations_backup) x;
  count
---------
 1848296
(1 row)

-- Lost IPs
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations_backup except select network_start_ip from ip_address_locations) x;
  count
---------
 1475845
(1 row)

Losing 1475845 IPs makes no sense. Let's take a look at what's going on here:

dataservices_db=# select * from (select network_start_ip from ip_address_locations_backup except select network_start_ip from ip_address_locations) x limit 20;
 network_start_ip
------------------
 ::ffff:1.0.5.0
 ::ffff:1.0.6.0
 ::ffff:1.0.129.0
 ::ffff:1.0.130.0
 ::ffff:1.0.133.0
 ::ffff:1.0.145.0
 ::ffff:1.0.146.0
 ::ffff:1.0.148.0
 ::ffff:1.0.149.0
 ::ffff:1.0.150.0
 ::ffff:1.0.152.0
 ::ffff:1.0.156.0
 ::ffff:1.0.158.0
 ::ffff:1.0.159.0
 ::ffff:1.0.162.0
 ::ffff:1.0.163.0
 ::ffff:1.0.164.0
 ::ffff:1.0.166.0
 ::ffff:1.0.167.0
 ::ffff:1.0.174.0
(20 rows)

I've reviewed more missing IPs and there's no clear pattern. Here's the dump: https://www.dropbox.com/s/nwy3hldqtl0ow7k/ip_address_locations.dump.zip?dl=0

If I'm not wrong, replacing the old file with the new one might mean losing almost half of the previously geocoded IPs.

@jaakla , what do you think? Am I missing something?

cc @ethervoid , just in case he sees something I don't.

juanignaciosl commented 6 years ago

I've tested a couple of the missing ones against a couple of free IP geocoder webpages. They return something similar to the old values.

~I think that returning "the most recent known location" is better than returning nothing, specially because that'd lead to regressions to some users (as some points would become null). I propose keeping the old values that are missing in the new table. ~Maybe a column with the year of the source can be added~ We have the creation date column, so we could eventually use it as a quality indicator. What do you think, @hannahblue ?~

juanignaciosl commented 6 years ago

I've changed my mind after seeing more data and taking into account that we don't only geocode by exact coincidence but by nearest subnetwork:

We could just update them and see. Most of the missing results will be handled by getting the nearest subnetwork. Examples:

-- New values 2.216.185* subnetwork
# select network_start_ip, st_astext(the_geom) from ip_address_locations where network_start_ip::text like '::ffff:2.216.185%';
   network_start_ip   |       st_astext
----------------------+------------------------
 ::ffff:2.216.185.0   | POINT(-0.1122 53.4904)
 ::ffff:2.216.185.128 | POINT(-0.7408 53.5774)

-- Old values 2.216.185* subnetwork
# select network_start_ip, st_astext(the_geom) from ip_address_locations_backup where network_start_ip::text like '::ffff:2.216.185%';
   network_start_ip   |       st_astext
----------------------+------------------------
 ::ffff:2.216.185.0   | POINT(-1.35 53.4333)
 ::ffff:2.216.185.192 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.128 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.160 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.184 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.176 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.182 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.180 | POINT(-1.35 53.4333)
 ::ffff:2.216.185.181 | POINT(-0.8167 51.3667)
(9 rows)

For example, this is the result of geocoding the following IPs:

2.216.185.0 --> -1.35, 53.4333 2.216.185.1 --> -1.35, 53.4333 2.216.185.128 --> -1.35, 53.4333 2.216.185.180 --> -1.35, 53.4333 2.216.185.181 --> -0.8167, 51.3667 2.216.185.182 --> -1.35, 53.4333

jaakla commented 6 years ago

I think it is ok plan, advantage is to have also clear source database version. It is not accurate database in any case and if someone uses then for statistical purposes where approximation is accepted.

For particular case I would consider -1.35 53.4333 more probable correct location than the others, just looking at map, but that's purely speculative.

I try to figure why ipv6 table decreased from March to April almost 10x (from 2M to 270K rows), I imagine they may have done some subnet "compressions".

p.s. this gave me idea to add one more metadata column to geocoded table: geocoding source data updated date, as with different base tables/services you may get different locations and this helps troubleshooting. For online geocoders this would be date of geocoding request. So whole geocoding metadata may have something like "provider: tomtom, datum: 05.10.2018, precision: rooftop". Just something to consider for https://github.com/CartoDB/cartodb/issues/12371

jaakla commented 6 years ago

The number of lost IP-s is similar to recent loss in specific city ipv6 location table, happened between March and April this year (less 1.7M: from ~2M to ~200K), so maybe this is the reason? Here is my copy of last 3 month data files. In this sense it may be valuable to use March data instead of newer April or May. In the other hand we would have own interpretation of maxmind then; we don't know reason of this data removal - maybe it was complete garbage (but the map does not suggest this), illegal data or whatnot.

The visual comparison suggests that older (March) table did have many more locations, it was not just some technical compression of subnets:

juanignaciosl commented 6 years ago

Thanks, @jaakla ! I'll use your copy of March data!

juanignaciosl commented 6 years ago

Repetition of my previous "Some raw numbers" comment. ip_address_locations is the new one with May ip4 and March ip6, and ip_address_locations_backup is the old one, so we can compare the effect of switching (see comments):

-- Total count of the new dataset
dataservices_db=# select count(1) from ip_address_locations;
  count
---------
 5113274
(1 row)

-- Previous dataset count
dataservices_db=# select count(1) from ip_address_locations_backup;
  count
---------
 2963525
(1 row)

-- Matching IPs. Using March is just an slightly improvement over May
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations intersect select network_start_ip from ip_address_locations_backup) x;
  count
---------
 1487905
(1 row)

-- A lot of new IPs
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations except select network_start_ip from ip_address_locations_backup) x;
  count
---------
 3625369
(1 row)

-- Again, a lot of loss
dataservices_db=# select count(1) from (select network_start_ip from ip_address_locations_backup except select network_start_ip from ip_address_locations) x;
  count
---------
 1475620
(1 row)

As your research showed, @jaakla , March dataset is likely to be better, but it doesn't fix the loss of IPs. Anyway, in order to do a proper comparation,

juanignaciosl commented 6 years ago

Production geocoders updated (CartoDB/data-services/pull/240).

CartoDB / data-services

Update IP address database for geocoding #231