ClickHouse / clickhouse-docs

Official documentation for ClickHouse
https://clickhouse.com/docs
Other
117 stars 278 forks source link

Approach to IP-based Geolocation based on CIDR ranges is wrong #2304

Open astef opened 1 year ago

astef commented 1 year ago

I'm referring to this article: https://clickhouse.com/blog/geolocating-ips-in-clickhouse-and-grafana

Steps

  1. Let's follow an author in everything up to the point, when ip_trie dictionary is ready to be used. An important part is that source data includes the following lines (currently is does, but it may change):

    169.136.101.0,169.136.101.254,SG,,,Singapore (Orchard),,1.30283,103.834, 169.136.101.255,169.136.107.255,HK,Central and Western,,Hong Kong,,22.3193,114.169,

  2. Execute: select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4))

Result:

SG

Expected results

HK

Investigation

With this query:

select 
    ip_range_start,
    ip_range_end,
    bitXor(ip_range_start, ip_range_end) as xor,
    bin(xor) as xor_binary,
    if(xor != 0, ceil(log2(xor)), 0) as unmatched,
    32-unmatched as cidr_suffix
from    geoip_url
where ip_range_start = '169.136.101.0'::IPv4 or ip_range_start = '169.136.101.255'::IPv4
169.136.101.0    169.136.101.254    254     11111110            8.0     24.0    169.136.101.0
169.136.101.255  169.136.107.255    3584    0000111000000000    12.0    20.0    169.136.96.0

We can see that in the first row's cidr_ columns we've lost the information about where exactly the SG range ends. It may look expected, since the next ip address range may be more specific and be something like 169.136.107.255/32, and everything will work, but in reality, the next ip range spans over a bigger range too.

So the second line has lost information about where the second range starts too.

From this point, all further steps will not be able to recover that information, and it's correct to say that results for such ranges will not be determined.

This problem exists not only for ranges with last byte breaks, and not only for 1 invalid IP address. It's possible to propose a theoretical example, when it will produce much bigger harm:

> ip_range_start   ip_range_end     xor     xor_binary                unmatched   cidr_suffix   cidr_address
> 169.136.0.0      169.136.254.255  65279   1111111011111111          16.0        16.0          169.136.0.0
> 169.136.255.0    169.140.255.255  262399  000001000000000011111111  19.0        13.0          169.136.0.0

In this case, the entire 169.136.255.0/24 subnet was assigned to the wrong, and potentially very distant location.

ArctypeZach commented 1 year ago

Thanks @astef - will take a look

alexey-milovidov commented 6 months ago

@ArctypeZach Let's take a look.

ArctypeZach commented 6 months ago

@astef sorry for the delay. I think I've written a new query that should yield much more accurate results:

with 
    toUInt64(ip_range_start) as int_start,
    bitXor(ip_range_start, ip_range_end) as
    xor,
    bin(xor) as xor_binary,
    if(xor != 0,ceil(log2(xor)),0) as unmatched,
    range(32 - toUInt8(unmatched) + 1, 33, 1) as cidrs,
    arrayMap(x -> (toUInt64(pow(2, 32 - x))), cidrs) as cidr_length,
    arrayJoin(arrayZip(cidrs, cidr_length)) as cidr_info,
    cidr_info.1 as suffix,
    cidr_info.2 as length,
    sum(length) over w as block_offset,
    toIPv4(int_start + block_offset) as next_block_start,
    toIPv4(toUInt64(next_block_start::UInt64 -1)) as block_end,
    toIPv4(bitAnd(bitNot(pow(2, 32 - suffix)),next_block_start)::UInt64) as block_start
select
    ip_range_start,
    ip_range_end,
    block_start,
    block_end,
    suffix,
    concat(toString(block_start),'/',toString(suffix)) as cidr
from
    geoip_url 
window w as (partition by ip_range_start order by suffix asc)
qualify toUInt64(block_end) <= toUInt64(ip_range_end)

Using one of your addresses from above, 169.136.101.0 -> 169.136.101.254, we can see some better results:

   ┌─ip_range_start─┬─ip_range_end────┬─block_start─────┬─block_end───────┬─suffix─┬─cidr───────────────┐
1. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.0   │ 169.136.101.127 │     25 │ 169.136.101.0/25   │
2. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.128 │ 169.136.101.191 │     26 │ 169.136.101.128/26 │
3. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.192 │ 169.136.101.223 │     27 │ 169.136.101.192/27 │
4. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.224 │ 169.136.101.239 │     28 │ 169.136.101.224/28 │
5. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.240 │ 169.136.101.247 │     29 │ 169.136.101.240/29 │
6. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.248 │ 169.136.101.251 │     30 │ 169.136.101.248/30 │
7. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.252 │ 169.136.101.253 │     31 │ 169.136.101.252/31 │
8. │ 169.136.101.0  │ 169.136.101.254 │ 169.136.101.254 │ 169.136.101.254 │     32 │ 169.136.101.254/32 │
   └────────────────┴─────────────────┴─────────────────┴─────────────────┴────────┴────────────────────┘

This would appear to line up with the ARIN CIDR Calculator as well: Screenshot 2024-05-13 at 6 38 24 PM

wdyt? I can adjust the blog post if we feel that this is now correct.

ArctypeZach commented 6 months ago

One addendum - to address the original reproduction. I loaded the contents of this revised query into the same dictionary.

select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4)) as country

Now yields

   ┌─country─┐
1. │ HK      │
   └─────────┘