Open astef opened 1 year ago
Thanks @astef - will take a look
@ArctypeZach Let's take a look.
@astef sorry for the delay. I think I've written a new query that should yield much more accurate results:
with
toUInt64(ip_range_start) as int_start,
bitXor(ip_range_start, ip_range_end) as
xor,
bin(xor) as xor_binary,
if(xor != 0,ceil(log2(xor)),0) as unmatched,
range(32 - toUInt8(unmatched) + 1, 33, 1) as cidrs,
arrayMap(x -> (toUInt64(pow(2, 32 - x))), cidrs) as cidr_length,
arrayJoin(arrayZip(cidrs, cidr_length)) as cidr_info,
cidr_info.1 as suffix,
cidr_info.2 as length,
sum(length) over w as block_offset,
toIPv4(int_start + block_offset) as next_block_start,
toIPv4(toUInt64(next_block_start::UInt64 -1)) as block_end,
toIPv4(bitAnd(bitNot(pow(2, 32 - suffix)),next_block_start)::UInt64) as block_start
select
ip_range_start,
ip_range_end,
block_start,
block_end,
suffix,
concat(toString(block_start),'/',toString(suffix)) as cidr
from
geoip_url
window w as (partition by ip_range_start order by suffix asc)
qualify toUInt64(block_end) <= toUInt64(ip_range_end)
Using one of your addresses from above, 169.136.101.0 -> 169.136.101.254
, we can see some better results:
┌─ip_range_start─┬─ip_range_end────┬─block_start─────┬─block_end───────┬─suffix─┬─cidr───────────────┐
1. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.0 │ 169.136.101.127 │ 25 │ 169.136.101.0/25 │
2. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.128 │ 169.136.101.191 │ 26 │ 169.136.101.128/26 │
3. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.192 │ 169.136.101.223 │ 27 │ 169.136.101.192/27 │
4. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.224 │ 169.136.101.239 │ 28 │ 169.136.101.224/28 │
5. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.240 │ 169.136.101.247 │ 29 │ 169.136.101.240/29 │
6. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.248 │ 169.136.101.251 │ 30 │ 169.136.101.248/30 │
7. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.252 │ 169.136.101.253 │ 31 │ 169.136.101.252/31 │
8. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.254 │ 169.136.101.254 │ 32 │ 169.136.101.254/32 │
└────────────────┴─────────────────┴─────────────────┴─────────────────┴────────┴────────────────────┘
This would appear to line up with the ARIN CIDR Calculator as well:
wdyt? I can adjust the blog post if we feel that this is now correct.
One addendum - to address the original reproduction. I loaded the contents of this revised query into the same dictionary.
select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4)) as country
Now yields
┌─country─┐
1. │ HK │
└─────────┘
I'm referring to this article: https://clickhouse.com/blog/geolocating-ips-in-clickhouse-and-grafana
Steps
ip_trie
dictionary is ready to be used. An important part is that source data includes the following lines (currently is does, but it may change):select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4))
Result:
SG
Expected results
HK
Investigation
With this query:
We can see that in the first row's
cidr_
columns we've lost the information about where exactly theSG
range ends. It may look expected, since the next ip address range may be more specific and be something like169.136.107.255/32
, and everything will work, but in reality, the next ip range spans over a bigger range too.So the second line has lost information about where the second range starts too.
From this point, all further steps will not be able to recover that information, and it's correct to say that results for such ranges will not be determined.
This problem exists not only for ranges with last byte breaks, and not only for 1 invalid IP address. It's possible to propose a theoretical example, when it will produce much bigger harm:
In this case, the entire
169.136.255.0/24
subnet was assigned to the wrong, and potentially very distant location.