.each method always returns empty Enumerable for GeoIPCity.dat database

cjheath / geoip

The Ruby gem for querying Maxmind.com's GeoIP database, which returns the geographic location of a server given its IP address

http://geoip.rubyforge.org/

GNU Lesser General Public License v2.1

713 stars 83 forks source link

.each method always returns empty Enumerable for GeoIPCity.dat database #62

Open mtowers opened 8 years ago

mtowers commented 8 years ago

Using a recently downloaded GeoIPCity.dat legacy database, calling the each method always returns an Enumerable with zero elements. However, if you use the GeoLiteCity.dat database, a non-empty Enumerable is returned.

#!/usr/bin/env ruby

require 'geoip'  # v 1.6.1

@maxmind = GeoIP.new("./GeoLiteCity.dat")

puts "Cities found in GeoLiteCity.dat:  #{@maxmind.each.count}"

@maxmind = GeoIP.new("./GeoIPCity.dat")

puts "Cities found in GeoIPCity.dat:  #{@maxmind.each.count}"

Cities found in GeoLiteCity.dat:  12098
Cities found in GeoIPCity.dat:  0

mtowers commented 8 years ago

FWIW, the .each_by_ip method does return a populated Enumerable.

cjheath commented 8 years ago

Sorry I haven't had any time to investigate yet. Have you looked in the code? There's not much there. PR welcome.

mtowers commented 8 years ago

I'm still poking on it but I've not been able to isolate the exact problem.

mtowers commented 8 years ago

I'm still not sure why, but the empty Enumerable is returned because the first call to read_city is passing an invalid parameter to atomic_read for the file offset. The offset being passed is greater than the length of the file. This condition manifests with a copy of GeoIPCity.dat downloaded from MaxMind on 8/3/16. It does not manifest with my copy of GeoLiteCity.dat, downloaded 8/9/16.

record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)

Here's what I'm seeing in my debugger:

    762:   def read_city(offset, hostname = '', ip = '') #:nodoc:
    763: require 'pry-byebug'
    764: binding.pry
 => 765:     return nil if offset == 0
    766:     record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
    767:     return unless (record && record.size == FULL_RECORD_LENGTH)
    768: 
    769:     # The country code is the first byte:
    770:     code = record[0]

[1] pry(#<GeoIP>)> offset
=> 10286227
[2] pry(#<GeoIP>)> index_size
=> 61717356
[3] pry(#<GeoIP>)> offset+index_size
=> 72003583
[4] pry(#<GeoIP>)> @file.size
=> 66865539

mtowers commented 8 years ago

One more clue. Changing line 764 in read_city from:

record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)

to:

record = atomic_read(FULL_RECORD_LENGTH, @record_length*2*offset)

will allow the code to traverse some number of city records, although some fields are not populated correctly.

mtowers commented 8 years ago

@cjheath Without a spec of the binary format of the database, I'm not sure how to proceed further diagnosing this problem. I contacted MaxMind and asked if a spec was available and apparently there is not. They referred me to this project as a reference but I've not been able to discover anything new from looking at it. https://github.com/appliedsec/pygeoip

Any ideas?

cjheath commented 8 years ago

I originally wrote the geoip by following this https://github.com/maxmind/geoip-api-c. I assume they will have updated it to work with whatever they've done to the file format, so that would be my first port of call. I won't be able to do it this week or next, however.

cjheath commented 8 years ago

_each_byip does a recursive descent of the index tree, which starts at the start of the file and ends at @database_segments[0] - see the _indexlength method. The index tree is a binary search tree with nodes containing two offset values, each of @record_length (3 or 4) bytes.

each on the other hand walks through the city records which is presumed to start one byte directly after the index tree.

The end of the file has a number of 4-byte values ending with a 3-byte value. One of these four-byte values contains '\xFF\xFF\xFF' and a byte containing the database type. We search back from the end of the file to find this type number, which initialises the file format configuration variables. Note that this is not documented; it's just what I copied from the early C library code.

Given this, it seems most likely that the fix will come in detect_database_type!. See if you can tell what's going on from what I described here.

cjheath commented 8 years ago

Possibly one or the other four-byte configuration values in the tail of the file contains alternate instructions about how to locate the city records, and they've started making use of that?

mtowers commented 8 years ago

Finally circling back to this...

It also appears that the each method is also not fully iterating over all city records in the GeoLiteCity.dat database(gzip). The each method prints progress every 1000 records. When I call that method, the last output I see is 25000: 3625134, which implies < 26,000 records iterated over.

However, if you look at the CSV city list(zip), it contains 811,675 records.