Open mtowers opened 8 years ago
FWIW, the .each_by_ip method does return a populated Enumerable.
Sorry I haven't had any time to investigate yet. Have you looked in the code? There's not much there. PR welcome.
I'm still poking at it, but I've not been able to isolate the exact problem.
I'm still not sure why, but the empty Enumerable is returned because the first call to read_city is passing an invalid parameter to atomic_read for the file offset. The offset being passed is greater than the length of the file. This condition manifests with a copy of GeoIPCity.dat downloaded from MaxMind on 8/3/16. It does not manifest with my copy of GeoLiteCity.dat, downloaded 8/9/16.
record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
Here's what I'm seeing in my debugger:
762: def read_city(offset, hostname = '', ip = '') #:nodoc:
763: require 'pry-byebug'
764: binding.pry
=> 765: return nil if offset == 0
766: record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
767: return unless (record && record.size == FULL_RECORD_LENGTH)
768:
769: # The country code is the first byte:
770: code = record[0]
[1] pry(#<GeoIP>)> offset
=> 10286227
[2] pry(#<GeoIP>)> index_size
=> 61717356
[3] pry(#<GeoIP>)> offset+index_size
=> 72003583
[4] pry(#<GeoIP>)> @file.size
=> 66865539
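The numbers from this debugger session can be checked with a few lines of Ruby. This is only a sketch: `readable_range?` is a hypothetical helper, not part of the library, and the three values are copied from the pry output above.

```ruby
# Hypothetical guard (not part of the geoip code): true when a read of
# `length` bytes starting at `position` stays inside a file of `file_size` bytes.
def readable_range?(position, length, file_size)
  position >= 0 && length >= 0 && position + length <= file_size
end

# Values copied from the pry session.
offset     = 10_286_227
index_size = 61_717_356
file_size  = 66_865_539

position = offset + index_size
puts position                                 # 72003583
puts position - file_size                     # 5138044 bytes past the end of the file
puts readable_range?(position, 1, file_size)  # false
```

A guard like this in front of the read would at least make the failure explicit instead of silently producing an empty result.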
One more clue. Changing line 764 in read_city from:
record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
to:
record = atomic_read(FULL_RECORD_LENGTH, @record_length*2*offset)
will allow the code to traverse some number of city records, although some fields are not populated correctly.
@cjheath Without a spec of the binary format of the database, I'm not sure how to proceed further diagnosing this problem. I contacted MaxMind and asked if a spec was available and apparently there is not. They referred me to this project as a reference but I've not been able to discover anything new from looking at it. https://github.com/appliedsec/pygeoip
Any ideas?
I originally wrote geoip by following https://github.com/maxmind/geoip-api-c. I assume they will have updated it to work with whatever they've done to the file format, so that would be my first port of call. I won't be able to do it this week or next, however.
each_by_ip does a recursive descent of the index tree, which starts at the start of the file and ends at @database_segments[0] - see the _indexlength method. The index tree is a binary search tree with nodes containing two offset values, each of @record_length (3 or 4) bytes.
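To make the node layout described above concrete, here is a minimal sketch of decoding one index node. `read_node` is a hypothetical helper, not the library's code, and the little-endian byte order is an assumption that the thread does not confirm.

```ruby
# Hypothetical helper: decode index node number `node_index` from `data`
# into its two child values. Each value is `record_length` bytes wide.
# ASSUMPTION: values are stored as little-endian unsigned integers.
def read_node(data, node_index, record_length)
  node_size = record_length * 2
  node = data.byteslice(node_index * node_size, node_size)
  return nil unless node && node.bytesize == node_size

  # Decode the left value, then the right value.
  [0, record_length].map do |start|
    node.byteslice(start, record_length).bytes
        .each_with_index.sum { |byte, i| byte << (8 * i) }
  end
end

# Two synthetic nodes with 3-byte values.
data = [1, 0, 0, 2, 0, 0, 0x10, 0x27, 0, 3, 0, 0].pack('C*')
p read_node(data, 0, 3)  # [1, 2]
p read_node(data, 1, 3)  # [10000, 3]
p read_node(data, 2, 3)  # nil (past the end of the data)
```

Note that the byte position of a node is `record_length * 2 * node_index`, the same shape as the expression tried in the experiment above.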
each, on the other hand, walks through the city records, which are presumed to start one byte directly after the index tree.
The end of the file has a number of 4-byte values ending with a 3-byte value. One of these four-byte values contains '\xFF\xFF\xFF' and a byte containing the database type. We search back from the end of the file to find this type number, which initialises the file format configuration variables. Note that this is not documented; it's just what I copied from the early C library code.
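The backward search described above could be sketched roughly as follows. This is not the library's detect_database_type! code: `find_database_type` and `MAX_TAIL_BYTES` are hypothetical names, and the 20-byte search window is an assumption chosen only for illustration.

```ruby
require 'stringio'

MARKER = "\xFF\xFF\xFF".b
MAX_TAIL_BYTES = 20 # ASSUMPTION: how far back from the end of the file to look

# Hypothetical helper: look near the end of `io` for the three 0xFF bytes
# and return the byte that follows them (the database type), or nil.
def find_database_type(io, max_tail = MAX_TAIL_BYTES)
  size = io.size
  tail_len = [size, max_tail].min
  io.seek(size - tail_len)
  tail = io.read(tail_len).b

  # Walk backwards so the marker closest to the end of the file wins.
  (tail.bytesize - 4).downto(0) do |i|
    return tail.getbyte(i + 3) if tail.byteslice(i, 3) == MARKER
  end
  nil
end

# Synthetic data: padding, marker, type byte 2, then three trailing bytes.
sample = ("\x00" * 32 + "\xFF\xFF\xFF\x02" + "\x01\x02\x03").b
p find_database_type(StringIO.new(sample))  # 2
```

The same method should work on a `File` opened in binary mode, since it only relies on `size`, `seek`, and `read`.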
Given this, it seems most likely that the fix will come in detect_database_type!. See if you can tell what's going on from what I described here.
Possibly one or another of the four-byte configuration values in the tail of the file contains alternate instructions about how to locate the city records, and they've started making use of that?
Finally circling back to this...
It also appears that the each method is not fully iterating over all city records in the GeoLiteCity.dat database (gzip). The each method prints progress every 1000 records. When I call that method, the last output I see is 25000: 3625134, which implies < 26,000 records iterated over.
However, if you look at the CSV city list (zip), it contains 811,675 records.
Using a recently downloaded GeoIPCity.dat legacy database, calling the each method always returns an Enumerable with zero elements. However, if you use the GeoLiteCity.dat database, a non-empty Enumerable is returned.