Parsing all IRR data

SichangHe commented 1 year ago

After tweaking and fixing many errors, we have this log from parsing all IRR data at the lowest verbosity: parse_all_log.txt

The input is 6.9G, the output 332M.

SichangHe commented 1 year ago

Interestingly, most of the DBs are encoded in either UTF-8 or ASCII, but some are detected as a Latin-1-like Windows code page (windows-1250, windows-1252, or windows-1254). One APNIC file, apnic.db.inetnum, is detected as GBK.

```
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.limerick is encoded in UTF-8 after reading all 1906 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/panix.db is encoded in UTF-8 after reading all 10770 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/lacnic.db is encoded in UTF-8 after reading 20480 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.role is encoded in UTF-8 after reading 131072 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/openface.db is encoded in UTF-8 after reading all 8959 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/nestegg.db is encoded in UTF-8 after reading all 2292 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.mntner is encoded in UTF-8 after reading 150528 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/canarie.db is encoded in windows-1252 after reading 384000 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/ripe.db is encoded in windows-1254 after reading 151552 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.irt is encoded in UTF-8 after reading 308224 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/rgnet.db is encoded in UTF-8 after reading all 14454 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/radb.db is encoded in windows-1252 after reading 542720 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.key-cert is encoded in UTF-8 after reading all 741415 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.rtr-set is encoded in UTF-8 after reading all 1572 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.as-block is encoded in UTF-8 after reading all 126873 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.as-set is encoded in UTF-8 after reading 296960 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.route-set is encoded in UTF-8 after reading all 176925 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.inetnum is encoded in GBK after reading 570368 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.filter-set is encoded in UTF-8 after reading all 5056 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/tc.db is encoded in UTF-8 after reading 223232 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.organisation is encoded in windows-1250 after reading 758784 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/jpirr.db is encoded in UTF-8 after reading 43008 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/idnic.db is encoded in UTF-8 after reading 3016704 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/altdb.db is encoded in UTF-8 after reading 3811328 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/arin.db is encoded in UTF-8 after reading 806912 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/bell.db is encoded in UTF-8 after reading all 4246874 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/host.db is encoded in UTF-8 after reading all 895 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.aut-num is encoded in UTF-8 after reading 533504 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/bboi.db is encoded in windows-1252 after reading 208896 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/afrinic.db is encoded in UTF-8 after reading 634880 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.peering-set is encoded in UTF-8 after reading all 12457 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.inet-rtr is encoded in UTF-8 after reading all 6845 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.route is encoded in windows-1252 after reading 11406336 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/reach.db is encoded in UTF-8 after reading all 16012113 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/level3.db is encoded in UTF-8 after reading 17600512 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.route6 is encoded in UTF-8 after reading 23914496 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/apnic.db.domain is encoded in UTF-8 after reading all 90153400 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/apnic.db.inet6num is encoded in UTF-8 after reading 624640 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/nttcom.db is encoded in UTF-8 after reading 162819072 bytes.
```
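To check the "most are UTF-8" impression against the log, a small tally like the following might help. It is only a sketch using the standard library: `parse_line` and `tally_encodings` are hypothetical helper names, not part of `route_policy_cmp`, and it assumes the log has one entry per line in the "Guessing that … / Detected that …" wording shown in this comment.

```rust
use std::collections::BTreeMap;

/// Extract `(path, encoding)` from one log line of the form
/// "... that <path> is encoded in <encoding> after reading ...".
/// Returns `None` for lines that do not match. (Hypothetical helper.)
fn parse_line(line: &str) -> Option<(&str, &str)> {
    // Both "Guessing that" and "Detected that" contain " that ".
    let (_, rest) = line.split_once(" that ")?;
    let (path, rest) = rest.split_once(" is encoded in ")?;
    let (encoding, _) = rest.split_once(" after reading ")?;
    Some((path, encoding))
}

/// Count how many files were reported for each encoding. (Hypothetical helper.)
fn tally_encodings(log: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for (_, encoding) in log.lines().filter_map(parse_line) {
        *counts.entry(encoding).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Two entries copied from the log in this comment.
    let log = "\
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Guessing that ../data/irrs/host.db is encoded in UTF-8 after reading all 895 bytes.
[2023-06-15T12:06:41Z DEBUG route_policy_cmp] Detected that ../data/irrs/radb.db is encoded in windows-1252 after reading 542720 bytes.";
    for (encoding, count) in tally_encodings(log) {
        println!("{encoding}: {count}");
    }
}
```

Feeding it the full log text instead of the two-line sample would give the per-encoding file counts.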

I have implemented encoding detection so that all decoding should be correct.
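For readers who want the general shape of "pick an encoding, then decode", here is a much simplified sketch. It is not the detector used in this project: it only tries strict UTF-8 and otherwise falls back to ISO-8859-1, and `decode_with_fallback` is a hypothetical name. ISO-8859-1 differs from windows-1252 in the 0x80 to 0x9F range, and GBK or the other Windows code pages seen in the log would need a real decoder (for example from the `encoding_rs` crate), so treat this as an illustration only.

```rust
/// Decode `bytes` as UTF-8 when they are valid UTF-8; otherwise fall back to
/// ISO-8859-1, where each byte maps to the Unicode code point of the same value.
/// Returns the decoded text and the label of the encoding that was used.
/// (Hypothetical helper; a simplification, not the project's detection logic.)
fn decode_with_fallback(bytes: &[u8]) -> (String, &'static str) {
    match std::str::from_utf8(bytes) {
        Ok(text) => (text.to_owned(), "UTF-8"),
        // `u8 as char` yields U+0000..=U+00FF, which is exactly ISO-8859-1.
        Err(_) => (bytes.iter().map(|&b| b as char).collect(), "ISO-8859-1"),
    }
}

fn main() {
    // "café" encoded as UTF-8 and as ISO-8859-1, respectively.
    for bytes in [&b"caf\xc3\xa9"[..], &b"caf\xe9"[..]] {
        let (text, encoding) = decode_with_fallback(bytes);
        println!("{text} ({encoding})");
    }
}
```

The fallback never fails because every byte is a valid ISO-8859-1 character, which is also why it cannot tell the Windows code pages apart; a statistical detector is needed for that.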

SichangHe commented 1 year ago

Here is the updated log after always decoding with the correct encoding: parse_all_log.txt

SichangHe / internet_route_verification

Parsing all IRR data #17