Closed SichangHe closed 1 month ago
Overview of AS Set sizes (including pseudo sets).
In [2]: df = pd.read_csv('as_set_sizes.csv.gz')
In [3]: df
Out[3]:
as_set size
0 AS-399760-CLIENTS 1
1 AS-ISIONUK 2
2 c#203447 2
3 AS-NICPROXY 7
4 AS199221:AS-199221 2
... ... ...
71341 AS-BITON 1
71342 AS-30998 8
71343 AS268235:AS-CUSTOMERS 42
71344 AS-V6 16
71345 AS-SET-LEVEL-7-INTERNET 5
[71346 rows x 2 columns]
In [4]: df.describe()
Out[4]:
size
count 71346.000000
mean 439.580089
std 4461.051650
min 0.000000
25% 1.000000
50% 1.000000
75% 5.000000
max 93709.000000
Stats for pseudo sets and real sets:
In [5]: df_w_hash = df[df['as_set'].str.contains('#')]
In [6]: df_w_hash
Out[6]:
as_set size
2 c#203447 2
5 c#22987 5
6 c#395466 1
8 c#61605 2
9 m#AS-CBL-TRANSIT#MNT-BAHA 1
... ... ...
71309 m#AS-IXBR-TRANSIT4-SP#MAINT-AS28186 1
71318 c#59239 58
71326 c#212271 2
71337 c#7795 76
71339 c#46393 1
[17970 rows x 2 columns]
In [7]: df_wo_hash = df[~df['as_set'].str.contains('#')]
In [8]: df_wo_hash
Out[8]:
as_set size
0 AS-399760-CLIENTS 1
1 AS-ISIONUK 2
3 AS-NICPROXY 7
4 AS199221:AS-199221 2
7 AS28855:AS-CUSTOMERS 1
... ... ...
71341 AS-BITON 1
71342 AS-30998 8
71343 AS268235:AS-CUSTOMERS 42
71344 AS-V6 16
71345 AS-SET-LEVEL-7-INTERNET 5
[53376 rows x 2 columns]
In [9]: df_wo_hash.describe()
Out[9]:
size
count 53376.000000
mean 584.558847
std 5149.282015
min 0.000000
25% 1.000000
50% 2.000000
75% 6.000000
max 93709.000000
Sketch histogram.
AS Sets (86MiB): as_sets.txt.gz
It seems that the Zipf distribution fitting failed!?
I've never seen anything like that before. I'd check whether the parameters passed to the function are correct... if that's not it then it may be that the data doesn't fit a Zipf distribution in the end.
From: Steven Hé (Sīchàng) @.> Sent: Monday, January 8, 2024 9:54 AM To: SichangHe/internet_route_verification @.> Cc: Subscribed @.***> Subject: Re: [SichangHe/internet_route_verification] Flatten AS Sets & getting sizes (Issue #114)
It seems that the Zipf distribution fitting failed!?
IPython history.
In [1]: import pandas as pd pdf In [2]: df = pd.read_csv('as_set_sizes.csv.gz')
In [3]: df Out[3]: as_set size 0 AS-399760-CLIENTS 1 1 AS-ISIONUK 2 2 c#203447 2 3 AS-NICPROXY 7 4 AS199221:AS-199221 2 ... ... ... 71341 AS-BITON 1 71342 AS-30998 8 71343 AS268235:AS-CUSTOMERS 42 71344 AS-V6 16 71345 AS-SET-LEVEL-7-INTERNET 5
[71346 rows x 2 columns]
In [5]: from scipy.stats import zipf, fit
In [8]: res = fit(zipf, df["size"], [(1.0, 10.0)])
In [9]: res Out[9]: params: FitParams(a=1.5713559159360042, loc=0.0) success: False message: 'Optimization converged to parameter values that are inconsistent with the data.'
— Reply to this email directly, view it on GitHubhttps://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1880290704, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACPO5YOLIU7YF7N6XYVHPDYNNGWFAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBQGI4TANZQGQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>
… Not a Zipf.
Previous results were wrong due to the erroneous script (fixed in https://github.com/SichangHe/internet_route_verification/commit/e29d04dd57693a862efe401e023e33651b964a8b).
Results from corrected script, with depth: as_set_sizes2.csv.gz
In [79]: df_raw = pd.read_csv("as_set_sizes2.csv.gz")
In [80]: df = df_raw[~df_raw['as_set'].str.contains('#')]
In [81]: df
Out[81]:
as_set size depth
0 AS-399760-CLIENTS 1 0
1 AS-ISIONUK 2 0
3 AS-NICPROXY 7 0
4 AS199221:AS-199221 2 0
7 AS28855:AS-CUSTOMERS 1 0
... ... ... ...
71213 AS-BITON 1 0
71214 AS-30998 8 0
71215 AS268235:AS-CUSTOMERS 42 0
71216 AS-V6 16 0
71217 AS-SET-LEVEL-7-INTERNET 5 0
[53268 rows x 3 columns]
In [82]: df.describe()
Out[82]:
size depth
count 53268.000000 53268.000000
mean 648.176260 1.730194
std 5636.609586 13.263516
min 0.000000 0.000000
25% 1.000000 0.000000
50% 2.000000 0.000000
75% 6.000000 0.000000
max 95591.000000 299.000000
Distribution of sizes are almost unchanged, because the previous bug only impacted the few nested sets. As we can see now, most sets are flat.
In [119]: df.tail(260)
Out[119]:
as_set size depth
53008 AS3326:AS-UA 89580 40
53009 AS15562:AS-LEVEL42 1 41
53010 AS15562:AS-LEVEL43 1 42
53011 AS15562:AS-LEVEL44 1 43
53012 AS15562:AS-LEVEL45 1 44
... ... ... ...
53263 AS15562:AS-LEVEL296 1 295
53264 AS15562:AS-LEVEL297 1 296
53265 AS15562:AS-LEVEL298 1 297
53266 AS15562:AS-LEVEL299 1 298
53267 AS15562:AS-LEVEL300 1 299
[260 rows x 3 columns]
It seems that AS15562 just have a chain of AS Sets. Though, they are rather the anomaly.
The Zipf distribution fits after removing sets with no members: https://github.com/SichangHe/internet_route_verification_meta/commit/cf00b74cd7b06d3f4758aa0b29918adf35c618f5
In [122]: res = fit(zipf, df[df["size"] > 0]["size"], [(1.0, 10.0)])
...: print(res)
params: FitParams(a=1.498530833262867, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
Depths also fit: https://github.com/SichangHe/internet_route_verification_meta/commit/f3a25d00477ff28aae79cb59d38ad0dee383b66e.
In [126]: df = df_wo_hash[df_wo_hash["depth"] > 0]
...: res = fit(zipf, df["depth"], [(1.0, 10.0)])
...: print(res)
params: FitParams(a=1.8108806886591249, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
Using the stats from the AS Set graph stats. (Edit: updated commit: https://github.com/SichangHe/internet_route_verification_meta/commit/437b5fa573b6899dbd825e75e9731178d1f8fec8)
$ python3 -m scripts.stats.as_set_size_fitting
Overview:
n_sets n_nums depth
count 53268.00 53268.00 53268.00
mean 98.11 646.85 2.37
std 946.29 5628.59 13.10
min 0.00 0.00 0.00
25% 0.00 1.00 1.00
50% 0.00 2.00 1.00
75% 1.00 6.00 1.00
max 20426.00 95572.00 300.00
AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877164.
params: FitParams(a=1.4983353782683766, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
params: FitParams(a=2.4292771966768467, loc=0.0)
success: False
message: 'Optimization converged to parameter values that are inconsistent with the data.'
AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets.
The first negative log-likelihood seems too big. The depth fitting somehow failed after we apply the more accurate stats from the AS Set graphs.
Funny; worth mentioning in the paper.
On Thu, Jan 25, 2024 at 11:50 AM Steven Hé (Sīchàng) < @.***> wrote:
62 AS Sets contain themselves and no other AS Sets.
In [7]: has_cycle_no_set = df[df['has_cycle'] & (df['n_sets'] == 0)] In [8]: has_cycle_no_setOut[8]: as_set n_sets n_nums depth has_cycle335 AS399899:AS-ALL 0 0 0 True877 AS266594:AS-DATACENTRICS 0 0 0 True1300 AS-400352 0 0 0 True2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True4134 AS-395359 0 0 0 True ... ... ... ... ... ...63388 AS-STELECOM-ALL 0 8 1 True64771 AS270436:AS-SOOU-TELECOM 0 1 1 True69368 AS-HERTZ 0 1 1 True70309 AS-PDL 0 1 1 True70678 as-ouiheberg 0 4 1 True
[62 rows x 5 columns] In [9]: print(has_cycle_no_set.to_string()) as_set n_sets n_nums depth has_cycle335 AS399899:AS-ALL 0 0 0 True877 AS266594:AS-DATACENTRICS 0 0 0 True1300 AS-400352 0 0 0 True2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True4134 AS-395359 0 0 0 True4467 AS268525:AS-E3 0 2 1 True5146 AS-FUTURE 0 0 0 True5951 AS-ETELECOM-2 0 0 0 True9676 AS29831:AS-LEGACY 0 0 0 True10654 AS-11700 0 0 0 True11515 AS17917:AS-55474 0 0 0 True12037 AS270912:AS-GMN-TELECOM-ONLY 0 0 0 True12780 AS-42386 0 0 0 True14320 AS136093:AS-SET-FAZNET 0 0 0 True14760 AS269782:AS-CUSTOMERS 0 4 1 True14870 AS-38215 0 0 0 True16693 AS264301:AS-LNP-TELECOM 0 1 1 True16809 AS-LEKKS 0 0 0 True16968 AS-ALBIDEYNET 0 0 0 True17071 AS-63603-GSS 0 2 1 True17102 as-44702 0 0 0 True17869 AS268215:AS-REDESPEED-ONLY 0 0 0 True20497 AS-397178 0 0 0 True21150 AS-13720-SOE-2 0 0 0 True26822 AS-BURSABIL 0 2 1 True27301 AS400457:AS-HWLC 0 0 0 True27943 AS268430:AS-FLASHLINK-INTERNET 0 1 1 True28304 AS-SDV-BERLIN 0 1 1 True32296 AS-33350 0 0 0 True32773 AS-NATCO 0 0 0 True33330 AS26388:AS-ALL 0 1 1 True33436 AS46816:AS-ALL 0 1 1 True33457 AS-19256 0 4 1 True33627 AS-ITHOLDINGSCUSTOMERS 0 33 1 True36517 AS266671:AS-ALLv6 0 1 1 True38950 as-hsams 0 1 1 True39662 AS-IPv6-edu-pl 0 1 1 True40957 AS265283:AS-IAGONET-TELECOM 0 1 1 True41057 AS19879:AS-CORE 0 1 1 True44263 AS269782:AS-NETWORKSPEED-001 0 1 1 True44328 AS-RLINE1 0 0 0 True44731 AS-397446 0 1 1 True45079 AS-394036 0 0 0 True45407 as-210750 0 0 0 True46907 AS-58650 0 0 0 True47889 AS30002:AS-NULOOP 0 1 1 True48593 AS-AL-2098 0 1 1 True49858 AS-BDNET 0 4 1 True51126 AS-OPERATELECOM 0 0 0 True51959 AS-398052 0 0 0 True55095 AS-AS32489 0 0 0 True56880 AS-RIKSNET 0 5 1 True57650 AS400088:AS-WAWA-PUBLIC 0 0 0 True57842 AS-21783 0 1 1 True58890 AS-NETNITCO 0 1 1 True59504 AS-36113 0 0 0 True61486 AS204141:AS-KNETDK 0 1 1 True63388 AS-STELECOM-ALL 0 8 1 True64771 AS270436:AS-SOOU-TELECOM 0 1 1 True69368 AS-HERTZ 0 1 1 True70309 AS-PDL 0 1 1 True70678 as-ouiheberg 0 4 1 True
— Reply to this email directly, view it on GitHub https://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1909299201, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPO5ZHAKGQBUJZ4UOZS3LYQHJANAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBZGI4TSMRQGE . You are receiving this because you commented.Message ID: @.***>
For the goodness of fit of the Zipf distribution, it seems that Chi-squared test may be our best bet. It is quite a pain to find testing for discrete distributions. Related.
"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5."
Maybe it doesn't work for the AS-set sizes (and Zipf distributions in general, as most AS-set sizes will have very few samples).
I looked around for a reasonable test for us to use, but didn't find anything easy to apply. (I've applied KS-tests in the past, but on a Zipf distro I think it will not pass even though the fit may be acceptable.) I also found a reference that r-squared isn't a very good approach (like you mentioned). Maybe let's put this on the back burner for now.
On Thu, Jan 25, 2024 at 8:37 PM Steven Hé (Sīchàng) < @.***> wrote:
For the goodness of fit of the Zipf distribution, it seems that Chi-squared test https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html may be our best bet. It is quite a pain to find testing for discrete distributions. Related https://stats.stackexchange.com/a/129734.
— Reply to this email directly, view it on GitHub https://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1910117270, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPO5ZYKJURFTLFDQYBRVLYQJGWFAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJQGEYTOMRXGA . You are receiving this because you commented.Message ID: @.***>
Closing because the mention of the Zipf is removed in the text.
Commit: https://github.com/SichangHe/internet_route_verification/commit/a6cfea255dfabf633bc7de73404188cf16c285fb
as_set_sizes.csv.gz
TODO:
- [x] Fit Zipf distribution on depths.