Flatten AS Sets & getting sizes and other stats

SichangHe commented 6 months ago

Flattened 71346 AS Sets in 471ms.

Commit: https://github.com/SichangHe/internet_route_verification/commit/a6cfea255dfabf633bc7de73404188cf16c285fb

as_set_sizes.csv.gz

TODO:

[x] Fit Zipf distribution on sizes.
- [ ] Goodness of fit. ~~- [x] Fit Zipf distribution on depths.~~
[x] Count cycles.

SichangHe commented 6 months ago

Overview of AS Set sizes (including pseudo sets).

In [2]: df = pd.read_csv('as_set_sizes.csv.gz')

In [3]: df
Out[3]:
                        as_set  size
0            AS-399760-CLIENTS     1
1                   AS-ISIONUK     2
2                     c#203447     2
3                  AS-NICPROXY     7
4           AS199221:AS-199221     2
...                        ...   ...
71341                 AS-BITON     1
71342                 AS-30998     8
71343    AS268235:AS-CUSTOMERS    42
71344                    AS-V6    16
71345  AS-SET-LEVEL-7-INTERNET     5

[71346 rows x 2 columns]

In [4]: df.describe()
Out[4]:
               size
count  71346.000000
mean     439.580089
std     4461.051650
min        0.000000
25%        1.000000
50%        1.000000
75%        5.000000
max    93709.000000

SichangHe commented 6 months ago

Stats for pseudo sets and real sets:

In [5]: df_w_hash = df[df['as_set'].str.contains('#')]

In [6]: df_w_hash
Out[6]:
                                    as_set  size
2                                 c#203447     2
5                                  c#22987     5
6                                 c#395466     1
8                                  c#61605     2
9                m#AS-CBL-TRANSIT#MNT-BAHA     1
...                                    ...   ...
71309  m#AS-IXBR-TRANSIT4-SP#MAINT-AS28186     1
71318                              c#59239    58
71326                             c#212271     2
71337                               c#7795    76
71339                              c#46393     1

[17970 rows x 2 columns]

In [7]: df_wo_hash = df[~df['as_set'].str.contains('#')]

In [8]: df_wo_hash
Out[8]:
                        as_set  size
0            AS-399760-CLIENTS     1
1                   AS-ISIONUK     2
3                  AS-NICPROXY     7
4           AS199221:AS-199221     2
7         AS28855:AS-CUSTOMERS     1
...                        ...   ...
71341                 AS-BITON     1
71342                 AS-30998     8
71343    AS268235:AS-CUSTOMERS    42
71344                    AS-V6    16
71345  AS-SET-LEVEL-7-INTERNET     5

[53376 rows x 2 columns]

In [9]: df_wo_hash.describe()
Out[9]:
               size
count  53376.000000
mean     584.558847
std     5149.282015
min        0.000000
25%        1.000000
50%        2.000000
75%        6.000000
max    93709.000000

SichangHe commented 6 months ago

Sketch histogram.

Code used (generated).

```python import matplotlib.pyplot as plt # Plotting histogram plt.hist(df_wo_hash['size'], bins=1000, edgecolor='black') plt.xscale('log') plt.yscale('log') plt.xlabel('Size') plt.ylabel('Frequency') plt.title('Histogram of Sizes in df_wo_hash') plt.grid(True) plt.show() ```

SichangHe commented 6 months ago

AS Sets (86MiB): as_sets.txt.gz

SichangHe commented 5 months ago

It seems that the Zipf distribution fitting failed!?

IPython history.

```python In [1]: import pandas as pd pdf In [2]: df = pd.read_csv('as_set_sizes.csv.gz') In [3]: df Out[3]: as_set size 0 AS-399760-CLIENTS 1 1 AS-ISIONUK 2 2 c#203447 2 3 AS-NICPROXY 7 4 AS199221:AS-199221 2 ... ... ... 71341 AS-BITON 1 71342 AS-30998 8 71343 AS268235:AS-CUSTOMERS 42 71344 AS-V6 16 71345 AS-SET-LEVEL-7-INTERNET 5 [71346 rows x 2 columns] In [5]: from scipy.stats import zipf, fit In [8]: res = fit(zipf, df["size"], [(1.0, 10.0)]) In [9]: res Out[9]: params: FitParams(a=1.5713559159360042, loc=0.0) success: False message: 'Optimization converged to parameter values that are inconsistent with the data.' ```

PMF Plotting.

![image](https://github.com/SichangHe/internet_route_verification/assets/84777573/84b986b5-e550-45aa-bef1-2afe708e7201) Commit: https://github.com/SichangHe/internet_route_verification_meta/commit/d912451089969457bc9313a989460424c18b73a6

cunha commented 5 months ago

I've never seen anything like that before. I'd check whether the parameters passed to the function are correct... if that's not it then it may be that the data doesn't fit a Zipf distribution in the end.

From: Steven Hé (Sīchàng) @.> Sent: Monday, January 8, 2024 9:54 AM To: SichangHe/internet_route_verification @.> Cc: Subscribed @.***> Subject: Re: [SichangHe/internet_route_verification] Flatten AS Sets & getting sizes (Issue #114)

It seems that the Zipf distribution fitting failed!?

IPython history.

In [1]: import pandas as pd pdf In [2]: df = pd.read_csv('as_set_sizes.csv.gz')

In [3]: df Out[3]: as_set size 0 AS-399760-CLIENTS 1 1 AS-ISIONUK 2 2 c#203447 2 3 AS-NICPROXY 7 4 AS199221:AS-199221 2 ... ... ... 71341 AS-BITON 1 71342 AS-30998 8 71343 AS268235:AS-CUSTOMERS 42 71344 AS-V6 16 71345 AS-SET-LEVEL-7-INTERNET 5

[71346 rows x 2 columns]

In [5]: from scipy.stats import zipf, fit

In [8]: res = fit(zipf, df["size"], [(1.0, 10.0)])

In [9]: res Out[9]: params: FitParams(a=1.5713559159360042, loc=0.0) success: False message: 'Optimization converged to parameter values that are inconsistent with the data.'

— Reply to this email directly, view it on GitHubhttps://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1880290704, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACPO5YOLIU7YF7N6XYVHPDYNNGWFAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBQGI4TANZQGQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>

SichangHe commented 5 months ago

… Not a Zipf.

SichangHe commented 5 months ago

as_set_sizes1.csv.gz

Ipython history: https://github.com/SichangHe/internet_route_verification/issues/121

SichangHe commented 5 months ago

Previous results were wrong due to the erroneous script (fixed in https://github.com/SichangHe/internet_route_verification/commit/e29d04dd57693a862efe401e023e33651b964a8b).

Results from corrected script, with depth: as_set_sizes2.csv.gz

In [79]: df_raw = pd.read_csv("as_set_sizes2.csv.gz")

In [80]: df = df_raw[~df_raw['as_set'].str.contains('#')]

In [81]: df
Out[81]:
                        as_set  size  depth
0            AS-399760-CLIENTS     1      0
1                   AS-ISIONUK     2      0
3                  AS-NICPROXY     7      0
4           AS199221:AS-199221     2      0
7         AS28855:AS-CUSTOMERS     1      0
...                        ...   ...    ...
71213                 AS-BITON     1      0
71214                 AS-30998     8      0
71215    AS268235:AS-CUSTOMERS    42      0
71216                    AS-V6    16      0
71217  AS-SET-LEVEL-7-INTERNET     5      0

[53268 rows x 3 columns]

In [82]: df.describe()
Out[82]:
               size         depth
count  53268.000000  53268.000000
mean     648.176260      1.730194
std     5636.609586     13.263516
min        0.000000      0.000000
25%        1.000000      0.000000
50%        2.000000      0.000000
75%        6.000000      0.000000
max    95591.000000    299.000000

Distribution of sizes are almost unchanged, because the previous bug only impacted the few nested sets. As we can see now, most sets are flat.

Still not a Zipf.

![image](https://github.com/SichangHe/internet_route_verification/assets/84777573/cdb66439-cff6-44e8-b64c-99fea161b9dc) Same script used as above.

SichangHe commented 5 months ago

In [119]: df.tail(260)
Out[119]:
                    as_set   size  depth
53008         AS3326:AS-UA  89580     40
53009   AS15562:AS-LEVEL42      1     41
53010   AS15562:AS-LEVEL43      1     42
53011   AS15562:AS-LEVEL44      1     43
53012   AS15562:AS-LEVEL45      1     44
...                    ...    ...    ...
53263  AS15562:AS-LEVEL296      1    295
53264  AS15562:AS-LEVEL297      1    296
53265  AS15562:AS-LEVEL298      1    297
53266  AS15562:AS-LEVEL299      1    298
53267  AS15562:AS-LEVEL300      1    299

[260 rows x 3 columns]

It seems that AS15562 just have a chain of AS Sets. Though, they are rather the anomaly.

SichangHe commented 5 months ago

The Zipf distribution fits after removing sets with no members: https://github.com/SichangHe/internet_route_verification_meta/commit/cf00b74cd7b06d3f4758aa0b29918adf35c618f5

In [122]: res = fit(zipf, df[df["size"] > 0]["size"], [(1.0, 10.0)])
     ...: print(res)
  params: FitParams(a=1.498530833262867, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

SichangHe commented 5 months ago

Depths also fit: https://github.com/SichangHe/internet_route_verification_meta/commit/f3a25d00477ff28aae79cb59d38ad0dee383b66e.

In [126]:     df = df_wo_hash[df_wo_hash["depth"] > 0]
     ...:     res = fit(zipf, df["depth"], [(1.0, 10.0)])
     ...:     print(res)
  params: FitParams(a=1.8108806886591249, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

SichangHe commented 5 months ago

AS Sets (226M): https://github.com/SichangHe/internet_route_verification/releases/download/data-114/as_sets2.txt.gz

SichangHe commented 5 months ago

as_set_graph_stats.csv

SichangHe commented 5 months ago

Using the stats from the AS Set graph stats. (Edit: updated commit: https://github.com/SichangHe/internet_route_verification_meta/commit/437b5fa573b6899dbd825e75e9731178d1f8fec8)

$ python3 -m scripts.stats.as_set_size_fitting
Overview:
         n_sets    n_nums     depth
count  53268.00  53268.00  53268.00
mean      98.11    646.85      2.37
std      946.29   5628.59     13.10
min        0.00      0.00      0.00
25%        0.00      1.00      1.00
50%        0.00      2.00      1.00
75%        1.00      6.00      1.00
max    20426.00  95572.00    300.00

AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877164.
  params: FitParams(a=1.4983353782683766, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
  params: FitParams(a=2.4292771966768467, loc=0.0)
 success: False
 message: 'Optimization converged to parameter values that are inconsistent with the data.'

AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets.

The first negative log-likelihood seems too big. The depth fitting somehow failed after we apply the more accurate stats from the AS Set graphs.

SichangHe commented 5 months ago

62 AS Sets contain themselves and no other AS Sets.

```python In [7]: has_cycle_no_set = df[df['has_cycle'] & (df['n_sets'] == 0)] In [8]: has_cycle_no_set Out[8]: as_set n_sets n_nums depth has_cycle 335 AS399899:AS-ALL 0 0 0 True 877 AS266594:AS-DATACENTRICS 0 0 0 True 1300 AS-400352 0 0 0 True 2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True 4134 AS-395359 0 0 0 True ... ... ... ... ... ... 63388 AS-STELECOM-ALL 0 8 1 True 64771 AS270436:AS-SOOU-TELECOM 0 1 1 True 69368 AS-HERTZ 0 1 1 True 70309 AS-PDL 0 1 1 True 70678 as-ouiheberg 0 4 1 True [62 rows x 5 columns] In [9]: print(has_cycle_no_set.to_string()) as_set n_sets n_nums depth has_cycle 335 AS399899:AS-ALL 0 0 0 True 877 AS266594:AS-DATACENTRICS 0 0 0 True 1300 AS-400352 0 0 0 True 2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True 4134 AS-395359 0 0 0 True 4467 AS268525:AS-E3 0 2 1 True 5146 AS-FUTURE 0 0 0 True 5951 AS-ETELECOM-2 0 0 0 True 9676 AS29831:AS-LEGACY 0 0 0 True 10654 AS-11700 0 0 0 True 11515 AS17917:AS-55474 0 0 0 True 12037 AS270912:AS-GMN-TELECOM-ONLY 0 0 0 True 12780 AS-42386 0 0 0 True 14320 AS136093:AS-SET-FAZNET 0 0 0 True 14760 AS269782:AS-CUSTOMERS 0 4 1 True 14870 AS-38215 0 0 0 True 16693 AS264301:AS-LNP-TELECOM 0 1 1 True 16809 AS-LEKKS 0 0 0 True 16968 AS-ALBIDEYNET 0 0 0 True 17071 AS-63603-GSS 0 2 1 True 17102 as-44702 0 0 0 True 17869 AS268215:AS-REDESPEED-ONLY 0 0 0 True 20497 AS-397178 0 0 0 True 21150 AS-13720-SOE-2 0 0 0 True 26822 AS-BURSABIL 0 2 1 True 27301 AS400457:AS-HWLC 0 0 0 True 27943 AS268430:AS-FLASHLINK-INTERNET 0 1 1 True 28304 AS-SDV-BERLIN 0 1 1 True 32296 AS-33350 0 0 0 True 32773 AS-NATCO 0 0 0 True 33330 AS26388:AS-ALL 0 1 1 True 33436 AS46816:AS-ALL 0 1 1 True 33457 AS-19256 0 4 1 True 33627 AS-ITHOLDINGSCUSTOMERS 0 33 1 True 36517 AS266671:AS-ALLv6 0 1 1 True 38950 as-hsams 0 1 1 True 39662 AS-IPv6-edu-pl 0 1 1 True 40957 AS265283:AS-IAGONET-TELECOM 0 1 1 True 41057 AS19879:AS-CORE 0 1 1 True 44263 AS269782:AS-NETWORKSPEED-001 0 1 1 True 44328 AS-RLINE1 0 0 0 True 44731 AS-397446 0 1 1 True 45079 AS-394036 0 0 0 True 45407 as-210750 0 0 0 True 46907 AS-58650 0 0 0 True 47889 AS30002:AS-NULOOP 0 1 1 True 48593 AS-AL-2098 0 1 1 True 49858 AS-BDNET 0 4 1 True 51126 AS-OPERATELECOM 0 0 0 True 51959 AS-398052 0 0 0 True 55095 AS-AS32489 0 0 0 True 56880 AS-RIKSNET 0 5 1 True 57650 AS400088:AS-WAWA-PUBLIC 0 0 0 True 57842 AS-21783 0 1 1 True 58890 AS-NETNITCO 0 1 1 True 59504 AS-36113 0 0 0 True 61486 AS204141:AS-KNETDK 0 1 1 True 63388 AS-STELECOM-ALL 0 8 1 True 64771 AS270436:AS-SOOU-TELECOM 0 1 1 True 69368 AS-HERTZ 0 1 1 True 70309 AS-PDL 0 1 1 True 70678 as-ouiheberg 0 4 1 True ```

cunha commented 5 months ago

Funny; worth mentioning in the paper.

On Thu, Jan 25, 2024 at 11:50 AM Steven Hé (Sīchàng) < @.***> wrote:

62 AS Sets contain themselves and no other AS Sets.

In [7]: has_cycle_no_set = df[df['has_cycle'] & (df['n_sets'] == 0)] In [8]: has_cycle_no_setOut[8]: as_set n_sets n_nums depth has_cycle335 AS399899:AS-ALL 0 0 0 True877 AS266594:AS-DATACENTRICS 0 0 0 True1300 AS-400352 0 0 0 True2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True4134 AS-395359 0 0 0 True ... ... ... ... ... ...63388 AS-STELECOM-ALL 0 8 1 True64771 AS270436:AS-SOOU-TELECOM 0 1 1 True69368 AS-HERTZ 0 1 1 True70309 AS-PDL 0 1 1 True70678 as-ouiheberg 0 4 1 True

[62 rows x 5 columns] In [9]: print(has_cycle_no_set.to_string()) as_set n_sets n_nums depth has_cycle335 AS399899:AS-ALL 0 0 0 True877 AS266594:AS-DATACENTRICS 0 0 0 True1300 AS-400352 0 0 0 True2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True4134 AS-395359 0 0 0 True4467 AS268525:AS-E3 0 2 1 True5146 AS-FUTURE 0 0 0 True5951 AS-ETELECOM-2 0 0 0 True9676 AS29831:AS-LEGACY 0 0 0 True10654 AS-11700 0 0 0 True11515 AS17917:AS-55474 0 0 0 True12037 AS270912:AS-GMN-TELECOM-ONLY 0 0 0 True12780 AS-42386 0 0 0 True14320 AS136093:AS-SET-FAZNET 0 0 0 True14760 AS269782:AS-CUSTOMERS 0 4 1 True14870 AS-38215 0 0 0 True16693 AS264301:AS-LNP-TELECOM 0 1 1 True16809 AS-LEKKS 0 0 0 True16968 AS-ALBIDEYNET 0 0 0 True17071 AS-63603-GSS 0 2 1 True17102 as-44702 0 0 0 True17869 AS268215:AS-REDESPEED-ONLY 0 0 0 True20497 AS-397178 0 0 0 True21150 AS-13720-SOE-2 0 0 0 True26822 AS-BURSABIL 0 2 1 True27301 AS400457:AS-HWLC 0 0 0 True27943 AS268430:AS-FLASHLINK-INTERNET 0 1 1 True28304 AS-SDV-BERLIN 0 1 1 True32296 AS-33350 0 0 0 True32773 AS-NATCO 0 0 0 True33330 AS26388:AS-ALL 0 1 1 True33436 AS46816:AS-ALL 0 1 1 True33457 AS-19256 0 4 1 True33627 AS-ITHOLDINGSCUSTOMERS 0 33 1 True36517 AS266671:AS-ALLv6 0 1 1 True38950 as-hsams 0 1 1 True39662 AS-IPv6-edu-pl 0 1 1 True40957 AS265283:AS-IAGONET-TELECOM 0 1 1 True41057 AS19879:AS-CORE 0 1 1 True44263 AS269782:AS-NETWORKSPEED-001 0 1 1 True44328 AS-RLINE1 0 0 0 True44731 AS-397446 0 1 1 True45079 AS-394036 0 0 0 True45407 as-210750 0 0 0 True46907 AS-58650 0 0 0 True47889 AS30002:AS-NULOOP 0 1 1 True48593 AS-AL-2098 0 1 1 True49858 AS-BDNET 0 4 1 True51126 AS-OPERATELECOM 0 0 0 True51959 AS-398052 0 0 0 True55095 AS-AS32489 0 0 0 True56880 AS-RIKSNET 0 5 1 True57650 AS400088:AS-WAWA-PUBLIC 0 0 0 True57842 AS-21783 0 1 1 True58890 AS-NETNITCO 0 1 1 True59504 AS-36113 0 0 0 True61486 AS204141:AS-KNETDK 0 1 1 True63388 AS-STELECOM-ALL 0 8 1 True64771 AS270436:AS-SOOU-TELECOM 0 1 1 True69368 AS-HERTZ 0 1 1 True70309 AS-PDL 0 1 1 True70678 as-ouiheberg 0 4 1 True

— Reply to this email directly, view it on GitHub https://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1909299201, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPO5ZHAKGQBUJZ4UOZS3LYQHJANAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBZGI4TSMRQGE . You are receiving this because you commented.Message ID: @.***>

SichangHe commented 5 months ago

For the goodness of fit of the Zipf distribution, it seems that Chi-squared test may be our best bet. It is quite a pain to find testing for discrete distributions. Related.

cunha commented 5 months ago

"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5."

Maybe it doesn't work for the AS-set sizes (and Zipf distributions in general, as most AS-set sizes will have very few samples).

I looked around for a reasonable test for us to use, but didn't find anything easy to apply. (I've applied KS-tests in the past, but on a Zipf distro I think it will not pass even though the fit may be acceptable.) I also found a reference that r-squared isn't a very good approach (like you mentioned). Maybe let's put this on the back burner for now.

On Thu, Jan 25, 2024 at 8:37 PM Steven Hé (Sīchàng) < @.***> wrote:

For the goodness of fit of the Zipf distribution, it seems that Chi-squared test https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html may be our best bet. It is quite a pain to find testing for discrete distributions. Related https://stats.stackexchange.com/a/129734.

— Reply to this email directly, view it on GitHub https://github.com/SichangHe/internet_route_verification/issues/114#issuecomment-1910117270, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPO5ZYKJURFTLFDQYBRVLYQJGWFAVCNFSM6AAAAABBBO4AZKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJQGEYTOMRXGA . You are receiving this because you commented.Message ID: @.***>

SichangHe commented 1 month ago

Closing because the mention of the Zipf is removed in the text.

SichangHe / internet_route_verification

Flatten AS Sets & getting sizes and other stats #114