ASes with policy & CDF of number of policies

SichangHe commented 11 months ago

50999 ASes out of 95911 have at least one policy recorded.

IPython record.

```python
In [1]: import pandas as pd

In [2]: df = pd.read_csv("as_neighbors_vs_rules.csv")

In [3]: df
Out[3]:
       aut_num  neighbor  import  export
0       202125         3       2       2
1       266498        29       4       3
2       132756        -1       0       0
3        41937        13       8       8
4       396741         1       0       0
...        ...       ...     ...     ...
95906   327736        -1       0       0
95907   201276        -1       2       2
95908    32284         2       0       0
95909   399684         3       0       0
95910   136615         1       0       0

[95911 rows x 4 columns]

In [4]: df.count()
Out[4]:
aut_num     95911
neighbor    95911
import      95911
export      95911
dtype: int64

In [6]: df[(df['import'] > 0) | (df['export'] > 0)]
Out[6]:
       aut_num  neighbor  import  export
0       202125         3       2       2
1       266498        29       4       3
3        41937        13       8       8
6       211753         1       1       1
7         1126       134     148     149
...        ...       ...     ...     ...
95901   204472        -1       0       3
95902   207560         2       3       3
95904   201318         2       3       3
95905    17759         1       3       3
95907   201276        -1       2       2

[50999 rows x 4 columns]

In [7]: df[(df['import'] > 0) | (df['export'] > 0)].count()
Out[7]:
aut_num     50999
neighbor    50999
import      50999
export      50999
dtype: int64
```

Following data from #19.

Correction: using data from #67, should be 51262 out of 78951 ASes in the IRR that have rules.
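The counting step behind these numbers can be sketched without pandas. This is only an illustration: the sample rows are copied from the IPython output above, `count_with_policy` is a hypothetical helper name, and the CSV is assumed to be comma-separated with the column names shown in the output.

```python
import csv
import io

# A few rows copied from the IPython output above; the real input is
# as_neighbors_vs_rules.csv, assumed to have the same four columns.
SAMPLE = """aut_num,neighbor,import,export
202125,3,2,2
266498,29,4,3
132756,-1,0,0
41937,13,8,8
396741,1,0,0
"""


def count_with_policy(csv_text):
    """Return (ASes with at least one import or export rule, total ASes)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    with_policy = sum(
        1 for row in rows if int(row["import"]) > 0 or int(row["export"]) > 0
    )
    return with_policy, len(rows)


with_policy, total = count_with_policy(SAMPLE)
print(f"{with_policy} of {total} ASes have at least one rule")
```

On the sample, three of the five ASes have at least one rule; the same filter over the full file gives the counts reported in this comment.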

[x] #103

cunha commented 11 months ago

Very interesting. I was expecting this number to be much lower. Can you plot a CDF? (If you export a file with the number of RPSL rules per AS, one AS per line, I can send you a notebook back. Although these days ChatGPT might just generate the matplotlib code for you.)

SichangHe commented 11 months ago

Regenerated the same stats as in #19 using latest parse results. They are marginally different.

Evcxr output.

```elixir
shape: (95_911, 4)
┌─────────┬──────────┬────────┬────────┐
│ aut_num ┆ neighbor ┆ import ┆ export │
│ ---     ┆ ---      ┆ ---    ┆ ---    │
│ u64     ┆ i32      ┆ u32    ┆ u32    │
╞═════════╪══════════╪════════╪════════╡
│ 137136  ┆ 1        ┆ 0      ┆ 0      │
│ 27322   ┆ -1       ┆ 0      ┆ 0      │
│ 206870  ┆ 3        ┆ 2      ┆ 2      │
│ 60824   ┆ 2        ┆ 2      ┆ 2      │
│ …       ┆ …        ┆ …      ┆ …      │
│ 61414   ┆ 11       ┆ 0      ┆ 0      │
│ 28637   ┆ 29       ┆ 0      ┆ 0      │
│ 205649  ┆ -1       ┆ 3      ┆ 3      │
│ 328130  ┆ -1       ┆ 0      ┆ 0      │
└─────────┴──────────┴────────┴────────┘
shape: (9, 5)
┌────────────┬───────────────┬────────────┬───────────┬───────────┐
│ describe   ┆ aut_num       ┆ neighbor   ┆ import    ┆ export    │
│ ---        ┆ ---           ┆ ---        ┆ ---       ┆ ---       │
│ str        ┆ f64           ┆ f64        ┆ f64       ┆ f64       │
╞════════════╪═══════════════╪════════════╪═══════════╪═══════════╡
│ count      ┆ 95911.0       ┆ 95911.0    ┆ 95911.0   ┆ 95911.0   │
│ null_count ┆ 0.0           ┆ 0.0        ┆ 0.0       ┆ 0.0       │
│ mean       ┆ 125671.693403 ┆ 10.15602   ┆ 4.299079  ┆ 4.185776  │
│ std        ┆ 112437.127515 ┆ 102.243572 ┆ 52.094405 ┆ 48.255334 │
│ min        ┆ 1.0           ┆ -1.0       ┆ 0.0       ┆ 0.0       │
│ 25%        ┆ 34821.0       ┆ 1.0        ┆ 0.0       ┆ 0.0       │
│ 50%        ┆ 62122.0       ┆ 1.0        ┆ 1.0       ┆ 1.0       │
│ 75%        ┆ 205279.5      ┆ 3.0        ┆ 2.0       ┆ 2.0       │
│ max        ┆ 6.131644e6    ┆ 9628.0     ┆ 5724.0    ┆ 5344.0    │
└────────────┴───────────────┴────────────┴───────────┴───────────┘
```

as_neighbors_vs_rules.csv

SichangHe commented 11 months ago

For the CDF, is this what you are talking about, @cunha?

IPython history.

```python
In [2]: import pandas as pd

In [3]: df = pd.read_csv("as_neighbors_vs_rules.csv")

In [5]: df['ports'] = df['import'] + df['export']

In [6]: df
Out[6]:
       aut_num  neighbor  import  export  ports
0       137136         1       0       0      0
1        27322        -1       0       0      0
2       206870         3       2       2      4
3        60824         2       2       2      4
4       139868         2       0       0      0
...        ...       ...     ...     ...    ...
95906    41439         2       0       0      0
95907    61414        11       0       0      0
95908    28637        29       0       0      0
95909   205649        -1       3       3      6
95910   328130        -1       0       0      0

[95911 rows x 5 columns]

In [8]: import matplotlib.pyplot as plt
   ...:
   ...: # Assuming you have already loaded your DataFrame df
   ...:
   ...: # Sort the "ports" column in ascending order
   ...: sorted_ports = df['ports'].sort_values()
   ...:
   ...: # Calculate the cumulative distribution function (CDF)
   ...: cdf = sorted_ports.reset_index(drop=True).cumsum() / sorted_ports.sum()
   ...:
   ...: # Create a plot of the CDF
   ...: plt.figure(figsize=(10, 6))
   ...: plt.plot(sorted_ports, cdf, marker='x', linestyle='none')
   ...: plt.xlabel('Ports')
   ...: plt.ylabel('CDF')
   ...: plt.title('CDF of Ports Column')
   ...: plt.grid()
   ...: plt.show()
```

CDF plot.

![Figure_1](https://github.com/SichangHe/internet_route_verification/assets/84777573/4bb2b3de-f49f-441a-b3c9-82f25d1bf1f4)

cunha commented 11 months ago

That is exactly it. Thanks. Somewhat interesting again! When plotting graphs, I usually try to put in some effort to make the graph stand on its own. I usually avoid titles, but always add meaningful labels on each axis. Here, I think we should have:

Y = "Cumulative Fraction of ASes"

X = "Number of Import and Export Rules"

Given the graph is concentrated close to zero, plotting the X-axis from 1 to 10000 using a log-scale might be better to give us a better idea of the data on the left.

cunha commented 11 months ago

Wrote this up right now, will polish and put it up on my website later... but should be useful to give you a head start:

https://docs.google.com/document/d/1Q5ZrvAOA0fHn883DQaPovIfpz4JB-orKFA5ZSfuWLng/edit?usp=sharing

SichangHe commented 11 months ago

The above plot was just for verifying the content.

I have set up scripts in the paper repo so we can hopefully automate this in the future.

Wide version

[CDF-AS-rules.pdf](https://github.com/SichangHe/internet_route_verification/files/12839501/CDF-AS-rules.pdf)

Squared version

[CDF-AS-rules-squared.pdf](https://github.com/SichangHe/internet_route_verification/files/12839500/CDF-AS-rules-squared.pdf)

SichangHe commented 11 months ago

The CDF does look logarithmic.

PNG preview.

![Figure_1](https://github.com/SichangHe/internet_route_verification/assets/84777573/1cacb4f6-9c0b-4a5c-a32d-4500f82e44e4)

cunha commented 11 months ago

+1; I think the log scale helps in identifying breakpoints (e.g., 45% of ASes have 100 or fewer rules and almost 30% have more than 1000 rules). Really? 1000 rules? What are these people doing? :)

SichangHe commented 11 months ago

All the ASes with more than 1000 rules (96 in total). (whois them to see their rules)

AS4455, AS12552, AS15925, AS10429, AS21385, AS59613, AS3303, AS29119, AS61374, AS8422, AS3257, AS3216, AS9121, AS3327, AS50952, AS31133, AS29076, AS8359, AS6667, AS20485, AS6730, AS5607, AS8897, AS7717, AS24940, AS8881, AS8447, AS8469, AS12731, AS8426, AS6881, AS30740, AS62047, AS48200, AS9031, AS30781, AS9002, AS43252, AS56665, AS4589, AS12350, AS12874, AS62499, AS16150, AS8657, AS44654, AS48793, AS13285, AS9049, AS6661, AS20912, AS5430, AS43531, AS56890, AS8220, AS49666, AS63034, AS3326, AS31449, AS15435, AS29527, AS10102, AS1299, AS3356, AS49544, AS42579, AS13237, AS12859, AS12389, AS41692, AS6805, AS13101, AS8545, AS6695, AS61955, AS1273, AS48850, AS12880, AS2603, AS6762, AS8455, AS8492, AS5413, AS3209, AS20562, AS42476, AS9033, AS41327, AS47228, AS8928, AS16298, AS43760, AS13037, AS5511, AS25152, AS2119

[x] Look into the Y-axis error in the CDF.

SichangHe commented 11 months ago

I switched to a proper way of generating the CDF (using matplotlib 3.8). This looks more like it.

Preview and PDF files.

![image](https://github.com/SichangHe/internet_route_verification/assets/84777573/ca5a1f78-7ddd-4ff1-afa8-77a89262ca16) [CDF-AS-rules-squared.pdf](https://github.com/SichangHe/internet_route_verification/files/12843206/CDF-AS-rules-squared.pdf) [CDF-AS-rules.pdf](https://github.com/SichangHe/internet_route_verification/files/12843207/CDF-AS-rules.pdf)

cunha commented 11 months ago

Ha, this makes more sense! (It seems like you were weighting ASes by their number of rules, which we could also show. We would just need to re-label the Y axis to "Cumulative Fraction of Communities" and the X axis to "Number of Import and Export Rules by Controlling AS".)

93% of ASes have 10 or fewer rules.
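To make the difference between the two curves concrete, here is a minimal sketch in plain Python. The rule counts are hypothetical and chosen to exaggerate the skew, and `per_as_cdf` and `rule_weighted_cdf` are illustrative names, not functions from the project's scripts.

```python
from itertools import accumulate


def per_as_cdf(rule_counts):
    """Empirical CDF: fraction of ASes with at most x rules."""
    ordered = sorted(rule_counts)
    n = len(ordered)
    # Later duplicates overwrite earlier ones, so each distinct x keeps
    # its highest cumulative fraction.
    points = {x: (i + 1) / n for i, x in enumerate(ordered)}
    return sorted(points.items())


def rule_weighted_cdf(rule_counts):
    """Fraction of all rules that belong to ASes with at most x rules."""
    ordered = sorted(rule_counts)
    total = sum(ordered)
    points = {
        x: running / total for x, running in zip(ordered, accumulate(ordered))
    }
    return sorted(points.items())


# Hypothetical rule counts for five ASes.
counts = [0, 0, 1, 2, 97]
print(per_as_cdf(counts))         # 80% of these ASes have at most 2 rules
print(rule_weighted_cdf(counts))  # but together they hold only 3% of the rules
```

Dividing a cumulative sum by the total, as the earlier plot did, produces the second curve; counting ASes produces the first.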

Now... I prefer when CDFs cover the whole [0, 1] range on the Y axis. One option to make this even better is to plot a CCDF (Complementary CDF). It's the same data you have, but you would do:

ccdf_points = [(x, 1 - y) for x, y in cdf_points]
ax.set_ylim(1e-6, 1)  # adjust accordingly; a log axis just can't go down to 0
ax.set_yscale("log")  # ax is the matplotlib Axes being plotted on
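As a rough sketch of the CCDF computation itself, the points can also be derived directly from the per-AS rule counts. The counts below are hypothetical and `ccdf_points` is an illustrative helper, not code from the project.

```python
def ccdf_points(rule_counts):
    """Return (x, fraction of ASes with more than x rules), skipping zeros."""
    ordered = sorted(rule_counts)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Only emit a point at the last occurrence of each distinct value.
        if i + 1 < n and ordered[i + 1] == x:
            continue
        above = n - (i + 1)
        if above > 0:  # a log-scale Y axis cannot show a fraction of 0
            points.append((x, above / n))
    return points


# Hypothetical rule counts for five ASes.
counts = [0, 0, 1, 2, 97]
print(ccdf_points(counts))
```

The largest value is dropped because nothing lies above it, which is why the Y limit needs a small positive lower bound rather than 0. If the X axis is also logarithmic, the point at x = 0 cannot be drawn either.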

SichangHe commented 11 months ago

CCDF, preview + PDF.

![image](https://github.com/SichangHe/internet_route_verification/assets/84777573/7547bd4b-2422-4c26-a0e1-fe91d7f9a043) [CDF-AS-rules-squared.pdf](https://github.com/SichangHe/internet_route_verification/files/12857505/CDF-AS-rules-squared.pdf) [CDF-AS-rules.pdf](https://github.com/SichangHe/internet_route_verification/files/12857506/CDF-AS-rules.pdf)

SichangHe commented 11 months ago

Potential TODO: add dots of significant ASes on the graph.

SichangHe commented 11 months ago

[x] Exclude the impact from AS Relational DB as discovered in https://github.com/SichangHe/internet_route_verification/issues/64#issuecomment-1762962607.

Using data from #67. The plot only moves up a little bit.

CCDF, preview + PDF.

![image](https://github.com/SichangHe/internet_route_verification/assets/84777573/2bce91e4-75ec-4944-8d91-c10275bbb1d5) [CDF-AS-rules-squared.pdf](https://github.com/SichangHe/internet_route_verification/files/12908459/CDF-AS-rules-squared.pdf) [CDF-AS-rules.pdf](https://github.com/SichangHe/internet_route_verification/files/12908460/CDF-AS-rules.pdf)

SichangHe commented 8 months ago

Updated Import/export rules stats in the text.

Running at `internet_route_verification_meta/scripts`.

```python
In [5]: from scripts.csv_files import as_neighbors_vs_rules

In [6]: FILE = as_neighbors_vs_rules

In [7]: df_raw = pd.read_csv(FILE.path)
   ...: # Remove ASes not in IRR.
   ...: df = df_raw.drop(df_raw[df_raw["import"] == -1].index)
   ...: df["rules"] = df["import"] + df["export"]

In [8]: df[df['rules'] == 0]
Out[8]:
       aut_num  provider  peer  customer  import  export  rules
1        58421         2     0         0       0       0      0
2        64070        -1    -1        -1       0       0      0
3       150451        -1    -1        -1       0       0      0
12      398994         1     1         0       0       0      0
19      398037         3     0         0       0       0      0
...        ...       ...   ...       ...     ...     ...    ...
95836    45163        -1    -1        -1       0       0      0
95837    63633        -1    -1        -1       0       0      0
95838   396609        -1    -1        -1       0       0      0
95839   141451        -1    -1        -1       0       0      0
95840   150084         1     0         0       0       0      0

[27840 rows x 7 columns]

In [9]: df[df['rules'] == 0].__len__() / len(df)
Out[9]: 0.3537439168498494

In [10]: df[df['rules'] > 10].__len__() / len(df)
Out[10]: 0.0879404327772201

In [11]: df[df['rules'] >= 10].__len__() / len(df)
Out[11]: 0.1086263198688708

In [12]: df[df['rules'] >= 1000].__len__() / len(df)
Out[12]: 0.0012833382040888933

In [13]: df[df['rules'] > 1000].__len__() / len(df)
Out[13]: 0.001257925566384163

In [14]: df[df['rules'] >= 1000]
Out[14]:
       aut_num  provider  peer  customer  import  export  rules
1466     48850        -1    -1        -1     520     520   1040
1480     13037         3   103         5     724     726   1450
2066     30740         3    89         3     783     783   1566
4143     13285         6   191         5     589     587   1176
4210      8426         7   971        17    3073    3073   6146
...        ...       ...   ...       ...     ...     ...    ...
89544    10102        -1    -1        -1    1024      75   1099
91014     3356         0    71      6459    5706    5344  11050
91950     9049         4   165       410     734     734   1468
92030     1299         0    48      2307    5724       4   5728
92509     9033        -1    -1        -1    1770    1770   3540

[101 rows x 7 columns]
```

@cunha, this is how we "get a list of the ASes with more than 1000 rules".
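The same filtering and threshold shares can be sketched without pandas. This is a simplified illustration: the `provider`, `peer`, and `customer` columns are omitted, the first five rows reuse rule counts from the output above, the last row (AS64496, a documentation ASN) is hypothetical and stands in for an AS that is not in the IRR, and `share` is an illustrative helper name.

```python
import csv
import io

# First five rows: rule counts copied from the output above.
# Last row: hypothetical AS not in the IRR, marked by import == -1.
SAMPLE = """aut_num,import,export
58421,0,0
48850,520,520
13037,724,726
1299,5724,4
3356,5706,5344
64496,-1,-1
"""


def share(csv_text, predicate):
    """Fraction of IRR ASes whose total rule count satisfies predicate."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Drop ASes not in the IRR, then sum import and export rules per AS.
    rules = [
        int(row["import"]) + int(row["export"])
        for row in rows
        if int(row["import"]) != -1
    ]
    return sum(1 for n in rules if predicate(n)) / len(rules)


print(share(SAMPLE, lambda n: n == 0))     # share of ASes with no rules
print(share(SAMPLE, lambda n: n >= 1000))  # share with at least 1000 rules
```

The sample is deliberately tiny and skewed towards large ASes, so its shares say nothing about the real distribution; only the procedure carries over.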

Edit: The file: https://github.com/SichangHe/internet_route_verification/files/13895627/as_neighbors_vs_rules4.csv

SichangHe commented 4 months ago

Update after handling PeerAS: as_neighbors_vs_rules5.csv.gz, little difference.

SichangHe / internet_route_verification

ASes with policy & CDF of number of policies #60