Closed SichangHe closed 8 months ago
Very interesting. I was expecting this number to be much lower. Can you plot a CDF? (If you export a file with the number of RPSL rules per AS, one AS per line, I can send you a notebook back. Although these days ChatGPT might just generate the matplotlib
code for you.)
Regenerated the same stats as in #19 using latest parse results. They are marginally different.
For the CDF, is this what you are talking about, @cunha?
That is exactly it. Thanks. Somewhat interesting again! When plotting graphs, I usually try to put in some effort to make the graph stand out alone. I usually avoid titles, but always add meaningful labels on each axis. In here, I think we should have:
Y = "Cumulative Fraction of ASes"
X = "Number of Import and Export Rules"
Given the graph is concentrated close to zero, plotting the X-axis from 1 to 10000 using a log-scale might be better to give us a better idea of the data on the left.
Wrote this up right now, will polish and put it up on my website later... but should be useful to give you a head start:
https://docs.google.com/document/d/1Q5ZrvAOA0fHn883DQaPovIfpz4JB-orKFA5ZSfuWLng/edit?usp=sharing
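For reference, a minimal sketch of the suggested plot: an empirical CDF of rules per AS with a log-scale X axis from 1 to 10000 and the axis labels proposed above. The function names and the output filename are made up for illustration, and the input is assumed to be a plain list with one rule count per AS. This is not the script used in the paper repo.

```python
from collections import Counter


def empirical_cdf(rule_counts):
    """Return sorted (x, fraction of ASes with <= x rules) points."""
    total = len(rule_counts)
    points = []
    seen = 0
    for value, n in sorted(Counter(rule_counts).items()):
        seen += n
        points.append((value, seen / total))
    return points


def plot_cdf(points, out_path="rules_per_as_cdf.pdf"):
    """Plot the CDF with a log-scale X axis (out_path is a hypothetical name)."""
    # Imported here so empirical_cdf() works even without matplotlib installed.
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    fig, ax = plt.subplots()
    ax.step(xs, ys, where="post")
    # ASes with 0 rules still count in the fractions, but a log axis
    # cannot show x = 0, so the axis starts at 1.
    ax.set_xscale("log")
    ax.set_xlim(1, 10_000)
    ax.set_ylim(0, 1)
    ax.set_xlabel("Number of Import and Export Rules")
    ax.set_ylabel("Cumulative Fraction of ASes")
    ax.grid(True, which="both", linestyle=":")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

Usage would be something like `plot_cdf(empirical_cdf(counts))`, where `counts` holds import plus export rules for each AS.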
The above plot was just for verifying the content.
I have set up scripts in the paper repo so we can hopefully automate this in the future.
The CDF does look logarithmic.
+1; I think the logscale helps in identifying breakpoints (e.g., 45% of ASes have 100 or fewer rules and almost 30% have more than 1000 rules). Really? 1000 rules? What are these people doing? :)
I went and changed to a proper way to generate CDF (using matplotlib 3.8). This looks more like it.
Ha, this makes more sense! (It seems like you were weighting ASes by their number of rules, which we could also show. We would just need to re-label the Y axis to "Cumulative Fraction of Communities" and the X axis to "Number of Import and Export Rules by Controlling AS".)
93% of ASes have 10 or fewer rules.
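To make the difference between the two weightings concrete, here is a small sketch. It assumes the earlier plot weighted each AS by its rule count, as guessed above. The `cdf` helper and the sample counts are hypothetical, not taken from the real data.

```python
from collections import Counter


def cdf(rule_counts, weight_by_rules=False):
    """CDF over ASes, or over rules when weight_by_rules is True."""
    per_value = Counter(rule_counts)
    if weight_by_rules:
        # Each AS contributes its own rule count to the Y axis.
        weight = lambda value, n_ases: value * n_ases
    else:
        # Each AS contributes exactly 1, whatever its rule count.
        weight = lambda value, n_ases: n_ases
    total = sum(weight(v, n) for v, n in per_value.items())
    points, running = [], 0
    for v, n in sorted(per_value.items()):
        running += weight(v, n)
        points.append((v, running / total))
    return points


# Hypothetical sample: eight ASes with 1 rule, one with 2, one with 990.
counts = [1, 1, 1, 1, 1, 1, 1, 1, 2, 990]
print(cdf(counts))                        # fraction of ASes
print(cdf(counts, weight_by_rules=True))  # fraction of rules
```

In this sample, 90% of the ASes have 2 or fewer rules, yet those ASes hold only 1% of all rules. That is why the weighted curve sits so much further to the right.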
Now... I prefer when CDFs cover the whole [0, 1] range on the Y axis. One option to make this even better is to plot a CCDF (Complementary CDF). It's the same data you have, but you just do:
ccdf_points = [(x, 1 - y) for x, y in cdf_points]
ax.set_ylim((1e-6, 1))  # ax is your matplotlib Axes; adjust accordingly, a log axis just can't go down to 0
ax.set_yscale("log")
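A slightly fuller sketch of the CCDF idea, assuming `cdf_points` is a list of `(x, y)` pairs with `y` the cumulative fraction. The helper names, the Y-axis label, and the output filename are my own placeholders.

```python
def to_ccdf(cdf_points):
    """Map (x, P[X <= x]) points to (x, P[X > x]) points."""
    return [(x, 1 - y) for x, y in cdf_points]


def drop_nonpositive(ccdf_points):
    """Remove points with y <= 0, which a log-scale Y axis cannot show."""
    return [(x, y) for x, y in ccdf_points if y > 0]


def plot_ccdf(ccdf_points, out_path="rules_per_as_ccdf.pdf"):
    """Plot a CCDF on log-log axes (out_path is a hypothetical name)."""
    # Imported here so the helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    xs, ys = zip(*drop_nonpositive(ccdf_points))
    fig, ax = plt.subplots()
    ax.step(xs, ys, where="post")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_ylim(1e-6, 1)  # adjust the lower bound to the data; it cannot be 0
    ax.set_xlabel("Number of Import and Export Rules")
    ax.set_ylabel("Complementary Cumulative Fraction of ASes")  # placeholder label
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

The last CDF point always maps to a CCDF value of 0, so it is dropped before plotting. The tail of the curve ends at the second-largest distinct rule count.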
Potential TODO: add dots of significant ASes on the graph.
Using data from #67. The plot only moves up a little bit.
@cunha, this is how we "get a list of the ASes with more than 1000 rules".
Edit: The file: https://github.com/SichangHe/internet_route_verification/files/13895627/as_neighbors_vs_rules4.csv
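For anyone who wants to redo this from the CSV, a rough sketch with the standard library. It assumes the file has the `aut_num`, `import`, and `export` columns shown in the IPython record in this thread, and that "rules" means import plus export. The function name and the sample rows (documentation-range AS numbers) are made up.

```python
import csv
import io


def ases_over_threshold(csv_file, threshold=1000):
    """Return (aut_num, total rules) pairs above threshold, largest first."""
    result = []
    for row in csv.DictReader(csv_file):
        # Assumption: an AS's rule count is its import plus export rules.
        rules = int(row["import"]) + int(row["export"])
        if rules > threshold:
            result.append((int(row["aut_num"]), rules))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows in the assumed column layout, not real data.
sample = io.StringIO(
    "aut_num,neighbor,import,export\n"
    "64496,3,2,2\n"
    "64497,10,800,700\n"
    "64498,-1,1001,0\n"
)
for aut_num, rules in ases_over_threshold(sample):
    print(f"AS{aut_num}: {rules} rules")
```

With the real file, this would be called as `ases_over_threshold(open("as_neighbors_vs_rules4.csv", newline=""))`, assuming the CSV has been downloaded to the working directory.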
Update after handling PeerAS: as_neighbors_vs_rules5.csv.gz, little difference.
50999 ASes out of 95911 have at least one policy recorded.
IPython record.
```python
In [1]: import pandas as pd

In [2]: df = pd.read_csv("as_neighbors_vs_rules.csv")

In [3]: df
Out[3]:
       aut_num  neighbor  import  export
0       202125         3       2       2
1       266498        29       4       3
2       132756        -1       0       0
3        41937        13       8       8
4       396741         1       0       0
...        ...       ...     ...     ...
95906   327736        -1       0       0
95907   201276        -1       2       2
95908    32284         2       0       0
95909   399684         3       0       0
95910   136615         1       0       0

[95911 rows x 4 columns]

In [4]: df.count()
Out[4]:
aut_num     95911
neighbor    95911
import      95911
export      95911
dtype: int64

In [6]: df[(df['import'] > 0) | (df['export'] > 0)]
Out[6]:
       aut_num  neighbor  import  export
0       202125         3       2       2
1       266498        29       4       3
3        41937        13       8       8
6       211753         1       1       1
7         1126       134     148     149
...        ...       ...     ...     ...
95901   204472        -1       0       3
95902   207560         2       3       3
95904   201318         2       3       3
95905    17759         1       3       3
95907   201276        -1       2       2

[50999 rows x 4 columns]

In [7]: df[(df['import'] > 0) | (df['export'] > 0)].count()
Out[7]:
aut_num     50999
neighbor    50999
import      50999
export      50999
dtype: int64
```
Following data from #19.
Correction: using data from #67, should be 51262 out of 78951 ASes in the IRR that have rules.