InternetHealthReport / internet-yellow-pages

A knowledge graph for the Internet
https://iyp.iijlab.net
GNU General Public License v3.0

Include layer information to stanford.asdb AS categories #116

Closed JustinLoye closed 8 months ago

JustinLoye commented 9 months ago

Explain the dataset you want to add and how it would contribute to the Internet Yellow Pages.

stanford.asdb classifies ASNs into categories (layer 1) and sub-categories (layer 2). However, IYP currently does not store the layer information. It would be nice to have it for several reasons:

If possible describe how you would like to model the dataset in the Yellow Pages

m-appel commented 9 months ago

We have two options:

  1. Keep the Other and model around it or
  2. Ignore all (layer 1 + 2) Other categories.

While I prefer option 2, it means ignoring data from the original dataset, which we normally do not want unless the data is obviously broken. Option 2 would be easy to implement: we would simply not push the Other node or any relationships to it.

For option 1, I would prefer not to introduce (:Tag)-[:PART_OF]->(:Tag) relationships, since this seems to overcomplicate the graph for a niche (and imho almost useless) edge case. Instead, I would propose concatenating the layer 1 name to the layer 2 name, as you wrote, and pushing the node as, e.g., (:Tag {label: 'Computer and Information Technology - Other'}).
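To make the concatenation idea concrete, here is a minimal sketch. The function name `tag_label` and its signature are hypothetical and not part of the IYP code base; it only illustrates how a layer 2 Other could be turned into a unique label.

```python
from typing import Optional


def tag_label(layer1: str, layer2: Optional[str] = None) -> str:
    """Return the Tag label for an ASdb category (hypothetical helper).

    A layer 2 'Other' is ambiguous on its own, so it is prefixed with
    its layer 1 name. All other labels are returned unchanged.
    """
    if layer2 is None:
        # Only a layer 1 category is given.
        return layer1
    if layer2 == 'Other':
        # Disambiguate the layer 2 'Other' with its parent category.
        return f'{layer1} - {layer2}'
    return layer2
```

With this, `tag_label('Computer and Information Technology', 'Other')` gives the label from the example above, while any regular layer 2 name is kept as-is.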

In any case it is a good idea to add a layer property to the CATEGORIZED relationship.
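As a rough sketch of what the layer property could look like, the snippet below builds one CATEGORIZED link per layer as plain dictionaries. The function `categorized_links` and the dictionary keys are assumptions made for illustration and do not reflect the actual IYP crawler API.

```python
from typing import Optional


def categorized_links(asn: int, layer1: str, layer2: Optional[str] = None) -> list:
    """Build CATEGORIZED links for one AS, tagging each with its layer.

    Hypothetical helper: the dict layout is illustrative only.
    """
    links = [{'asn': asn, 'rel': 'CATEGORIZED', 'tag': layer1, 'props': {'layer': 1}}]
    if layer2 is not None:
        # The sub-category gets its own link, marked as layer 2.
        links.append({'asn': asn, 'rel': 'CATEGORIZED', 'tag': layer2, 'props': {'layer': 2}})
    return links
```

Storing the layer on the relationship rather than on the Tag node means the same Tag could still be reused if a label ever shows up in both layers.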

@romain-fontugne any opinions?

romain-fontugne commented 9 months ago

I also prefer option 2. And yes, we should avoid cleaning imported datasets, but in this case I don't really see the difference between a level2 Other and not having that information. I think the level1 Other tag still tells us that the AS appears in ASdb but the classification is not conclusive. So we may keep the level1 Other?
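This variant of option 2 (keep the level1 Other, drop the level2 Other) boils down to a small filter. The sketch below assumes a hypothetical `keep_category` helper; it is not taken from the crawler.

```python
def keep_category(label: str, layer: int) -> bool:
    """Decide whether a category should be pushed (hypothetical helper).

    Level 1 'Other' is kept because it still says the AS is in ASdb.
    Level 2 'Other' is dropped because it carries no extra information.
    """
    return not (label == 'Other' and layer == 2)
```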

Also, before adding the PART_OF relationship between tags, we should double-check that a level2 tag won't be part of two level1 tags. If this happens, we should think of a smart way to handle it.

JustinLoye commented 9 months ago

I double-checked whether any layer2 tag is part of several layer1 tags. This is not the case, EXCEPT for layer2 Metal, Glass, Wood, and Paper Manufacturing, which is part of both layer1 Other (for 6 ASes) and layer1 Manufacturing (for 666 ASes).
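For reference, a check of this kind can be sketched as follows. This is not necessarily the exact script used; the function name `ambiguous_layer2` and the input format (pairs of layer1 and layer2 names) are assumptions, and layer2 Other is skipped because it is expected under every layer1.

```python
from collections import defaultdict


def ambiguous_layer2(rows, ignore=('Other',)):
    """Return layer2 tags that appear under more than one layer1 tag.

    rows: iterable of (layer1, layer2) pairs (assumed input format).
    ignore: layer2 names that are expected under several layer1 tags.
    """
    parents = defaultdict(set)
    for layer1, layer2 in rows:
        if layer2 in ignore:
            continue
        parents[layer2].add(layer1)
    # Keep only the layer2 tags with two or more distinct parents.
    return {l2: sorted(l1s) for l2, l1s in parents.items() if len(l1s) > 1}


# Toy input for illustration only, not the real dataset.
rows = [
    ('Manufacturing', 'Metal, Glass, Wood, and Paper Manufacturing'),
    ('Other', 'Metal, Glass, Wood, and Paper Manufacturing'),
    ('Computer and Information Technology', 'Other'),
    ('Manufacturing', 'Other'),
]
result = ambiguous_layer2(rows)
```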

The 6 ASes with (Metal, Glass, Wood, and Paper Manufacturing) -[:PART_OF]-> (Other) are probably an error in stanford.asdb, so I suggest a manual correction?

Note that Metal, Glass, Wood, and Paper Manufacturing does not appear at all in the category list (https://asdb.stanford.edu/data/NAICSlite.csv), so we can only assume it's an error.
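If we go for a manual correction, it could be as small as an override table applied before pushing. The sketch below is only an illustration: `CORRECTIONS` and `correct_category` are hypothetical names, and mapping the 6 ASes to layer1 Manufacturing is the assumption discussed above, not something confirmed by the dataset authors.

```python
# Hypothetical override table: (layer1, layer2) as found -> corrected pair.
CORRECTIONS = {
    ('Other', 'Metal, Glass, Wood, and Paper Manufacturing'):
        ('Manufacturing', 'Metal, Glass, Wood, and Paper Manufacturing'),
}


def correct_category(layer1: str, layer2: str) -> tuple:
    """Return the corrected (layer1, layer2) pair, or the input unchanged."""
    return CORRECTIONS.get((layer1, layer2), (layer1, layer2))
```

Keeping the corrections in one explicit table makes it easy to see, and later remove, every place where we deviate from the original dataset.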