Closed JustinLoye closed 8 months ago
We have two options:
1. Keep `Other` and model around it, or
2. Drop the `Other` categories.

While I prefer option 2, we would be ignoring data from the original dataset, which we normally do not want to do unless it is obviously broken.
For option 2 we would only need to not push the `Other` node and all relationships to it, so it would be easy.
For option 1, I would prefer not to introduce `(:Tag)-[:PART_OF]->(:Tag)` relationships, since this seems to overcomplicate the graph for a niche (and imho almost useless) edge case. Instead I would propose to concatenate the layer 1 name to the layer 2 name like you wrote and push the node as, e.g., `(:Tag {label: 'Computer and Information Technology - Other'})`.
In any case it is a good idea to add a `layer` property to the `CATEGORIZED` relationship.
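The two options and the `layer` property can be compared side by side with a minimal sketch. It assumes each ASdb classification is available as a layer 1 name plus an optional layer 2 name; the `TagLink` type, the `tag_links` function, and the `drop_layer2_other` flag are hypothetical names for illustration, not part of the existing crawler.

```python
from typing import Iterator, NamedTuple, Optional


class TagLink(NamedTuple):
    label: str  # value for (:Tag {label: ...})
    layer: int  # value for the proposed `layer` property on CATEGORIZED


def tag_links(layer1: str, layer2: Optional[str],
              drop_layer2_other: bool = True) -> Iterator[TagLink]:
    """Yield the tags to link to one AS for a single classification."""
    # The layer 1 tag is always kept, including a layer 1 'Other'.
    yield TagLink(layer1, 1)
    if not layer2:
        return
    if layer2 == 'Other':
        if drop_layer2_other:
            # Option 2: do not push the layer 2 'Other' at all.
            return
        # Option 1: concatenate the layer 1 name to the layer 2 name.
        yield TagLink(f'{layer1} - Other', 2)
        return
    yield TagLink(layer2, 2)


for link in tag_links('Computer and Information Technology', 'Other',
                      drop_layer2_other=False):
    print(link)
```

With `drop_layer2_other=True` the same call yields only the layer 1 tag, which is the behavior described for option 2.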
@romain-fontugne any opinions?
I also prefer option 2. And yes, we should avoid cleaning imported datasets, but in this case I don't really see the difference between a level 2 `Other` and not having that information. I think the level 1 `Other` tag still tells us that the AS appears in ASdb but that the classification is not conclusive. So we may keep the level 1 `Other`?
Also, before adding the `PART_OF` relationship between tags, we should double check that a level 2 tag won't be part of two level 1 tags. If this happens, we should think of a smart way to handle that.
I double checked whether layer 2 tags are part of several layer 1 tags.
It is not the case EXCEPT for layer 2 `Metal, Glass, Wood, and Paper Manufacturing`, which is part of both layer 1 `Other` (for 6 ASes) and layer 1 `Manufacturing` (for 666 ASes).
The 6 ASes with `(Metal, Glass, Wood, and Paper Manufacturing)-[:PART_OF]->(Other)` are probably an error in stanford.asdb, so I suggest doing a manual correction?
Note that `Metal, Glass, Wood, and Paper Manufacturing` does not appear at all in the category list https://asdb.stanford.edu/data/NAICSlite.csv
So we can only assume it's an error.
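For reference, a check of this kind can be sketched as follows. This is not the script that was actually run; it assumes the dataset has been loaded as `(asn, layer1, layer2)` tuples, the function name `multi_parent_layer2` is hypothetical, and layer 2 `Other` is ignored by default on the assumption that it is expected to appear under many layer 1 categories.

```python
from collections import defaultdict


def multi_parent_layer2(rows, ignore=('Other',)):
    """Return {layer2: {layer1: AS count}} for layer 2 tags that
    appear under more than one layer 1 tag.

    rows: iterable of (asn, layer1, layer2) tuples (assumed format).
    """
    parents = defaultdict(lambda: defaultdict(set))
    for asn, layer1, layer2 in rows:
        if not layer2 or layer2 in ignore:
            continue
        parents[layer2][layer1].add(asn)
    return {
        l2: {l1: len(asns) for l1, asns in by_l1.items()}
        for l2, by_l1 in parents.items()
        if len(by_l1) > 1
    }


# Toy rows using documentation ASNs, not real ASdb entries.
rows = [
    (64496, 'Manufacturing', 'Metal, Glass, Wood, and Paper Manufacturing'),
    (64497, 'Manufacturing', 'Metal, Glass, Wood, and Paper Manufacturing'),
    (64498, 'Other', 'Metal, Glass, Wood, and Paper Manufacturing'),
    (64499, 'Computer and Information Technology', 'Other'),
    (64500, 'Manufacturing', 'Other'),
]
print(multi_parent_layer2(rows))
```

On the real dataset, the result would be expected to contain the single layer 2 tag reported above, with its per-layer-1 AS counts.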
Explain the dataset you want to add and how it would contribute to the Internet Yellow Pages.
stanford.asdb classifies ASNs into categories (aka layer 1) and sub-categories (layer 2). However, IYP currently does not store the layer information. It could be nice to have it for several reasons:
- With the tag `Other`, it's unclear whether it refers to the layer 1 `Other` or one of the many layer 2 `Other` (e.g. no distinction between `Other` and `Computer and Information Technology -> Other`)

If possible describe how you would like to model the dataset in the Yellow Pages
- Add a `layer` property to the links `-[r:CATEGORIZED {reference_name:"stanford.asdb"}]-`
- Add a `PART_OF` link to other tags
- `Computer and Information Technology` (no layer 2 information) or `Computer and Information Technology -> Other` (layer 2 information but not really informative). Consider dropping the `Other` layer 2 categories?
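To illustrate how the proposed `layer` property would remove the `Other` ambiguity, here is a hedged sketch that only builds a parameterized Cypher query and its parameters. The `layer` property does not exist yet, the helper name `other_by_layer_query` is hypothetical, and the `(:AS {asn})` and `(:Tag {label})` patterns are assumed to match the IYP data model.

```python
def other_by_layer_query(layer: int):
    """Build a Cypher query listing ASes tagged `Other` at a given layer.

    Returns (query, parameters). `r.layer` is the property proposed in
    this issue, so the query is illustrative until it is implemented.
    """
    if layer not in (1, 2):
        raise ValueError('ASdb only has layer 1 and layer 2 categories')
    query = (
        "MATCH (a:AS)-[r:CATEGORIZED {reference_name: 'stanford.asdb'}]->"
        "(t:Tag {label: 'Other'}) "
        "WHERE r.layer = $layer "
        "RETURN a.asn AS asn"
    )
    return query, {'layer': layer}


query, params = other_by_layer_query(1)
print(query)
print(params)
```

The returned pair can be passed to any Neo4j client that accepts a query string with a parameter dictionary.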