InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0

CISCO Umbrella list #49

Closed romain-fontugne closed 1 year ago

romain-fontugne commented 1 year ago

Import CISCO's Umbrella top domain name list. Data is available here: https://umbrella-static.s3-us-west-1.amazonaws.com/index.html

The added relationships should look like this:

(:DomainName {name: 'com'})-[:RANK {rank: 1}]->(:RANKING {name: 'CISCO Umbrella Top 1 million'})
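
As a rough illustration of the import, here is a minimal sketch that turns ranking lines into rows for a parameterised Cypher query. It assumes the list is a CSV with one `rank,domain` pair per line. The names `parse_ranking`, `RANKING_NAME`, and `QUERY` are hypothetical, and the query is only one possible way to create the relationship shown above. It is not the project's actual crawler code.

```python
import csv
import io

# Ranking node name taken from the relationship pattern in this issue.
RANKING_NAME = 'CISCO Umbrella Top 1 million'

# Hypothetical query: one RANK relationship per row, reusing existing nodes.
QUERY = """
MERGE (r:RANKING {name: $ranking})
WITH r
UNWIND $rows AS row
MERGE (d:DomainName {name: row.name})
MERGE (d)-[:RANK {rank: row.rank}]->(r)
"""


def parse_ranking(csv_text):
    """Parse 'rank,domain' lines (assumed format) into query parameter rows."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 2:
            continue  # skip blank or malformed lines
        rank, name = record
        rows.append({'name': name.strip().lower(), 'rank': int(rank)})
    return rows


# Tiny inline sample instead of the real download.
rows = parse_ranking("1,com\n2,net\n")
params = {'ranking': RANKING_NAME, 'rows': rows}
```

The `params` dictionary could then be passed together with `QUERY` to whatever database session the crawler uses; batching the rows keeps the import to a handful of round trips rather than one per domain.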
m-appel commented 1 year ago

Do we want to create the DomainName chain we talked about as a new postprocess script? I think that would actually be more performant, since we can fetch all existing DomainName nodes in one go and then just fill in the gaps.

For completeness' sake: we are now thinking of modelling each part of the DNS name, for example:

(:DomainName {name: 'g.doubleclick.net'})-[:PART_OF]->
(:DomainName {name: 'doubleclick.net'})-[:PART_OF]->
(:DomainName {name: 'net'})

because there are some DomainNames that do not resolve to an IP, but their subdomains do.
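
To make the "fetch all existing nodes, then fill in the gaps" idea concrete, here is a minimal sketch of how the chain could be derived from a name. The function names `part_of_chain` and `missing_names` are hypothetical, and the sketch assumes names are plain dot-separated labels; it does not touch the database.

```python
def part_of_chain(name):
    """Return (child, parent) pairs for each PART_OF link of a DNS name."""
    labels = name.strip('.').lower().split('.')
    # 'g.doubleclick.net' -> ['g.doubleclick.net', 'doubleclick.net', 'net']
    names = ['.'.join(labels[i:]) for i in range(len(labels))]
    return list(zip(names, names[1:]))


def missing_names(existing):
    """Return parent names required by the chains but absent from `existing`."""
    existing = set(existing)
    needed = set()
    for name in existing:
        for _child, parent in part_of_chain(name):
            needed.add(parent)
    return needed - existing


chain = part_of_chain('g.doubleclick.net')
gaps = missing_names({'g.doubleclick.net', 'net'})
```

Here `chain` holds the two links from the example above, and `gaps` contains only `doubleclick.net`, the one node that would have to be created before linking.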

But long story short, I would create a separate issue for this?

romain-fontugne commented 1 year ago

Yes, we should make a separate issue for this. It will go in a different (post-processing) script.

roopeshsn commented 1 year ago

Hi, @m-appel @romain-fontugne! Could you give me more context on this issue and the post-processing script? I understand that multiple domain names can belong to one IP. After crawling and pushing the data to IYP, the post-processing script will group the domain names, right?

m-appel commented 1 year ago

Hey, I will open a separate issue today with more details and will mark you there!