InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0
39 stars 16 forks

Add IANA root zone file crawler #92

Closed m-appel closed 9 months ago

m-appel commented 9 months ago

This PR implements the IANA root zone file crawler and closes #82. Since this is the first crawler that adds multi-label nodes, additional changes to the OpenINTEL crawlers were required to prevent conflicts.

Description

The IANA root zone file contains NS records for the top-level domains, as well as A/AAAA records for the authoritative name servers.
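As a rough illustration of the records described above, the following sketch extracts NS and A/AAAA records from text in the standard zone-file layout (`name TTL class type rdata`). It is not the crawler's actual code: the function name `parse_root_zone`, the sample data, and the returned structures are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample in root-zone layout: name, TTL, class, type, rdata.
SAMPLE_ZONE = """\
com.\t172800\tIN\tNS\ta.gtld-servers.net.
com.\t172800\tIN\tNS\tb.gtld-servers.net.
a.gtld-servers.net.\t172800\tIN\tA\t192.5.6.30
a.gtld-servers.net.\t172800\tIN\tAAAA\t2001:503:a83e::2:30
com.\t86400\tIN\tDS\t19718 13 2 8ACBB0CD28F41250A80A491389424D341522D946B0DA0C0291F2D3D771D7805A
"""


def parse_root_zone(text):
    """Return (ns_records, address_records) as dicts mapping names to sets."""
    ns_records = defaultdict(set)       # domain name -> name server names
    address_records = defaultdict(set)  # name server name -> IP addresses
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and zone-file comments.
        if not line or line.startswith(';'):
            continue
        # Split into at most five fields so multi-part rdata stays intact.
        fields = line.split(maxsplit=4)
        if len(fields) < 5:
            continue
        name, _ttl, rr_class, rr_type, rdata = fields
        if rr_class != 'IN':
            continue
        # Normalize: drop the trailing root dot and lowercase the name.
        name = name.rstrip('.').lower()
        if not name:
            # The root itself ('.') is skipped in this sketch.
            continue
        if rr_type == 'NS':
            ns_records[name].add(rdata.rstrip('.').lower())
        elif rr_type in ('A', 'AAAA'):
            address_records[name].add(rdata)
    return ns_records, address_records


ns_records, address_records = parse_root_zone(SAMPLE_ZONE)
```

Other record types (such as the DS line in the sample) are simply ignored here, since only NS and A/AAAA records are mentioned in this description.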

This is the first crawler to introduce multi-label nodes: name servers now carry the combined labels DomainName:AuthoritativeNameServer, since every name server is identified by a domain name. Accordingly, this PR also updates the OpenINTEL crawler, which is the other crawler that creates AuthoritativeNameServer nodes. Without this change, there are conflicting constraints.
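To make the multi-label idea concrete, here is a minimal sketch that builds a Cypher statement for such a node, assuming a Cypher-based graph database. The helper `build_merge_query` and its parameters are hypothetical and are not part of the IYP code base; the sketch only shows one way to end up with a single node carrying both labels.

```python
def build_merge_query(key_label, extra_labels, key_property='name'):
    """Build a Cypher MERGE statement for a node with several labels.

    Hypothetical helper for illustration: the node is matched or created
    on `key_label` and `key_property`, then the extra labels are added.
    """
    for label in (key_label, *extra_labels):
        # Labels are interpolated into the query, so restrict them to
        # plain identifiers to avoid building a malformed statement.
        if not label.isidentifier():
            raise ValueError(f'invalid label: {label!r}')
    query = f'MERGE (n:{key_label} {{{key_property}: ${key_property}}})'
    if extra_labels:
        query += ' SET n:' + ':'.join(extra_labels)
    return query


query = build_merge_query('DomainName', ['AuthoritativeNameServer'])
print(query)
```

Merging on one label and then adding the second keeps one node per name, rather than one node per label, which is the kind of consistency between crawlers that the paragraph above calls for. The name value itself would be passed as the `$name` query parameter.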

As part of changing the OpenINTEL crawler, this PR also reduces the execution time of the crawler's link-computation phase by a factor of 10. The previous version used an inefficient method for iterating over the data.

How Has This Been Tested?

These changes were tested as part of a full database creation and were also run independently.

romain-fontugne commented 9 months ago

thanks!