InternetHealthReport / internet-yellow-pages

A knowledge graph for the Internet
https://iyp.iijlab.net
GNU General Public License v3.0
43 stars 18 forks source link

Make IPv6 addresses canonical #87

Closed m-appel closed 5 months ago

m-appel commented 11 months ago

An IPv6 address can be compressed in different ways (see RFC 5952). For example, all these addresses are the same:

We already have a property formatter that ensures IPv6 addresses are always written in lowercase, but maybe we should extend this to ensure a canonical form.

My suggestion would be to use ipaddress module like this to achieve a consistent notation:

ip = str(ipaddress.ipaddress(ip))

The str version of IP addresses are always compressed.

The only question is if we should do this in the format_properties function or rather in a postprocessing script. This function can raise an exception (which might actually be useful) that would need to be handled within the property formatting function.

romain-fontugne commented 11 months ago

Thanks, yes we definitely need a unique way to represent IPv6. I think it's better to implement that in the format_properties function because if a crawler use the wrong format and a later crawler use IPv6 IP addresses then we will have inconsistencies in our nodes. So we should make sure we always use a common format for IPv6. If an exception happens and it's not caught then the crawler will fail be for us this makes debugging easier

m-appel commented 10 months ago

I tried this briefly and noticed an issue that is technically already present with the way we format properties: If the property formatter actually changes the values of some properties, the map returned by the get_nodes functions contains different entries.

For example we have an IP 2001:0DB8:0:0:0:0:0:1 that gets formatted to 2001:0DB8::1, the returned map contains an entry for 2001:0DB8::1. However in the crawler we will try to use the original value for, e.g., link creation, since we are unaware of the change.

This is technically already a problem, I guess the reason why nothing broke yet is that all crawlers already used the expected format to begin with.

We can either keep treating it like this, i.e., a guarantee that if you do not use the correct format already in the crawler, the script will break. Or we return something like a translation table with the returned map, but that seems like a bit of a hassle.

romain-fontugne commented 10 months ago

I think we should write some guidelines somewhere and maybe the easiest for that is our example crawler: https://github.com/InternetHealthReport/internet-yellow-pages/blob/main/iyp/crawlers/example/crawler.py

If we keep best practice there it makes it a good template to start with and not forgetting best practices