InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0

Inconsistent Type for `reference_time` #122

Status: Closed (dpgiakatos closed 7 months ago)

dpgiakatos commented 7 months ago

Describe the bug

I have noticed that reference_time does not have a consistent type. On some edges it is a String, and on others it is a Date object.

To Reproduce

Execute the following Cypher query:

MATCH (cc:Country)-[edge0:COUNTRY]-(r:Ranking)-[edge1:RANK]-(a:AS)
OPTIONAL MATCH (a)-[edge2:NAME {reference_org:'PeeringDB'}]->(pdbn:Name)
OPTIONAL MATCH (a)-[edge3:NAME {reference_org:'BGP.Tools'}]->(btn:Name)
OPTIONAL MATCH (a)-[edge4:NAME {reference_org:'RIPE NCC'}]->(ripen:Name)
WITH COLLECT(DISTINCT [edge0.reference_org, edge0.reference_url, edge0.reference_time]) AS list0,
     COLLECT(DISTINCT [edge1.reference_org, edge1.reference_url, edge1.reference_time]) AS list1,
     COLLECT(DISTINCT [edge2.reference_org, edge2.reference_url, edge2.reference_time]) AS list2,
     COLLECT(DISTINCT [edge3.reference_org, edge3.reference_url, edge3.reference_time]) AS list3,
     COLLECT(DISTINCT [edge4.reference_org, edge4.reference_url, edge4.reference_time]) AS list4
UNWIND list0+list1+list2+list3+list4 AS metadata_list
RETURN DISTINCT metadata_list

Expected behavior

We expected to receive the following results:

╒══════════════════════════════════════════════════════════════════════╕
│"metadata_list"                                                       │
╞══════════════════════════════════════════════════════════════════════╡
│["APNIC","http://v6data.data.labs.apnic.net/ipv6-measurement/Economies│
│/","2024-01-10T00:00:00Z"]                                            │
├──────────────────────────────────────────────────────────────────────┤
│["IHR","https://ihr.iijlab.net/ihr/api/hegemony/countries/?country={co│
│untry}&af=4","2024-01-10T00:00:00Z"]                                  │
├──────────────────────────────────────────────────────────────────────┤
│["PeeringDB","https://peeringdb.com/api/ixlan?depth=2","2024-01-10T00:│
│00:00Z"]                                                              │
├──────────────────────────────────────────────────────────────────────┤
│[null,null,null]                                                      │
├──────────────────────────────────────────────────────────────────────┤
│["BGP.Tools","https://bgp.tools/asns.csv","2024-01-10T00:00:00Z"]     │
├──────────────────────────────────────────────────────────────────────┤
│["RIPE NCC","https://ftp.ripe.net/ripe/asnames/asn.txt","2024-01-10T00│
│:00:00Z"]                                                             │
└──────────────────────────────────────────────────────────────────────┘

However, we received the following:

╒══════════════════════════════════════════════════════════════════════╕
│"metadata_list"                                                       │
╞══════════════════════════════════════════════════════════════════════╡
│["APNIC","http://v6data.data.labs.apnic.net/ipv6-measurement/Economies│
│/","2024-01-10 00:00:00+00:00"]                                       │
├──────────────────────────────────────────────────────────────────────┤
│["IHR","https://ihr.iijlab.net/ihr/api/hegemony/countries/?country={co│
│untry}&af=4","2024-01-10 00:00:00+00:00"]                             │
├──────────────────────────────────────────────────────────────────────┤
│["APNIC","http://v6data.data.labs.apnic.net/ipv6-measurement/Economies│
│/","2024-01-10T00:00:00Z"]                                            │
├──────────────────────────────────────────────────────────────────────┤
│["IHR","https://ihr.iijlab.net/ihr/api/hegemony/countries/?country={co│
│untry}&af=4","2024-01-10T00:00:00Z"]                                  │
├──────────────────────────────────────────────────────────────────────┤
│["PeeringDB","https://peeringdb.com/api/ixlan?depth=2","2024-01-10T00:│
│00:00Z"]                                                              │
├──────────────────────────────────────────────────────────────────────┤
│[null,null,null]                                                      │
├──────────────────────────────────────────────────────────────────────┤
│["BGP.Tools","https://bgp.tools/asns.csv","2024-01-10T00:00:00Z"]     │
├──────────────────────────────────────────────────────────────────────┤
│["RIPE NCC","https://ftp.ripe.net/ripe/asnames/asn.txt","2024-01-10T00│
│:00:00Z"]                                                             │
└──────────────────────────────────────────────────────────────────────┘

The APNIC and IHR entries each appear twice, with the same organization and URL but different reference_time values: "2024-01-10 00:00:00+00:00" in one row and "2024-01-10T00:00:00Z" in the other. One is a String and the other is a Date object, so the two values never compare as equal and DISTINCT cannot collapse the rows.
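The effect can be reproduced outside the database. The sketch below is only an illustration in plain Python, not IYP code: it treats rows as tuples and uses a set in place of DISTINCT. The string variant in the output above happens to match Python's default `str()` formatting of a timezone-aware datetime. Whether the value was actually written that way is an assumption.

```python
from datetime import datetime, timezone

# The same instant, once as a datetime object and once as its string form.
as_datetime = datetime(2024, 1, 10, tzinfo=timezone.utc)
as_string = str(as_datetime)  # Python's default datetime formatting

rows = [
    ("APNIC", as_datetime),
    ("APNIC", as_string),    # same instant, different type
    ("APNIC", as_datetime),  # true duplicate, collapses as expected
]

# A set plays the role of DISTINCT here: values of different types
# never compare equal, so both variants survive deduplication.
distinct_rows = set(rows)

print(as_string)
print(len(distinct_rows))
```

Only the exact duplicate is removed. The string row and the datetime row both remain, which mirrors the doubled APNIC and IHR entries.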

m-appel commented 7 months ago

It seems our add_links and batch_add_links functions handle datetimes slightly differently, probably because of the dict2str function used in add_links: https://github.com/InternetHealthReport/internet-yellow-pages/blob/c9ce0ed5dcab4356514f4f340f7d9f2ea26c8a24/iyp/__init__.py#L69-L70

add_links appears to create Date objects in Neo4j, which is probably what we want, so I'll try to update the batch function accordingly.
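One general way to keep two write paths consistent is to pass every property dict through the same normalization step before it is written. The sketch below illustrates that idea only. `normalize_reference_time` is a hypothetical helper, not a function in IYP, and it is not the fix that was applied. It also assumes that naive timestamps are meant as UTC.

```python
from datetime import datetime, timezone


def normalize_reference_time(props):
    """Return a copy of props whose reference_time is a timezone-aware datetime.

    Hypothetical helper for illustration; not part of the IYP code base.
    """
    normalized = dict(props)
    value = normalized.get("reference_time")
    if isinstance(value, str):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if value.endswith("Z"):
            value = value[:-1] + "+00:00"
        value = datetime.fromisoformat(value)
    if isinstance(value, datetime) and value.tzinfo is None:
        # Assumption: naive timestamps are meant as UTC.
        value = value.replace(tzinfo=timezone.utc)
    if value is not None:
        normalized["reference_time"] = value
    return normalized


from_string = normalize_reference_time(
    {"reference_org": "APNIC", "reference_time": "2024-01-10 00:00:00+00:00"}
)
from_object = normalize_reference_time(
    {"reference_org": "APNIC",
     "reference_time": datetime(2024, 1, 10, tzinfo=timezone.utc)}
)
print(from_string == from_object)
```

If both the single-link and the batch path called a helper like this before writing, the stored type would no longer depend on which function created the edge.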