InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0

Download compressed ROA file in `crawlers.ripe.roa` #117

Closed: m-appel closed 7 months ago

m-appel commented 7 months ago

After February 12th, the `roas.csv` file we fetch from RIPE's FTP server will no longer be available in an uncompressed format (see here). Instead, we need to download (and handle) the compressed `roas.csv.xz` file.
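A minimal sketch of what handling the compressed file could look like, using only the Python standard library. The helper names `fetch_compressed` and `decompress_roas` are hypothetical and not taken from the crawler, and the sketch assumes the file is small enough to decompress in memory:

```python
import lzma
import urllib.request


def fetch_compressed(url: str, timeout: float = 60.0) -> bytes:
    """Download the raw .xz bytes from the given URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def decompress_roas(xz_bytes: bytes) -> str:
    """Decompress an in-memory .xz payload and decode it as UTF-8 text."""
    # lzma.decompress auto-detects the container format, including .xz.
    return lzma.decompress(xz_bytes).decode("utf-8")
```

With these, `decompress_roas(fetch_compressed(url))` would return the CSV text, where `url` points at a `roas.csv.xz` file; the rest of the crawler could then process it as before.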

romain-fontugne commented 7 months ago

Ah, I forgot about this crawler... Related to this, we updated the code for the route-origin-validator library: https://github.com/InternetHealthReport/route-origin-validator/commit/dc8e0b54cae65d5307e8a85294df7cfc2f025c8d

we could reuse these modifications

MAVRICK-1 commented 7 months ago

@m-appel I have already solved this issue :-)

romain-fontugne commented 7 months ago

@MAVRICK-1 we also have to fix the issue here: https://github.com/InternetHealthReport/internet-yellow-pages/blob/9cf01c53e16c6d9f5be7395a157d49609a2bf7dc/iyp/crawlers/ripe/roa.py#L43-L54

MAVRICK-1 commented 7 months ago

> @MAVRICK-1 we also have to fix the issue here: https://github.com/InternetHealthReport/internet-yellow-pages/blob/9cf01c53e16c6d9f5be7395a157d49609a2bf7dc/iyp/crawlers/ripe/roa.py#L43-L54

Sir, I was referring to this commit in route-origin-validator: https://github.com/InternetHealthReport/route-origin-validator/commit/dc8e0b54cae65d5307e8a85294df7cfc2f025c8d :-)

romain-fontugne commented 7 months ago

yes, let's just reuse the same code

MAVRICK-1 commented 7 months ago

> yes, let's just reuse the same code

If I am not wrong, this code: https://github.com/InternetHealthReport/internet-yellow-pages/blob/9cf01c53e16c6d9f5be7395a157d49609a2bf7dc/iyp/crawlers/ripe/roa.py#L52-L68

fetches data from a RIPE ROA file, processes the CSV content line by line, extracts the relevant fields, and aggregates them into a dictionary (`prefix_info`). The dictionary is keyed by prefix, and each prefix maps to a list of dictionaries holding the URL, ASN, and other details of each ROA for that prefix.
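The aggregation described here can be sketched roughly as follows. This is not the crawler's actual code: the function name `aggregate_roas`, the field names in the per-ROA dictionaries, and the six-column layout (`URI,ASN,IP Prefix,Max Length,Not Before,Not After`) are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_roas(csv_text: str) -> dict:
    """Group ROA rows by prefix (hypothetical helper).

    Assumes a header line followed by rows with six columns:
    URI, ASN, IP Prefix, Max Length, Not Before, Not After.
    """
    prefix_info = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) != 6:
            continue  # ignore malformed or empty lines
        uri, asn, prefix, max_length, not_before, not_after = row
        # Assumed ASN format: "AS64496"; strip the prefix if present.
        asn_number = int(asn[2:]) if asn.startswith("AS") else int(asn)
        prefix_info[prefix].append({
            "uri": uri,
            "asn": asn_number,
            "max_length": int(max_length) if max_length else None,
            "start": not_before,  # validity period, kept as raw strings
            "end": not_after,
        })
    return dict(prefix_info)
```

Each prefix then maps to one entry per ROA, so two ROAs covering the same prefix with different origin ASNs end up in the same list.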

Since https://ftp.ripe.net/rpki/apnic.tal/2023/11/06/ contains an output.json.xz file, we can decompress it and use that.

m-appel commented 7 months ago

The output.json.xz file does not contain the validity period of the ROA, which we would like to keep, so let's use roas.csv.xz.