InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0

High Time Consumption by Specific Crawlers During Indexing Process #80

Closed mohamedawnallah closed 9 months ago

mohamedawnallah commented 9 months ago

**Describe the bug**
Two crawlers, `iyp.crawlers.cloudflare.dns_top_locations` and `iyp.crawlers.cloudflare.dns_top_ases`, consume an excessive amount of time during indexing: together they account for over 83% of the total indexing time, with the former contributing 43.07% and the latter 40.70%.

**To Reproduce**

  1. Download the `iyp-2023-12-01.log` file from the following link: https://exp1.iijlab.net/wip/iyp/dumps/2023/12/01/iyp-2023-12-01.log
  2. Use the Python script below to measure how much time each process takes during the indexing process.
    
```python
from datetime import datetime
import re

import matplotlib.pyplot as plt

# Log file path
log_file = "iyp-2023-12-01.log"

# Initialize variables
processes = {}
total_time = 0
current_process = None

# Read log file and process lines
with open(log_file, "r") as file:
    for line in file:
        # Extract the timestamp at the beginning of the line
        timestamp_match = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if not timestamp_match:
            continue
        timestamp_str = timestamp_match.group(1)
        timestamp = datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")

        # Use regex to extract the module name within quotes after "<module '"
        match = re.search(r"<module '(.*?)'", line)
        if match:
            module_name = match.group(1)

            # Check for process start/end and update dictionary
            if "start" in line:
                current_process = module_name
                processes[current_process] = {"start": timestamp}
            elif "end" in line:
                if current_process in processes:
                    end_time = timestamp
                    processes[current_process]["end"] = end_time
                    duration = (
                        end_time - processes[current_process]["start"]
                    ).total_seconds()
                    processes[current_process]["duration"] = (
                        duration if duration > 0 else 0
                    )
                    total_time += processes[current_process]["duration"]
                    current_process = None

# Filter out processes without duration information
processes = {
    process: info for process, info in processes.items() if "duration" in info
}

# Calculate the contribution of each process to the total time
contribution = {
    process: (info["duration"] / total_time) * 100
    for process, info in processes.items()
}

# Visualize contributions
plt.figure(figsize=(10, 6))
plt.barh(list(contribution.keys()), list(contribution.values()), color="skyblue")
plt.xlabel("Contribution to Total Time (%)")
plt.title("Process Time Contribution")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
```



**Expected behavior**
I’m not sure whether the extensive time consumption by these crawlers is expected behavior or indicates an issue.

**Screenshots**
A visual representation of each process's contribution to the total indexing time, generated with the Python script above.
<img width="1082" alt="Indexing Processes Time Contribution" src="https://github.com/InternetHealthReport/internet-yellow-pages/assets/69568555/2ba9f579-ccb9-4b49-be67-d299e861cceb">
m-appel commented 9 months ago

Thanks for the analysis script; it might be useful for us to get an overview from time to time.

We are aware that these crawlers take a long time, but there is not much we can do about it. For each domain name in the top 10k of any ranking in IYP we fetch the top 100 countries and ASes with the most requests. This results in at least 20k queries to the Cloudflare API and creates ~4.6m relationships in the database, i.e., it's just a lot of data.

I think there is a reason why we do sequential requests; maybe @romain-fontugne knows more.
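
For a sense of scale, the fan-out looks roughly like the sketch below. The endpoint path, parameters, and `fetch_top` helper are illustrative assumptions, not the actual crawler code; only the query count (10k domains × 2 endpoints ≈ 20k sequential requests, up to 100 rows each) comes from the numbers above.

```python
import requests

# Hypothetical Radar-style endpoint and placeholder token; the real crawlers
# are iyp.crawlers.cloudflare.dns_top_locations and dns_top_ases.
API_BASE = "https://api.cloudflare.com/client/v4/radar/dns/top"
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}  # placeholder


def fetch_top(entity: str, domain: str) -> list:
    """Fetch the top 100 locations or ASes for one domain (assumed schema)."""
    resp = requests.get(
        f"{API_BASE}/{entity}",
        params={"domain": domain, "limit": 100},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("result", [])


# 10k domains x 2 endpoints = at least 20k sequential API calls,
# each yielding up to 100 rows, i.e. millions of relationships in total.
top_10k_domains = ["example.org"]  # stand-in for the real rankings
for domain in top_10k_domains:
    for entity in ("locations", "ases"):
        rows = fetch_top(entity, domain)
```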

romain-fontugne commented 9 months ago

Yes, the problem comes from the Cloudflare API. We have to send them thousands of queries and they apply rate limiting, so we cannot do it very fast. Hopefully one day we'll get better access to their data.
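
For anyone running these crawlers against their own token, one common way to live with rate limiting is to retry with exponential backoff on HTTP 429 responses. This is a minimal sketch, assuming the limit is signalled via 429 with a `Retry-After` header in seconds; `get_with_backoff` is a hypothetical helper, not something the crawlers currently use.

```python
import time

import requests


def get_with_backoff(url, params=None, max_retries=5):
    """GET that retries on HTTP 429 with exponential backoff."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Honor the server's Retry-After hint when present (assumed to be
        # in seconds), otherwise fall back to exponential backoff.
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```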