mohamedawnallah closed this issue 9 months ago
Thanks for the analysis script, it might be useful for us to get an overview from time to time.
We are aware that these crawlers take a long time, but there is not much we can do about it. For each domain name in the top 10k of any ranking in IYP, we fetch the top 100 countries and the top 100 ASes with the most requests. This results in at least 20k queries to the Cloudflare API and creates ~4.6 million relationships in the database, i.e., it's just a lot of data.
I think there is a reason why we do sequential requests, maybe @romain-fontugne knows more.
Yes, the problem comes from the Cloudflare API. We have to send them thousands of queries and they apply rate limiting, so we cannot do it very fast. Hopefully one day we'll get better access to their data.
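To illustrate why sequential, rate-limited querying is slow, here is a minimal sketch of a throttled request loop. It is not the actual IYP crawler code: `fetch_sequentially`, the `fetch` callable, and `min_interval` are hypothetical names, and the interval used below is an illustrative assumption, not Cloudflare's real limit.

```python
import time


def fetch_sequentially(items, fetch, min_interval, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(item) for each item, keeping calls at least min_interval seconds apart.

    `fetch` stands in for whatever function sends one API query (hypothetical).
    `sleep` and `clock` are injectable so the pacing can be tested without waiting.
    """
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait out the remainder of the interval since the previous call.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(item))
    return results


# Back-of-the-envelope lower bound for the pacing alone, assuming
# (hypothetically) one query per second and the 20k queries mentioned above.
estimated_hours = 20_000 * 1.0 / 3600
```

With that purely illustrative interval, the waiting time alone is already several hours, before counting response times or database writes.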
Describe the bug
The indexing process currently faces an issue where two specific crawlers, iyp.crawlers.cloudflare.dns_top_locations and iyp.crawlers.cloudflare.dns_top_ases, consume an excessive amount of time, representing over 83% of the total indexing time. The former accounts for 43.07%, while the latter contributes 40.70%.

To Reproduce
Download the iyp-2023-12-01.log file from the following link: https://exp1.iijlab.net/wip/iyp/dumps/2023/12/01/iyp-2023-12-01.log

Run the following script:

import matplotlib.pyplot as plt

# Log file path
log_file = "iyp-2023-12-01.log"

# Initialize variables
processes = {}
total_time = 0

# Read log file and process lines
with open(log_file, "r") as file:
    lines = file.readlines()

for line in lines:
    # Extract timestamp and process name
    ...

# Filter out processes without duration information
processes = {process: info for process, info in processes.items() if "duration" in info}

# Calculate the contribution of each process to the total time
contribution = {
    process: (info["duration"] / total_time) * 100
    for process, info in processes.items()
}

# Visualize contributions
plt.figure(figsize=(10, 6))
plt.barh(list(contribution.keys()), list(contribution.values()), color="skyblue")
plt.xlabel("Contribution to Total Time (%)")
plt.title("Process Time Contribution")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
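The body of the "Extract timestamp and process name" step is not shown in the script. One way it might look is sketched below, under a loudly stated assumption: the line format matched by `LINE_RE` (a timestamp, then `start` or `end`, then the crawler name) is hypothetical and would need to be adapted to the real IYP log format. `parse_durations` is likewise an illustrative helper, not part of the original script.

```python
import re
from datetime import datetime

# Hypothetical line format (an assumption, not taken from the real IYP log):
#   2023-12-01 00:05:12 start iyp.crawlers.example
#   2023-12-01 00:47:40 end iyp.crawlers.example
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<event>start|end)\s+(?P<name>\S+)"
)


def parse_durations(lines):
    """Return (processes, total_time) where durations are in seconds."""
    processes = {}
    total_time = 0.0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        info = processes.setdefault(match["name"], {})
        if match["event"] == "start":
            info["start"] = ts
        elif "start" in info:
            # An "end" line closes a previously seen "start" line.
            info["end"] = ts
            info["duration"] = (ts - info["start"]).total_seconds()
            total_time += info["duration"]
    return processes, total_time


# Tiny made-up sample in the assumed format, only to show the call shape.
sample = [
    "2023-12-01 00:00:00 start crawler_a",
    "2023-12-01 00:01:00 end crawler_a",
    "2023-12-01 00:01:00 start crawler_b",
    "2023-12-01 00:04:00 end crawler_b",
]
processes, total_time = parse_durations(sample)
```

The returned `processes` and `total_time` have the shape the filtering and contribution steps of the script expect, so the same percentage calculation and plot can be applied to them unchanged.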