InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0
39 stars 16 forks source link

Add `expire_after` to CachedSession #114

Closed m-appel closed 8 months ago

m-appel commented 8 months ago

In places where we use requests_cache.CachedSession (currently only in the PeeringDB crawlers), we should add the expire_after parameter. By default, responses are cached indefinitely (see here), so if we do not properly delete the cache between our weekly iterations, we are actually always recycling the first result.

I guess a suitable value would be six days, so that we can rerun the crawler(s) if necessary, but are guaranteed to get fresh data in our weekly interval. This can be achieved with timedelta:

from datetime import timedelta

session = requests_cache.CachedSession(expire_after=timedelta(days=6))
MAVRICK-1 commented 8 months ago

i had modifiied the code and provided a snipped for fac.py can you review it , is that right or wrong?

import argparse
import json
import logging
import os
import sys
from datetime import timedelta  # Added import

import flatdict
import iso3166
import requests_cache

from iyp import BaseCrawler
from iyp.crawlers.peeringdb.ix import handle_social_media

# ... (existing code)

# Add this block to set up CachedSession with expire_after
expire_after_days = 6
cache_expire_time = timedelta(days=expire_after_days)
session = requests_cache.CachedSession(expire_after=cache_expire_time)

class Crawler(BaseCrawler):
    def _init_(self, organization, url, name):
        """Initialization for pushing PeeringDB facilities to IYP."""

        self.headers = {'Authorization': 'Api-Key ' + API_KEY}
        self.requests = session  # Use the session with expire_after

        super()._init_(organization, url, name)

    # ... (existing code)
MAVRICK-1 commented 8 months ago

@m-appel

m-appel commented 8 months ago

This looks almost correct, but I would place your code block inside the __init__ function to not create a global variable. And (personally) I would not create two variables for the number of days and the timedelta since expire_after=timedelta(days=6) already seems expressive enough for me, but that's just personal taste.

MAVRICK-1 commented 8 months ago

@m-appel thankyou for the review 😊, next will keep in mind