alex9smith / gdelt-doc-api

A Python client for the GDELT 2.0 Doc API
MIT License

`start_date` and `end_date` with an hourly difference #25

Closed: completelyboofyblitzed closed this 1 year ago

completelyboofyblitzed commented 1 year ago

Thank you for the project! Is it possible to filter by the exact hours of the specific dates? Something like this:

f = Filters(
    country = "US",
    start_date = "2023-03-10-00-00-00",
    end_date = "2023-03-11-01-00-00",
)

And if not with this library, do you by chance know if it's solvable?

completelyboofyblitzed commented 1 year ago

Okay, I solved it by changing `f.query_params` directly.

FishStick438 commented 1 year ago

@completelyboofyblitzed I also encountered this problem, could you please explain in detail how to solve it? Thanks!

completelyboofyblitzed commented 1 year ago

@FishStick438 Well, I did it once, very suboptimally 👉👈

from datetime import timedelta

from gdeltdoc import GdeltDoc, Filters

gd = GdeltDoc()

# I initialized the object
f = Filters(
    start_date = <start_day>,
    end_date = <end_day>,     # an unused variable in my case, just to initialize the object
    country = "US",
)

# then I iterated over the half hours of the day, since it's the smallest time unit possible
for i in range(0, 24 * 2):
    # I converted the window boundaries to the format required by the API
    start_hour = (<start_day_datetime> + timedelta(minutes=30 * i)).strftime("%Y%m%d%H%M%S")
    end_hour = (<start_day_datetime> + timedelta(minutes=30 * (i + 1))).strftime("%Y%m%d%H%M%S")
    # I replaced the start and end query parameters with that half-hour window
    f.query_params[1] = f"&startdatetime={start_hour}"
    f.query_params[2] = f"&enddatetime={end_hour}"
    # I got the articles for that half hour :D
    halfhour_articles = gd.article_search(f)
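
A note on the indices: `query_params` holds the raw `&key=value` fragments that end up in the request URL, and the position of each fragment depends on which filters you passed, so print the list first to confirm which slot holds each date parameter. A minimal check (the assignments below assume the same ordering as in my loop above):

from gdeltdoc import Filters

f = Filters(
    country="US",
    start_date="2023-03-10",
    end_date="2023-03-11",
)

# print the "&key=value" fragments to see which index holds each parameter
print(f.query_params)

# once the indices are confirmed, overwrite the fragments with exact times
f.query_params[1] = "&startdatetime=20230310000000"
f.query_params[2] = "&enddatetime=20230310003000"
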
Swatsicle commented 1 month ago

The subjects I'm searching for generate a very wide variety of news articles, so for efficiency I used your method to split the date range whenever a query hits the 250-article limit (down to the minimum of half an hour required by GDELT).

This is the code:

from datetime import timedelta

from gdeltdoc import GdeltDoc, Filters

# logger, gdelt_ratelimit and process_and_store_articles are helpers defined
# elsewhere in my code (logging setup, a rate-limit sleep, and SQLite storage)
def fetch_gdelt_lines_for_range_iterative(query, start_date, end_date, db_name):
    """
    Fetches GDELT articles for a given query and date range using an iterative approach.

    Args:
        query (str or tuple or list): The query(s) to search for.
        start_date (datetime): The start date to search for articles.
        end_date (datetime): The end date to search for articles.
        db_name (str): The name of the SQLite database to store the articles in.
    """

    logger.info(f"Fetching articles for query(s) {query} from {start_date} to {end_date}")

    total_added_entries = 0
    total_dupes = 0
    min_duration = timedelta(hours=1)  # Minimum duration for splitting

    # Initialize a stack with the initial date range
    date_ranges = [(start_date, end_date)]

    while date_ranges:
        current_start, current_end = date_ranges.pop()

        # Prepare the filters for the current date range
        gd = GdeltDoc()
        filters = Filters(
            keyword=query,
            start_date=current_start.strftime('%Y%m%d'),
            end_date=current_end.strftime('%Y%m%d')
        )

        # Directly alter the query params
        start_query_param = f"&startdatetime={current_start.strftime('%Y%m%d%H%M%S')}"
        filters.query_params[1] = start_query_param

        end_query_param = f"&enddatetime={current_end.strftime('%Y%m%d%H%M%S')}"
        filters.query_params[2] = end_query_param

        try:
            # Anti-ratelimit clock
            gdelt_ratelimit()

            # Perform the search
            df_articles = gd.article_search(filters=filters)

            if len(df_articles) == 250:
                # If we hit the limit, split the range and search each half
                duration = current_end - current_start

                if duration <= min_duration:
                    logger.info(f"Minimum query window reached: {current_start} to {current_end}")
                    added_entries, dupes = process_and_store_articles(df_articles, db_name)
                    total_added_entries += added_entries
                    total_dupes += dupes
                    continue

                mid_date = current_start + duration / 2
                mid_date = mid_date.replace(second=0, microsecond=0)  # Ensure clean split

                if (mid_date - current_start) < min_duration:
                    mid_date = current_start + min_duration

                if (current_end - mid_date) < min_duration:
                    mid_date = current_end - min_duration

                logger.info(f"Maxed out articles (250), splitting range: {current_start} to {mid_date} and {mid_date} to {current_end}")
                date_ranges.append((mid_date, current_end))
                date_ranges.append((current_start, mid_date))
            else:
                logger.info(f"Fetched {len(df_articles)} articles for {current_start} to {current_end}")
                added_entries, dupes = process_and_store_articles(df_articles, db_name)
                total_added_entries += added_entries
                total_dupes += dupes

        except Exception as e:
            logger.error(f"Error fetching articles for {current_start} to {current_end}: {e}")

    return total_added_entries, total_dupes

The end result is essentially a binary-search-inspired split, which keeps it efficient for any subject regardless of how much coverage there is.
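
For reference, a call might look like this (hypothetical values; it assumes the logging and storage helpers above are defined):

from datetime import datetime

# hypothetical invocation: the query, dates and database name are example values
added, dupes = fetch_gdelt_lines_for_range_iterative(
    query="climate change",
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 1, 8),
    db_name="articles.db",
)
print(f"Added {added} new articles, skipped {dupes} duplicates")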