Closed by completelyboofyblitzed 1 year ago
Okay, I solved it by changing f.query_params directly.
@completelyboofyblitzed I also encountered this problem, could you please explain in detail how to solve it? Thanks!
@FishStick438 Well, I did it once, very unoptimally 👉👈
from datetime import timedelta
from gdeltdoc import GdeltDoc, Filters

gd = GdeltDoc()
# I initialized the object
f = Filters(
    start_date = <start_day>,
    end_date = <end_day>, # an unused variable in my case, just to initialize the object
    country = "US",
)
# then I iterated over the half hours of the day, since it's the smallest time unit possible
for i in range(0, 24*2):
    # start of the i-th half hour of the day
    window_start = <start_day_datetime> + timedelta(minutes=30 * i)
    # I converted the hours to the format required by the API
    start_hour = window_start.strftime("%Y%m%d%H%M%S")
    end_hour = (window_start + timedelta(minutes=31)).strftime("%Y%m%d%H%M%S")
    # I placed the hours in the query format required by the API
    # and replaced the start and end query parameters with them
    f.query_params[1] = f"&startdatetime={start_hour}"
    f.query_params[2] = f"&enddatetime={end_hour}"
    # I got the articles for that half hour :D
    halfhour_articles = gd.article_search(f)
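For anyone adapting the loop above, the window arithmetic can be checked on its own before touching the API. This is only a minimal sketch using the standard library: `half_hour_windows` is a hypothetical helper name, not part of gdeltdoc, and the timestamp format is simply the one used in the snippet above.

```python
from datetime import datetime, timedelta

API_FORMAT = "%Y%m%d%H%M%S"  # same format string as in the snippet above


def half_hour_windows(day_start, step=timedelta(minutes=30)):
    """Yield (start, end) timestamp strings covering 24 hours from day_start.

    Hypothetical helper: it only formats timestamps and makes no API calls.
    """
    day_end = day_start + timedelta(days=1)
    current = day_start
    while current < day_end:
        # Clamp the last window so it never runs past the end of the day
        window_end = min(current + step, day_end)
        yield current.strftime(API_FORMAT), window_end.strftime(API_FORMAT)
        current = window_end


windows = list(half_hour_windows(datetime(2024, 1, 1)))
print(len(windows), windows[0], windows[-1])
```

Each pair can then be dropped into the `&startdatetime=` and `&enddatetime=` strings the same way the loop above does.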
The subjects that I'm looking for have a very wide variety of news articles, so for efficiency, I used your method to split the date range whenever a query hits the 250-article limit (down to a minimum of half an hour required by GDELT).
This is the code:
import logging
from datetime import timedelta

from gdeltdoc import GdeltDoc, Filters

logger = logging.getLogger(__name__)

# gdelt_ratelimit() and process_and_store_articles() are helpers not shown here


def fetch_gdelt_lines_for_range_iterative(query, start_date, end_date, db_name):
    """
    Fetches GDELT articles for a given query and date range using an iterative approach.

    Args:
        query (str or tuple or list): The query(s) to search for.
        start_date (datetime): The start date to search for articles.
        end_date (datetime): The end date to search for articles.
        db_name (str): The name of the SQLite database to store the articles in.

    Returns:
        tuple: (total_added_entries, total_dupes)
    """
    logger.info(f"Fetching articles for query(s) {query} from {start_date} to {end_date}")
    total_added_entries = 0
    total_dupes = 0
    min_duration = timedelta(hours=1)  # Minimum duration for splitting
    # Initialize a stack with the initial date range
    date_ranges = [(start_date, end_date)]
    while date_ranges:
        current_start, current_end = date_ranges.pop()
        # Prepare the filters for the current date range
        gd = GdeltDoc()
        filters = Filters(
            keyword=query,
            start_date=current_start.strftime('%Y%m%d'),
            end_date=current_end.strftime('%Y%m%d')
        )
        # Directly alter the query params
        start_query_param = f"&startdatetime={current_start.strftime('%Y%m%d%H%M%S')}"
        filters.query_params[1] = start_query_param
        end_query_param = f"&enddatetime={current_end.strftime('%Y%m%d%H%M%S')}"
        filters.query_params[2] = end_query_param
        try:
            # Anti-ratelimit clock
            gdelt_ratelimit()
            # Perform the search
            df_articles = gd.article_search(filters=filters)
            if len(df_articles) == 250:
                # If we hit the limit, split the range and search each half
                duration = current_end - current_start
                if duration <= min_duration:
                    # Cannot split any further: store what we got and move on
                    logger.info(f"Minimum query window reached: {current_start} to {current_end}")
                    added_entries, dupes = process_and_store_articles(df_articles, db_name)
                    total_added_entries += added_entries
                    total_dupes += dupes
                    continue
                mid_date = current_start + duration / 2
                mid_date = mid_date.replace(second=0, microsecond=0)  # Ensure clean split
                if (mid_date - current_start) < min_duration:
                    mid_date = current_start + min_duration
                if (current_end - mid_date) < min_duration:
                    mid_date = current_end - min_duration
                logger.info(f"Maxed out articles (250), splitting range: {current_start} to {mid_date} and {mid_date} to {current_end}")
                # Push the later half first so the earlier half is processed next
                date_ranges.append((mid_date, current_end))
                date_ranges.append((current_start, mid_date))
            else:
                logger.info(f"Fetched {len(df_articles)} articles for {current_start} to {current_end}")
                added_entries, dupes = process_and_store_articles(df_articles, db_name)
                total_added_entries += added_entries
                total_dupes += dupes
        except Exception as e:
            logger.error(f"Error fetching articles for {current_start} to {current_end}: {e}")
    return total_added_entries, total_dupes
The end result is that it essentially uses a binary-search-inspired approach: a range is only split when it maxes out at 250 articles, so it stays pretty efficient for any subject regardless of how much coverage there is.
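To see the splitting behaviour without calling GDELT, here is a stripped-down sketch of the same idea run against fake data. Everything in it is an assumption made for illustration: `split_until_under_limit` and `fake_count` are hypothetical names, the search call is replaced by a counting function, and the clamping of the midpoint from the code above is left out for brevity.

```python
from datetime import datetime, timedelta


def split_until_under_limit(start, end, count_in_window, limit=250,
                            min_duration=timedelta(hours=1)):
    """Return (start, end, count) for every window that would be stored.

    count_in_window(start, end) stands in for the real search call and
    returns the number of records for that window, capped at `limit`.
    """
    stack = [(start, end)]
    leaves = []
    while stack:
        lo, hi = stack.pop()
        n = count_in_window(lo, hi)
        duration = hi - lo
        # Keep the window if it is under the cap or cannot be split further
        if n < limit or duration <= min_duration:
            leaves.append((lo, hi, n))
            continue
        mid = lo + duration / 2
        # Push the later half first so the earlier half is processed next
        stack.append((mid, hi))
        stack.append((lo, mid))
    return leaves


# Fake data: one article per minute during the first 10 hours of the day
base = datetime(2024, 1, 1)
events = [base + timedelta(minutes=m) for m in range(600)]


def fake_count(lo, hi, limit=250):
    # Half-open window, capped like a search limited to 250 records
    return min(limit, sum(1 for t in events if lo <= t < hi))


leaves = split_until_under_limit(base, base + timedelta(days=1), fake_count)
for lo, hi, n in leaves:
    print(lo, hi, n)
```

With this fake data the busy morning is split into smaller windows while the empty second half of the day costs a single query, which is where the efficiency comes from.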
Thank you for the project! Is it possible to filter by the exact hours of the specific dates? Something like this:
And if not with this library, do you by chance know if it's solvable?