Open fuzexin opened 2 years ago
Hi, when I processed the GDELT data, I removed the events that were missing some entities.
import numpy as np
import pandas as pd
# event_file is the event JSON file (one JSON record per line)
event_df = pd.read_json(event_file, lines=True, dtype={"EventCode": str})
event_df = event_df.replace('nan', np.nan)
# keep only events where both actor names are present
event_df = event_df.loc[event_df['Actor1Name'].notna() & event_df['Actor2Name'].notna()]
@amy-deng
Thanks a lot, your reply was so fast ~ But actually, I also deleted the records that have NULL fields. We download the GDELT data and store it in a database, which we update from the official GDELT website every day.
For comparison, we run an SQL statement like this to query the data:
select count(*) from gdelt.event where EventDate >='2017-01-01' and EventDate <= '2019-12-31' and ActionGeo_ADM1Code = 'EG11' /*Cairo*/ and Actor1Name != '' and Actor2Name != '' and EventCode != ''
The result is 781605.
Do you store the data in a database? If yes, what does your query condition look like? If not, what are your filter conditions for location and date?
It is worthwhile to keep the differences between our experimental datasets as small as possible.
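For anyone comparing without a database, here is a rough pandas sketch of the same filter as the SQL above. It assumes the DataFrame uses the same column names as the query (`EventDate`, `ActionGeo_ADM1Code`, `Actor1Name`, `Actor2Name`, `EventCode`); the helper name `filter_events` and its parameters are made up for illustration and are not from the CMF code.

```python
import pandas as pd


def filter_events(event_df, adm1_code="EG11",
                  start="2017-01-01", end="2019-12-31"):
    """Hypothetical helper mirroring the SQL conditions above."""
    cols = ["Actor1Name", "Actor2Name", "EventCode"]
    # A field counts as present if it is not NULL/NaN and is neither
    # an empty string nor the literal string 'nan'.
    present = event_df[cols].notna() & ~event_df[cols].isin(["", "nan"])
    dates = pd.to_datetime(event_df["EventDate"])
    mask = (
        dates.between(pd.Timestamp(start), pd.Timestamp(end))  # inclusive
        & (event_df["ActionGeo_ADM1Code"] == adm1_code)
        & present.all(axis=1)
    )
    return event_df.loc[mask]
```

Calling `len(filter_events(event_df))` on both sides should then give a count that can be compared directly with the `count(*)` result.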
Hi @fuzexin I double checked my data.
@amy-deng Thank you, Doctor Deng, I will try with your instruction.
After reading your paper and the code on GitHub for the CMF model, I generated some training data from GDELT myself, using the dates and locations you gave. In the process, I found that my generated dataset (Egypt) differs from yours.
The first difference is in the number of positive samples and the total number of samples; the tables show a huge gap:
For Cairo, the distribution of the daily target event count is shown below; there are no target event records around date index 400~800 in your data:
The conditions I used to generate my data:
In that case, I wonder whether you generate the samples from GDELT in a different way, or do some pre-processing before training. Looking forward to your reply, thanks.
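To make the gap around date index 400~800 easier to reproduce, a minimal sketch of how the daily target event counts could be computed and scanned for long runs of zero-count days. It assumes that a target event is identified by a prefix of `EventCode`; the `target_root_code` parameter, the function names, and the `min_len` threshold are placeholders of mine, not taken from the paper or the CMF code.

```python
import pandas as pd


def daily_target_counts(event_df, target_root_code, start, end):
    """Count target events per day, keeping days with zero events."""
    dates = pd.to_datetime(event_df["EventDate"]).dt.normalize()
    # Assumption: target events share an EventCode prefix.
    is_target = event_df["EventCode"].astype(str).str.startswith(target_root_code)
    days = pd.date_range(start, end, freq="D")
    counts = dates[is_target].value_counts().reindex(days, fill_value=0)
    # Position in the result is the date index (0 = start date).
    return counts.reset_index(drop=True)


def zero_runs(counts, min_len=30):
    """Return (first_index, last_index) for each run of zero-count days
    that lasts at least min_len days."""
    runs, run_start = [], None
    for i, c in enumerate(counts):
        if c == 0 and run_start is None:
            run_start = i
        elif c != 0 and run_start is not None:
            if i - run_start >= min_len:
                runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None and len(counts) - run_start >= min_len:
        runs.append((run_start, len(counts) - 1))
    return runs
```

Running `zero_runs(daily_target_counts(event_df, code, "2017-01-01", "2019-12-31"))` on both datasets, with `code` set to whatever prefix defines the target event, would show whether the empty stretch appears in only one of them.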