Question about Generation of Training Data

fuzexin commented 2 years ago

After reading your paper and the GitHub code for the CMF model, I generated some training data from GDELT myself, using the dates and locations you gave. In the process, I found that my generated dataset (Egypt) differs from yours. First, the number of positive samples and the total number of records differ greatly between the tables. Second, for Cairo, the distribution of the daily target-event count is shown below; there are no target-event records around date index 400~800 in your data:

[Figure: Deng_Mine_Cairo]

The conditions I used to generate my data:

the date interval is 2017-01-01 to 2019-12-31, achieved by limiting SQLDATE;
ActionGeo_ADM1Code is set to the city's code for each city, e.g. 'EG11' for Cairo;
a day is a positive sample if it has one or more target-event (14, protest) records.
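
A minimal pandas sketch of these three conditions, for comparison with other pipelines. It assumes the events are already in a DataFrame with the GDELT columns SQLDATE (YYYYMMDD), ActionGeo_ADM1Code and EventCode (stored as strings, so protest codes start with "14"). The function name `daily_protest_labels` is made up for illustration and is not part of the CMF code.

```python
import pandas as pd

def daily_protest_labels(events: pd.DataFrame, adm1_code: str,
                         start: str = "2017-01-01",
                         end: str = "2019-12-31") -> pd.Series:
    """Return one 0/1 label per calendar day (1 = at least one protest event)."""
    df = events.copy()
    # SQLDATE is a YYYYMMDD value; parse it into a proper date.
    df["date"] = pd.to_datetime(df["SQLDATE"].astype(str), format="%Y%m%d")
    in_range = df["date"].between(pd.Timestamp(start), pd.Timestamp(end))
    in_city = df["ActionGeo_ADM1Code"] == adm1_code
    # ASSUMPTION: EventCode is a string, so root code 14 means "starts with 14".
    is_protest = df["EventCode"].astype(str).str.startswith("14")
    counts = df.loc[in_range & in_city & is_protest].groupby("date").size()
    # Days without any matching record become explicit negatives.
    days = pd.date_range(start, end, freq="D")
    return (counts.reindex(days, fill_value=0) > 0).astype(int)
```

Summing the returned series gives the number of positive days, which is the figure to compare across datasets.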

Given this, I wonder whether you generate the samples from GDELT in a different way, or whether you did some pre-processing before training. Looking forward to your reply, thanks.

amy-deng commented 2 years ago

Hi, when I processed the GDELT data, I removed the events that were missing some entities.

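import numpy as np
import pandas as pd
# event_file is the event json file (one JSON record per line)
event_df = pd.read_json(event_file, lines=True, dtype={"EventCode": str})
event_df = event_df.replace('nan', np.nan)  # turn the string 'nan' into a real NaN
# keep only events where both actor names are present
event_df = event_df.loc[event_df['Actor1Name'].notna() & event_df['Actor2Name'].notna()]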

fuzexin commented 2 years ago

@amy-deng
Thanks a lot, your reply was so fast ~ But actually, I also delete the records that have NULL fields. We fetch the GDELT data and store it in a database, updating it from the official GDELT website every day. For comparison, we run an SQL statement like this:

select count(*) from gdelt.event
where EventDate >= '2017-01-01' and EventDate <= '2019-12-31'
and ActionGeo_ADM1Code = 'EG11' /*Cairo*/
and Actor1Name != '' and Actor2Name != '' and EventCode != ''

The result is 781605. Do you store the data in a database? If so, what does your query condition look like? If not, what are your filter conditions for location and date? It would be worthwhile to keep the experimental datasets used by different researchers as close to each other as possible.

amy-deng commented 2 years ago

Hi @fuzexin I double checked my data.

I first constructed a dataset for a country (e.g., EG, 2017-2019) using the condition 'ActionGeo_CountryCode'=='EG'.
I then used the column 'ActionGeo_Fullname' to keep only the events that occur in a given city. 'ActionGeo_Fullname' generally consists of city, province, country.
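
The two steps above could look roughly like this in pandas. This is only a sketch: the thread does not say how the city name was matched, so the case-insensitive substring match is an assumption, and the function name `filter_city_events` is hypothetical.

```python
import pandas as pd

def filter_city_events(events: pd.DataFrame, country_code: str,
                       city: str) -> pd.DataFrame:
    """Country-level subset first, then a city filter on the location name."""
    # Step 1: keep events located in the given country (e.g. 'EG').
    country_df = events.loc[events["ActionGeo_CountryCode"] == country_code]
    # Step 2: keep rows whose full location name mentions the city.
    # ASSUMPTION: plain case-insensitive substring match; rows with a
    # missing ActionGeo_Fullname are dropped (na=False).
    in_city = country_df["ActionGeo_Fullname"].str.contains(
        city, case=False, regex=False, na=False
    )
    return country_df.loc[in_city]
```

Matching on the location name can select a different set of rows than filtering on ActionGeo_ADM1Code = 'EG11', which may account for part of the difference in counts discussed above.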

fuzexin commented 2 years ago

@amy-deng Thank you, Doctor Deng, I will try following your instructions.

amy-deng / CMF
