Closed andybega closed 4 years ago
Looks like the setup of the data on dataverse is changing. Prior to the break in data updating in late 2019, there were two relevant repos: one with yearly files and one with daily updates.
Currently (2020-05-06), the yearly repo has a file for 2019 and a partial file for 2020; it looks like the former daily update repo will become a weekly update repo (its name was changed recently).
Ok, the weekly repo now has contents:
file_list <- dir(find_raw())
file_list
[1] "20200505-icews-events.tab" "events.1995.20150313082510.tab" "events.1996.20150313082528.tab"
[4] "events.1997.20150313082554.tab" "events.1998.20150313082622.tab" "events.1999.20150313082705.tab"
[7] "events.2000.20150313082808.tab" "events.2001.20150313082922.tab" "events.2002.20150313083053.tab"
[10] "events.2003.20150313083228.tab" "events.2004.20150313083407.tab" "events.2005.20150313083555.tab"
[13] "events.2006.20150313083752.tab" "events.2007.20150313083959.tab" "events.2008.20150313084156.tab"
[16] "events.2009.20150313084349.tab" "events.2010.20150313084533.tab" "events.2011.20150313084656.tab"
[19] "events.2012.20150313084811.tab" "events.2013.20150313084929.tab" "events.2014.20160121105408.tab"
[22] "events.2015.20180710092545.tab" "events.2016.20180710092843.tab" "events.2017.20180710093300.tab"
[25] "events.2018.20200427084805.tab" "events.2019.20200427085336.tab" "events.2020.20200506093336.tab"
There are duplicate events though.
events <- read_events_tsv(find_raw("20200505-icews-events.tab"))
events2020 <- read_events_tsv(find_raw("events.2020.20200506093336.tab"))
sum(paste0(events$event_id, events$event_date) %in% paste0(events2020$event_id, events2020$event_date))
[1] 43
range(events$event_date)
[1] "2020-04-28" "2020-05-06"
range(events2020$event_date)
[1] "2020-01-01" "2020-04-30"
table(events$event_date < "2020-05-01")
FALSE TRUE
4191 43
The weekly update file has 43 events that have the same event_id and event_date as existing records in the 2020 yearly file.
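The same overlap count can be written with an explicit separator in the key, so that two different id/date pairs can never paste to the same string. This is only a minimal base R sketch on invented toy data; `make_key()` and all values are hypothetical and not part of icews.

```r
# Toy stand-ins for the weekly and yearly files; all values are invented.
weekly <- data.frame(
  event_id   = c(101L, 102L, 103L),
  event_date = c("2020-04-29", "2020-04-30", "2020-05-02"),
  stringsAsFactors = FALSE
)
yearly <- data.frame(
  event_id   = c(100L, 101L, 102L),
  event_date = c("2020-04-28", "2020-04-29", "2020-04-30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: composite (event_id, event_date) key with a separator.
make_key <- function(df) paste(df$event_id, df$event_date, sep = "|")

# TRUE for weekly rows whose key also appears in the yearly data.
overlap   <- make_key(weekly) %in% make_key(yearly)
n_overlap <- sum(overlap)
n_overlap
```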
Are they identical records in every respect?
weekly_duplicates <- events %>% filter(event_date < "2020-05-01")
yearly_duplicates <- events2020 %>% filter(event_id %in% weekly_duplicates$event_id)
nrow(weekly_duplicates)
[1] 43
nrow(yearly_duplicates)
[1] 43
foo = full_join(weekly_duplicates, yearly_duplicates)
Joining, by = c("event_id", "event_date", "source_name", "source_sectors", "source_country", "event_text", "cameo_code", "intensity", "target_name", "target_sectors", "target_country", "story_id", "sentence_number", "publisher", "city", "district", "province", "country", "latitude", "longitude")
nrow(foo)
[1] 43
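Yes: the full join on all shared columns still has only 43 rows, so every weekly record matches its yearly counterpart in every field. Hmm. That's probably going to cause an error.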
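The reasoning behind that row-count check can be sketched in base R on invented toy data: if two sets of records are identical in every field, stacking and de-duplicating them leaves as many rows as one set alone, while any differing field leaves extra rows. This assumes neither set contains internal duplicates; the column values are made up for illustration.

```r
# Two toy record sets that are identical in every column (invented values).
weekly_dup <- data.frame(
  event_id   = c(1L, 2L),
  event_date = c("2020-04-29", "2020-04-30"),
  cameo_code = c("010", "042"),
  stringsAsFactors = FALSE
)
yearly_dup <- weekly_dup

# Identical sets: stacking and de-duplicating collapses back to the original size.
combined      <- unique(rbind(weekly_dup, yearly_dup))
all_identical <- nrow(combined) == nrow(weekly_dup)

# Change a single field in one record: the stacked set now keeps an extra row.
yearly_changed <- yearly_dup
yearly_changed$cameo_code[2] <- "043"
n_after_change <- nrow(unique(rbind(weekly_dup, yearly_changed)))
```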
Unrelated problem: there are some references to old daily files lingering in the null_source_files table. Remove those; they are no longer needed, since those daily files no longer exist and were replaced by the events.2018 file.
Confirm only outdated old file references are there:
query_icews("select * from null_source_files;")
name
1 20181004-icews-events.tab
2 20181005-icews-events.tab
3 20181007-icews-events.tab
4 20181008-icews-events.tab
5 20181009-icews-events.tab
6 20181010-icews-events.tab
7 20181011-icews-events.tab
8 20181012-icews-events.tab
9 20181013-icews-events.tab
10 20181014-icews-events.tab
11 20181015-icews-events.tab
12 20181016-icews-events.tab
13 20181017-icews-events.tab
14 20181018-icews-events.tab
15 20181019-icews-events.tab
16 20181020-icews-events.tab
17 20181021-icews-events.tab
18 20181022-icews-events.tab
19 20181023-icews-events.tab
20 20181024-icews-events.tab
21 20181025-icews-events.tab
22 20181026-icews-events.tab
23 20181027-icews-events.tab
24 20181028-icews-events.tab
25 20181029-icews-events.tab
26 20181030-icews-events.tab
27 20190409-icews-events-1.tab
Delete the references:
query_icews("delete from null_source_files;")
(This produces a warning, but it has worked.)
sync_db_with_files()
Deleting DB records from 'events.2020.20200427085547.tab'
Ingesting records from '20200505-icews-events.tab'
Ingesting records from 'events.2020.20200506093336.tab'
Error: UNIQUE constraint failed: events.event_id, events.event_date
Ok, the problem was that the plan tried to ingest the weekly file before the yearly file, and then threw an error because the yearly file contains the same records. I already had a fix for this that checks whether the daily (now weekly) file contains duplicates and then throws those out, but that depends on ingesting records from a yearly file before ingesting records from weekly files. I changed the plan sorting in plan_database_changes() and plan_file_changes(), and that seems to have fixed the issue.
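To illustrate the ordering idea, here is a rough base R sketch, not the actual icews code: `sort_yearly_first()` and `drop_ingested()` are hypothetical helpers, and the file name patterns are assumed from the listing above. Yearly files are put ahead of weekly files, and rows whose (event_id, event_date) key is already in the database are dropped before a weekly file is ingested.

```r
# Hypothetical helper (not part of icews): order files so that yearly files
# ("events.YYYY.<timestamp>.tab") come before weekly files.
sort_yearly_first <- function(files) {
  is_yearly <- grepl("^events\\.[0-9]{4}\\.", files)
  # FALSE sorts before TRUE, so !is_yearly puts yearly files first.
  files[order(!is_yearly, files, method = "radix")]
}

# Hypothetical helper: drop rows of `new` whose (event_id, event_date) key
# already exists in `existing`.
drop_ingested <- function(new, existing) {
  key <- function(df) paste(df$event_id, df$event_date, sep = "|")
  new[!key(new) %in% key(existing), , drop = FALSE]
}

files <- c("20200505-icews-events.tab",
           "events.2020.20200506093336.tab",
           "events.2019.20200427085336.tab")
plan <- sort_yearly_first(files)

# Toy data (invented): the yearly rows are "ingested" first, then the weekly
# rows are filtered against them.
yearly <- data.frame(event_id = c(1L, 2L),
                     event_date = c("2020-04-29", "2020-04-30"),
                     stringsAsFactors = FALSE)
weekly <- data.frame(event_id = c(2L, 3L),
                     event_date = c("2020-04-30", "2020-05-02"),
                     stringsAsFactors = FALSE)
kept <- drop_ingested(weekly, yearly)
```

With the yearly file ingested first, only the weekly rows that are new survive the filter, so the unique constraint on event_id and event_date is never hit.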
It did require manual messing with the DB though since "20200505-icews-events.tab" had already been ingested.
query_icews("delete from events where source_file = '20200505-icews-events.tab';")
# make sure the table tracking ingested files is updated
icews:::update_stats()
I should maybe add a vignette that goes over the DB internals and how things are organized.
Ok, I've gone through several update cycles and this seems to be working correctly. For posterity: if icews was set up before ~May 2020, when ICEWS still had the old structure with daily updates, the only thing that should be required for the transition is to delete the harmless references to old null source files:
query_icews("delete from null_source_files;")
For a fresh setup since May 2020, without any local artifacts reflecting the old yearly/daily repo structure, nothing should be required.
ICEWS data on dataverse are being updated again. When I try to sync my local copy with the new updates, I get an error: