update_icews error after dataverse update resumption

andybega commented 4 years ago

ICEWS data on dataverse are being updated again. When I try to sync my local copy with the new updates, I get an error:

> update_icews(dryrun = T)
 Error: Tibble columns must have compatible sizes.
* Size 31: Existing data.
* Size 29: Column `category`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

andybega commented 4 years ago

Looks like the setup of the data on dataverse is changing. Prior to the break in data updating in late 2019, there were two relevant repos:

https://doi.org/10.7910/DVN/QI2T9A had daily data updates for the latest year
https://doi.org/10.7910/DVN/28075 had historical data, in one file per year

Currently (2020-05-06), the yearly repo has a file for 2019 and a partial file for 2020; it looks like the previously daily update repo will become a weekly update repo (its name was changed recently).

andybega commented 4 years ago

Ok, the weekly repo now has contents:

file_list <- dir(find_raw())
file_list

 [1] "20200505-icews-events.tab"      "events.1995.20150313082510.tab" "events.1996.20150313082528.tab"
 [4] "events.1997.20150313082554.tab" "events.1998.20150313082622.tab" "events.1999.20150313082705.tab"
 [7] "events.2000.20150313082808.tab" "events.2001.20150313082922.tab" "events.2002.20150313083053.tab"
[10] "events.2003.20150313083228.tab" "events.2004.20150313083407.tab" "events.2005.20150313083555.tab"
[13] "events.2006.20150313083752.tab" "events.2007.20150313083959.tab" "events.2008.20150313084156.tab"
[16] "events.2009.20150313084349.tab" "events.2010.20150313084533.tab" "events.2011.20150313084656.tab"
[19] "events.2012.20150313084811.tab" "events.2013.20150313084929.tab" "events.2014.20160121105408.tab"
[22] "events.2015.20180710092545.tab" "events.2016.20180710092843.tab" "events.2017.20180710093300.tab"
[25] "events.2018.20200427084805.tab" "events.2019.20200427085336.tab" "events.2020.20200506093336.tab"

There are duplicate events though.

events <- read_events_tsv(find_raw("20200505-icews-events.tab"))
events2020 <- read_events_tsv(find_raw("events.2020.20200506093336.tab"))

sum(paste0(events$event_id, events$event_date) %in% paste0(events2020$event_id, events2020$event_date))

[1] 43

range(events$event_date)
[1] "2020-04-28" "2020-05-06"

range(events2020$event_date)
[1] "2020-01-01" "2020-04-30"

table(events$event_date < "2020-05-01")

FALSE  TRUE 
 4191    43

The weekly update file has 43 events that have the same event_id and event_date as existing records in the 2020 yearly file.
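For reference, the overlap check above can be sketched on toy data. This is a minimal illustration, not package code: the data frames and the `make_key()` helper are made up, and the `"|"` separator is just a defensive choice so that the composite key stays unambiguous in general.

```r
# Toy stand-ins for the weekly and yearly event tables (made-up values).
weekly <- data.frame(
  event_id   = c(101, 102, 103),
  event_date = as.Date(c("2020-04-29", "2020-05-02", "2020-05-03"))
)
yearly <- data.frame(
  event_id   = c(100, 101),
  event_date = as.Date(c("2020-04-28", "2020-04-29"))
)

# Hypothetical helper: composite (event_id, event_date) key with an
# explicit separator between the two parts.
make_key <- function(df) paste(df$event_id, df$event_date, sep = "|")

# Flag weekly records whose key already exists in the yearly table.
is_dup <- make_key(weekly) %in% make_key(yearly)
n_dup <- sum(is_dup)

# Weekly records that would be safe to ingest after the yearly file.
weekly_new <- weekly[!is_dup, ]
```

In the toy data only event 101 on 2020-04-29 appears in both tables, so it is the one record that would collide on the (event_id, event_date) key.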

andybega commented 4 years ago

Are they identical records in every respect?

weekly_duplicates <- events %>% filter(event_date < "2020-05-01")
yearly_duplicates <- events2020 %>% filter(event_id %in% weekly_duplicates$event_id)

nrow(weekly_duplicates)
[1] 43
nrow(yearly_duplicates)
[1] 43

foo = full_join(weekly_duplicates, yearly_duplicates)
Joining, by = c("event_id", "event_date", "source_name", "source_sectors", "source_country", "event_text", "cameo_code", "intensity", "target_name", "target_sectors", "target_country", "story_id", "sentence_number", "publisher", "city", "district", "province", "country", "latitude", "longitude")
nrow(foo)
[1] 43

Yes. Hmm. That's probably going to cause an error.
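Why the row count answers the question: a full join on all shared columns keeps one row per distinct record, so 43 + 43 identical records collapse to 43, while any record that differed in some column would show up twice. A rough base R sketch of the same idea, counting distinct rows in the stacked tables (toy data; `n_distinct_rows()` is a made-up helper, and this assumes neither table has internal duplicates):

```r
# Toy records (made-up values); b_same is a reordered copy of a,
# b_diff differs from a in one field of one record.
a <- data.frame(
  event_id   = c(1, 2, 3),
  cameo_code = c("010", "020", "030"),
  stringsAsFactors = FALSE
)
b_same <- a[c(3, 1, 2), ]
b_diff <- a
b_diff$cameo_code[2] <- "999"

# Hypothetical helper: number of distinct rows across both tables.
# Equal to nrow(x) only if every record in y matches one in x exactly.
n_distinct_rows <- function(x, y) nrow(unique(rbind(x, y)))

n_same <- n_distinct_rows(a, b_same)  # all records identical
n_diff <- n_distinct_rows(a, b_diff)  # one record differs
```

Here `n_same` equals `nrow(a)`, and `n_diff` is one larger because the altered record no longer matches its counterpart.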

andybega commented 4 years ago

Unrelated problem: there are some references to old daily files lingering in the null_source_files table. They can be removed; they are no longer needed, since those daily files no longer exist and were replaced by the events.2018 file.

Confirm that only outdated file references are there:

query_icews("select * from null_source_files;")
                          name
1    20181004-icews-events.tab
2    20181005-icews-events.tab
3    20181007-icews-events.tab
4    20181008-icews-events.tab
5    20181009-icews-events.tab
6    20181010-icews-events.tab
7    20181011-icews-events.tab
8    20181012-icews-events.tab
9    20181013-icews-events.tab
10   20181014-icews-events.tab
11   20181015-icews-events.tab
12   20181016-icews-events.tab
13   20181017-icews-events.tab
14   20181018-icews-events.tab
15   20181019-icews-events.tab
16   20181020-icews-events.tab
17   20181021-icews-events.tab
18   20181022-icews-events.tab
19   20181023-icews-events.tab
20   20181024-icews-events.tab
21   20181025-icews-events.tab
22   20181026-icews-events.tab
23   20181027-icews-events.tab
24   20181028-icews-events.tab
25   20181029-icews-events.tab
26   20181030-icews-events.tab
27 20190409-icews-events-1.tab

Delete the references:

query_icews("delete from null_source_files;")

(This produces a warning, but it has worked.)
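The "confirm, then delete" step could also be checked programmatically rather than by eye. A sketch under assumptions: the filename pattern is inferred from the listing above, the cutoff date is a guess based on daily updates stopping in late 2019, and the vector below is a hand-copied subset standing in for the query result.

```r
# Hand-copied subset of the names listed above (stand-in for the query result).
null_files <- c(
  "20181004-icews-events.tab",
  "20181030-icews-events.tab",
  "20190409-icews-events-1.tab"
)

# Assumed pattern for old daily files: YYYYMMDD-icews-events[-N].tab
daily_pattern <- "^[0-9]{8}-icews-events(-[0-9]+)?\\.tab$"
is_daily <- grepl(daily_pattern, null_files)

# Date encoded in the first 8 characters of each name.
file_dates <- as.Date(substr(null_files, 1, 8), format = "%Y%m%d")

# Assumed cutoff: anything dated before 2020 belongs to the old daily setup.
cutoff <- as.Date("2020-01-01")

# isTRUE() guards against NA from names that do not parse as dates.
safe_to_delete <- isTRUE(all(is_daily) && all(file_dates < cutoff))
```

If `safe_to_delete` is `TRUE`, every entry looks like an old daily file and the blanket delete above should not drop anything still in use.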

andybega commented 4 years ago

sync_db_with_files()
Deleting DB records from 'events.2020.20200427085547.tab'
Ingesting records from '20200505-icews-events.tab'
Ingesting records from 'events.2020.20200506093336.tab'
Error: UNIQUE constraint failed: events.event_id, events.event_date

andybega commented 4 years ago

Ok, the problem was that the plan ingested the weekly file before the yearly file, and the yearly ingest then failed on the UNIQUE constraint because 43 of its records were already in the DB from the weekly file. I already had a fix for this that checks whether a daily (now weekly) file contains duplicates of existing records and throws those out, but it depends on yearly files being ingested before weekly files. I changed the plan sorting in plan_database_changes() and plan_file_changes(), and that seems to have fixed the issue.
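To illustrate the ordering idea only (this is not the actual code in plan_database_changes() or plan_file_changes()): a sketch that puts yearly files ahead of weekly ones. `order_for_ingest()` is a made-up helper and the regex is an assumption based on the file names listed earlier.

```r
# File names taken from the listing earlier in this thread.
files <- c(
  "20200505-icews-events.tab",
  "events.2019.20200427085336.tab",
  "events.2020.20200506093336.tab"
)

# Hypothetical helper: yearly files ("events.YYYY. ...") first, then the
# rest, each group sorted by name.
order_for_ingest <- function(files) {
  is_yearly <- grepl("^events\\.[0-9]{4}\\.", files)
  # order() puts FALSE before TRUE, so negate to get yearly files first.
  files[order(!is_yearly, files)]
}

ordered <- order_for_ingest(files)
```

With yearly files first, the duplicate check on weekly files has the yearly records to compare against by the time it runs.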

It did require manual messing with the DB though since "20200505-icews-events.tab" had already been ingested.

query_icews("delete from events where source_file = '20200505-icews-events.tab';")
# make sure the table tracking ingested files is updated
icews:::update_stats()

I should maybe add a vignette that goes over the DB internals and how things are organized.

andybega commented 4 years ago

Ok, I've gone through several update cycles and this seems to be working correctly. For posterity: if icews was set up before ~May 2020, when ICEWS still had the old structure with daily updates, the only thing that should be required for the transition is to delete the harmless references to old null source files:

query_icews("delete from null_source_files;")

For a fresh setup since May 2020, without any local artifacts reflecting the old yearly/daily repo structure, nothing should be required.

andybega / icews

update_icews error after dataverse update resumption #54