andybega / icews

Get the ICEWS event data
https://www.andybeger.com/icews/

duplicate file names and event data in source Dataverse repository #45

Closed mayeulk closed 2 years ago

mayeulk commented 5 years ago

After a fresh install on Ubuntu 18.04, the following fails with an error after downloading 151 files (73.1 MB):

library("icews")
library("DBI")
library("dplyr")
library("usethis")
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

# (...... downloads 151 files, ingesting correctly 294687 rows in sqlite database)
Downloading '20190309-icews-events.zip'
Error in writeBin(as.vector(f), tmp) : can only write vector objects

Launching update_icews(dryrun = FALSE) again and again does not solve the issue.

The following (launched after the error) might help:

> update_icews(dryrun = TRUE)
File system changes:
Found 151 local data file(s)
Downloading 84 file(s)
Removing 0 old file(s)

Database changes:
Deleting old records for 0 file(s)
Ingesting records from 84 file(s)

Plan:
Download            '20190309-icews-events.zip'
Download            '20190309-icews-events.zip'
Ingest records from '20190309-icews-events.tab'
Ingest records from '20190309-icews-events.tab'
Download            '20190311-icews-events.zip'
Ingest records from '20190311-icews-events.tab'
Download            '20190312-icews-events.zip'
Ingest records from '20190312-icews-events.tab'
Download            '20190313-icews-events.zip'
Ingest records from '20190313-icews-events.tab'
Download            '20190314-icews-events.zip'
(etc.)
andybega commented 5 years ago

For some reason there are two files named "20190309-icews-events.zip" in the dataverse repo (they don't contain the same events). The duplicate file names are causing issues, e.g. the download right now is by the file name/label. For me this is causing a timeout when it tries to download the file, but I'm guessing it's also what leads to the error you are getting. I'll have to change how the files are downloaded and labelled for the database.
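For reference, a quick way to spot the duplicated labels from R, as a sketch using the same dataverse helpers that appear later in this thread (get_doi() is the icews helper returning the dataset DOIs, and the dataverse server is assumed to be configured as for update_icews()):

library("icews")
library("dataverse")

# list all files in the daily events dataset and show labels that repeat
file_list <- get_dataset(get_doi()$daily)
labels <- file_list$files$label
labels[duplicated(labels)]
# should include "20190309-icews-events.zip"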

mayeulk commented 5 years ago

Looking at the metadata of the 2 files, they have at least four attributes that differ, which may help differentiate them:
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/DER3I5&version=200.0
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/5DCQQF&version=200.0

Those 4 things are:

Those might help to form a filename if needed. Hope this helps.

andybega commented 5 years ago

Hey, thank you. I've been (and still am) on vacation, but had a chance to look a bit more. Gonna paste this here, partly for myself. Aside from the repeated 20190309-icews-events.zip, there's also 20190409-icews-events-1.zip and 20190410-icews-events-1.zip, as well as 20190503-thru-20190519-icews-events.zip.

The files are downloaded using the dataverse package. Looking at get_file, which does the actual downloading, it seems that it is also possible to use a numeric ID instead of the file name. E.g.:

library("icews")
library("dataverse")

file_list = get_dataset(get_doi()$daily)
head(file_list$files[, c("label", "id")])
label      id
1 20181004-icews-events.zip 3234868
2 20181005-icews-events.zip 3235021
3 20181006-icews-events.zip 3238491
4 20181007-icews-events.zip 3238493
5 20181008-icews-events.zip 3238584
6 20181009-icews-events.zip 3238918
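So a download keyed on the numeric ID rather than the label could look something like this (a sketch only, not the package's current code path; it assumes get_file() accepts the integer file ID and returns the raw bytes, as the dataverse documentation describes):

library("dataverse")

# illustrative only: the id 3234868 is the first row of the listing above
raw_zip <- get_file(3234868)

# get_file() returns a raw vector, which can be written straight to disk
writeBin(raw_zip, "20181004-icews-events.zip")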

Right now the file name is used to reconcile the local and remote states, so that will have to switch. There are two tables, source_files and null_source_files, that list the ingested source files (and source files that contained no new events and thus wouldn't show up in the events table source_file column).

library("RSQLite")
con = connect()
dbGetQuery(con, "select * from source_files limit 5;")
dbGetQuery(con, "select * from null_source_files limit 5;")

The source file is also included in the events table (i.e. query_icews("select * from events limit 5;") %>% str() will show a source_file column at the end), but that's more for informative purposes since querying the events table at every update would take a really long time.

I'm going to have to switch those internal tables up. Maybe have a new source_file table with something like
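For example, something along these lines (purely hypothetical; the actual column names and layout are not settled):

library("DBI")

con <- connect()

# hypothetical layout: track source files by their dataverse file ID and
# keep the label only as informative metadata
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS source_files_new (
    id    INTEGER PRIMARY KEY,  -- dataverse file ID
    label TEXT NOT NULL         -- e.g. '20190309-icews-events.zip'
  );
")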

The various state and downloader functions will need to be switched to use the integer ID instead of file name.

There's probably also going to have to be some kind of one-time upgrade functionality that implements these changes on an existing table, to avoid having to nuke and re-download everything.
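For example (again just a sketch with hypothetical names), the update could check whether the local tables already track files by ID and only run the migration when they don't:

library("DBI")

con <- connect()

# hypothetical upgrade guard: migrate only if source_files has no 'id' column yet
needs_upgrade <- !("id" %in% dbListFields(con, "source_files"))
if (needs_upgrade) {
  message("Upgrading database to ID-based source file tracking...")
  # ...one-time migration steps would go here...
}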

Well, lesson for me to not use file names as unique IDs when there's already a perfectly good unique ID on dataverse. I hope to get to this at the end of this week or next week.

andybega commented 5 years ago

Make DVN to local files work again:

Make local files to DB work again:

Misc

andybega commented 5 years ago

@mayeulk can you try updating the package and seeing if it works now?

mayeulk commented 5 years ago

Hi, it seems to work, although I could not finish the process (full disk):

Ingesting records from 'events.2015.20180710092545.tab'
  |================================================================================================================================| 100% 241 MB
Error in result_bind(res@ptr, params) : database or disk is full
Error in result_create(conn@ptr, statement) : no such savepoint: dbWriteTable


My previous attempts led to a db of 120 MB; this one is 6 GB.
I'll confirm in a few days (getting a new disk).
Thanks! Cheers,
Mayeul
andybega commented 5 years ago

😀 that sounds about right. I just updated all the way through 15 June, and have ~8GB for the database and ~5GB for the raw ".tsv" files.

mayeulk commented 5 years ago

Hi, it repeatedly fails now on '20190409-icews-events.zip' and '20190409-icews-events-1.zip'

Downloading '20190406-icews-events.zip'
Ingesting records from '20190406-icews-events.tab'
Downloading '20190407-icews-events.zip'
Ingesting records from '20190407-icews-events.tab'
Downloading '20190408-icews-events.zip'
Ingesting records from '20190408-icews-events.tab'
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events.tab'
Downloading '20190409-icews-events-1.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) : 
  UNIQUE constraint failed: events.event_id, events.event_date
> date()
[1] "Wed Jun 19 18:59:23 2019"
> update_icews(dryrun = FALSE); date()
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) : 
  UNIQUE constraint failed: events.event_id, events.event_date
> 
mayeulk commented 5 years ago

Maybe related to this, the '@icews' Twitter feed mentions duplicates, see:

https://twitter.com/icews?lang=en Direct link to tweet Apr 7, 2019: https://t.co/W3XSnPU0Vo

ICEWS ‏ @icews Apr 7 We uploaded the 3 missing events file for Mar 26-28. Unfortunately, there will be duplicate events in these files and files from Mar 27-Apr 6, predominantly affecting Mar 28 and Apr 5. Use the Event ID field to identify these duplicates. http://bit.ly/2ORIfQX #icews #dataverse

mayeulk commented 5 years ago

I relaxed the uniqueness requirement (primary key) as a quick, temporary fix, running this against the SQLite database:

CREATE TABLE events_copy ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
 source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT, 
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER, 
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT, 
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
 source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );

INSERT INTO events_copy
   SELECT * FROM events;

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_copy RENAME TO events;

I kept the old data to analyse this. Running again update_icews(dryrun = FALSE) now works on this particular setup.

mayeulk commented 5 years ago

With my version of the database (with PRIMARY KEY (event_id, event_date, source_file) ), it now works:

Ingesting records from '20190618-icews-events.tab'
Cleaning up and optimizing database, this might take a while
Complete
File and/or database update done

On this new db, I ran the following to find duplicates:

SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1

There are 2434 rows returned ("duplicates"), of which:

2319 rows are for event_date == 20190409
77 rows are for event_date == 20190410
38 rows are for event_date == 20190408
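The per-date breakdown can be pulled straight from the database with a query like this (a sketch using the same connection helper as above):

library("DBI")

con <- connect()

# count duplicated (event_id, event_date) pairs per event_date
dbGetQuery(con, "
  SELECT event_date, COUNT(*) AS n_dupes
  FROM (
    SELECT event_id, event_date
    FROM events
    GROUP BY event_id, event_date
    HAVING COUNT(*) > 1
  )
  GROUP BY event_date
  ORDER BY n_dupes DESC;
")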

andybega commented 5 years ago

I had the exact same issue with "20190409-icews-events-1.zip" and "20190409-icews-events.zip", thought I had managed to fix it (https://github.com/andybega/icews/issues/46).

The two files contain the same exact set of events (by event ID), so what should happen is this:

  1. The version without "-1" is ingested first, and all events are added to the database.
  2. For the next version with "-1", write_data_to_db() should have recognized that there are potentially duplicate events, seen that all of them are in fact duplicates and thus that there is nothing to add, and then recorded the "-1" file name in the table tracking source files with no new events (i.e. all duplicates); roughly the kind of check sketched below.
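A rough sketch of that kind of check, not the actual write_data_to_db() code, would be an anti-join of the incoming events against the keys already in the database:

library("DBI")
library("dplyr")

con <- connect()

# 'incoming_events' is a placeholder for the data read from the '-1' file
existing <- dbGetQuery(con, "SELECT event_id, event_date FROM events;")
new_events <- anti_join(incoming_events, existing,
                        by = c("event_id", "event_date"))

if (nrow(new_events) == 0) {
  # nothing new to add: record the file in null_source_files instead
}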

Could you check if you get the same results for these queries?:

SELECT name FROM null_source_files WHERE name LIKE '20190409%';

"20190409-icews-events-1.tab" only

SELECT name FROM source_files WHERE name LIKE '20190409%';

Both "20190409-icews-events.tab" and "20190409-icews-events-1.tab".

SELECT source_file, count(*) AS n_events 
FROM events 
WHERE source_file LIKE '20190409%' 
GROUP BY source_file;

All from the file version without "-1":

                source_file n_events
1 20190409-icews-events.tab     2434
mayeulk commented 5 years ago

Here are the results (ran on my sqlite db with duplicates):

SELECT name FROM null_source_files WHERE name LIKE '20190409%';

0 rows returned

SELECT name FROM source_files WHERE name LIKE '20190409%';

"20190409-icews-events-1.tab"
"20190409-icews-events.tab"
2 rows returned

SELECT source_file, count(*) AS n_events 
FROM events 
WHERE source_file LIKE '20190409%' 
GROUP BY source_file;

source_file                     n_events
"20190409-icews-events-1.tab"   2434
"20190409-icews-events.tab"     2434
2 rows returned
mayeulk commented 5 years ago

I guess we can think of ways to remove the duplicates in SQL, which might be faster than doing it in R (or not).
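For comparison, an R-side way to at least find the duplicate keys (a sketch; it uses dplyr's database backend on the same connection, with the actual deletion still handled in SQL as in the next comment):

library("DBI")
library("dplyr")

con <- connect()

# lazily count (event_id, event_date) pairs in the events table and pull
# back only the keys that occur more than once (requires the dbplyr backend)
dupe_keys <- tbl(con, "events") %>%
  count(event_id, event_date) %>%
  filter(n > 1) %>%
  collect()

nrow(dupe_keys)
# should match the 2434 duplicates reported above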

mayeulk commented 5 years ago

Here I implement, in SQL, a solution to remove the duplicates linked to this issue. For speed of queries in this testing phase, we do not make a full copy of the db, but only copy a recent subset into the table events_extract:

CREATE TABLE events_extract  ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
 source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT, 
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER, 
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT, 
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
 source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );

INSERT INTO events_extract
   SELECT * FROM events where event_date>20190400;

-- The duplicates to fix
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned 

-- Same, without showing the count
SELECT event_id, event_date --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1

-- Full rows of duplicates
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
ORDER by event_date, event_id, source_file;
-- 4868 rows returned
-- we see that all duplicates are pairs coming from two files: -icews-events.tab and -icews-events-1.tab

-- From those duplicates, select only those loaded from -icews-events-1.tab
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab'
ORDER by event_date, event_id, source_file;
-- 2434 rows returned

-- DELETE the duplicates
DELETE FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab';
-- 2434 rows affected

-- Check there are no duplicates left
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1;
-- 0 rows returned
mayeulk commented 5 years ago

Run over the full events table, the DELETE SQL query takes about 15 s on my laptop (15465 ms, 2434 rows affected).
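For anyone repeating this on the full events table, the same DELETE can be issued from R through the package's connection, wrapped in a transaction so a failure does not leave the table half-cleaned (a sketch; the query is the one from the previous comment):

library("DBI")

con <- connect()

dbBegin(con)
n_removed <- dbExecute(con, "
  DELETE FROM events
  WHERE event_date || '-' || event_id IN (
    SELECT event_date || '-' || event_id
    FROM events
    GROUP BY event_id, event_date
    HAVING COUNT(*) > 1
  )
  AND source_file LIKE '%-icews-events-1.tab';
")
dbCommit(con)
n_removed
# 2434 on the database described above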

mayeulk commented 5 years ago

Running the update function now puts the duplicates from the 20190409 file back:

# R code
> update_icews(dryrun = FALSE)
Ingesting records from '20190409-icews-events-1.tab'
Downloading '20190622-icews-events.zip'
Ingesting records from '20190622-icews-events.tab'
Cleaning up and optimizing database, this might take a while
-- SQL code
-- The duplicates
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned 
mayeulk commented 5 years ago

I've changed the title to something more readable. I believe there are two options here:

andybega commented 4 years ago

Hi, given the recent changes in the ICEWS dataverse repository, I think this is not an issue anymore. Duplicate events are still a problem, but they should be taken care of when ingesting new data. Have you tried updating the data recently?

(It should work even with the previous data present, but might give some essentially ineffectual messages, https://github.com/andybega/icews/issues/54#issuecomment-637425359)