Closed: mayeulk closed this issue 2 years ago.
For some reason there are two files named "20190309-icews-events.zip" in the dataverse repo (they don't contain the same events). The duplicate file names cause problems because right now files are downloaded by their file name/label. For me this causes a timeout when it tries to download the file, but I'm guessing it's also what leads to the error you are getting. I'll have to change how the files are downloaded and labelled for the database.
Looking at the metadata of the two files, they have at least four attributes that differ, which may help distinguish them:
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/DER3I5&version=200.0
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/5DCQQF&version=200.0
Those 4 things are:
Those might help to form a filename if needed. Hope this helps.
Hey, thank you. I've been (and still am) on vacation, but had a chance to look a bit more. Gonna paste this here, partly for myself. Aside from the repeated 20190309-icews-events.zip, there are also 20190409-icews-events-1.zip and 20190410-icews-events-1.zip, as well as 20190503-thru-20190519-icews-events.zip.
The files are downloaded using the dataverse package. Looking at get_file(), which does the actual downloading, it seems that it is also possible to use a numeric ID instead of the file name. E.g.:
library("icews")
library("dataverse")
file_list = get_dataset(get_doi()$daily)
head(file_list$files[, c("label", "id")])
label id
1 20181004-icews-events.zip 3234868
2 20181005-icews-events.zip 3235021
3 20181006-icews-events.zip 3238491
4 20181007-icews-events.zip 3238493
5 20181008-icews-events.zip 3238584
6 20181009-icews-events.zip 3238918
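For example, downloading a single file by its numeric ID rather than by its label could look roughly like this (a sketch, not the package's current code; the server argument and the output file name are assumptions, and the exact get_file() behaviour depends on the dataverse package version):

```
library("dataverse")

# Sketch: fetch one file by its numeric dataverse ID instead of its label.
# get_file() returns the raw file bytes, which are then written to disk.
raw = get_file(3234868, server = "dataverse.harvard.edu")
writeBin(raw, "20181004-icews-events.zip")
```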
Right now the file name is used to reconcile the local and remote states, so that will have to switch. There are two tables, source_files and null_source_files, that list the ingested source files (and the source files that contained no new events and thus wouldn't show up in the events table's source_file column).
library("RSQLite")
con = connect()  # opens a connection to the local ICEWS SQLite database
dbGetQuery(con, "select * from source_files limit 5;")
dbGetQuery(con, "select * from null_source_files limit 5;")
The source file is also included in the events table (i.e. query_icews("select * from events limit 5;") %>% str() will show a source_file column at the end), but that's more for informative purposes, since querying the events table at every update would take a really long time.
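In other words, to figure out what has already been ingested, the cheap route is to read the small bookkeeping tables rather than scanning events; a minimal sketch, using the same connect() helper as above:

```
library("DBI")
library("icews")

con = connect()

# cheap: the bookkeeping table of ingested source files
ingested = dbGetQuery(con, "SELECT name FROM source_files;")$name

# expensive equivalent: scanning the whole events table
# dbGetQuery(con, "SELECT DISTINCT source_file FROM events;")
```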
I'm going to have to switch those internal tables up. Maybe have a new source_file table with something like the following.
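(The original sketch isn't shown here; purely as a hypothetical illustration, keyed on the dataverse file ID rather than the file name, it could be something like this:)

```
library("DBI")
library("icews")

con = connect()

# Hypothetical sketch only, not the package's actual schema;
# the table and column names here are illustrative.
dbExecute(con, "
  CREATE TABLE source_files_new (
    id   INTEGER NOT NULL PRIMARY KEY,  -- dataverse file ID
    name TEXT NOT NULL                  -- file label, no longer assumed to be unique
  );")
```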
The various state and downloader functions will need to be switched to use the integer ID instead of file name.
There's probably also going to have to be some kind of one-time upgrade functionality that implements these changes on an existing table, to avoid having to nuke and re-download everything.
Well, lesson for me to not use file names as unique IDs when there's already a perfectly good unique ID on dataverse. I hope to get to this at the end of this week or next week.
Make DVN to local files work again:

- get_dvn_state to new state format
- get_local_state to new state format
- plan_file_changes
- execute_plan to use file ID for download

Make local files to DB work again:

- get_db_state to new state format
- plan_database_sync for local file to DB steps (without downloading new files)
- plan_database_changes for DVN to local file to DB steps

Misc
@mayeulk can you try updating the package and seeing if it works now?
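(Updating means reinstalling the development version from GitHub; a minimal sketch, assuming the remotes package is installed:)

```
# Reinstall the development version of icews from GitHub, then restart R
remotes::install_github("andybega/icews")
```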
Hi, it seems to work, although I could not finish the process (disk full):

```
Ingesting records from 'events.2015.20180710092545.tab' |================================================================================================================================| 100% 241 MB
Error in result_bind(res@ptr, params) : database or disk is full
Error in result_create(conn@ptr, statement) : no such savepoint: dbWriteTable
```

My previous attempts led to a database of 120 MB; this one is 6 GB.
I'll confirm in a few days (getting a new disk).
Thanks! Cheers,
Mayeul
😀 that sounds about right. I just updated all the way through 15 June, and have ~8GB for the database and ~5GB for the raw ".tsv" files.
Hi, it repeatedly fails now on '20190409-icews-events.zip' and '20190409-icews-events-1.zip'
Downloading '20190406-icews-events.zip'
Ingesting records from '20190406-icews-events.tab'
Downloading '20190407-icews-events.zip'
Ingesting records from '20190407-icews-events.tab'
Downloading '20190408-icews-events.zip'
Ingesting records from '20190408-icews-events.tab'
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events.tab'
Downloading '20190409-icews-events-1.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
> date()
[1] "Wed Jun 19 18:59:23 2019"
> update_icews(dryrun = FALSE); date()
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
>
Maybe related to this, the @icews Twitter feed mentions duplicates, see:
https://twitter.com/icews?lang=en (direct link to the Apr 7, 2019 tweet: https://t.co/W3XSnPU0Vo)
ICEWS @icews Apr 7 We uploaded the 3 missing events file for Mar 26-28. Unfortunately, there will be duplicate events in these files and files from Mar 27-Apr 6, predominantly affecting Mar 28 and Apr 5. Use the Event ID field to identify these duplicates. http://bit.ly/2ORIfQX #icews #dataverse
I relaxed the uniqueness requirement (the primary key) as a quick, temporary fix, running this against the SQLite database:
CREATE TABLE events_copy ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT,
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER,
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT,
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );
INSERT INTO events_copy
SELECT * FROM events;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_copy RENAME TO events;
I kept the old data to analyse this. Running update_icews(dryrun = FALSE) again now works on this particular setup.
With my version of the database (with PRIMARY KEY (event_id, event_date, source_file)), it now works:
Ingesting records from '20190618-icews-events.tab'
Cleaning up and optimizing database, this might take a while
Complete
File and/or database update done
On this new db, I ran the following to find duplicates:
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
There are 2434 rows returned ("duplicates"), of which:

- 2319 rows are for event_date == 20190409
- 77 rows are for event_date == 20190410
- 38 rows are for event_date == 20190408
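The same check can also be run from R with the package's query helper; a sketch (the breakdown by date uses dplyr):

```
library("icews")
library("dplyr")

# duplicate (event_id, event_date) pairs, same query as above
dupes = query_icews("
  SELECT event_id, event_date, COUNT(*) AS n
  FROM events
  GROUP BY event_id, event_date
  HAVING COUNT(*) > 1;")

# how the duplicates break down by event_date
count(dupes, event_date, sort = TRUE)
```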
I had the exact same issue with "20190409-icews-events-1.zip" and "20190409-icews-events.zip" and thought I had managed to fix it (https://github.com/andybega/icews/issues/46).
The two files contain the exact same set of events (by event ID), so what should happen is this: write_data_to_db() should have recognized that there are potentially duplicate events, realized that all events are duplicates and thus there is nothing to add, and in that case it should have added the "-1" file name to the table tracking source files with no new events (i.e. all duplicates). Could you check if you get the same results for these queries?
SELECT name FROM null_source_files WHERE name LIKE '20190409%';
"20190409-icews-events-1.tab" only
SELECT name FROM source_files WHERE name LIKE '20190409%';
Both "20190409-icews-events.tab" and "20190409-icews-events-1.tab".
SELECT source_file, count(*) AS n_events
FROM events
WHERE source_file LIKE '20190409%'
GROUP BY source_file;
All from the file version without "-1":
source_file n_events
1 20190409-icews-events.tab 2434
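To spell out the intended behaviour (a toy sketch only, not the package's actual write_data_to_db() code): incoming rows whose (event_id, event_date) pair already exists in the database should be dropped, and if nothing is left the file should only be recorded in null_source_files.

```
library("dplyr")

# toy data standing in for the database contents and a newly read ".tab" file
existing = data.frame(event_id = c(1, 2), event_date = c(20190409, 20190409))
incoming = data.frame(event_id = c(2, 3), event_date = c(20190409, 20190409))

# keep only incoming rows whose (event_id, event_date) is not already present
to_add = anti_join(incoming, existing, by = c("event_id", "event_date"))
nrow(to_add)  # 1 here; if this were 0, the file would go into null_source_files
```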
Here are the results (run on my SQLite db with duplicates):
SELECT name FROM null_source_files WHERE name LIKE '20190409%';
0 rows returned
SELECT name FROM source_files WHERE name LIKE '20190409%';
"20190409-icews-events-1.tab" "20190409-icews-events.tab" 2 rows
SELECT source_file, count(*) AS n_events
FROM events
WHERE source_file LIKE '20190409%'
GROUP BY source_file;
source_file                    n_events
"20190409-icews-events-1.tab"  2434
"20190409-icews-events.tab"    2434
2 rows
I guess we can think of ways to remove duplicates in SQL, which might be faster than in R (or not).
Here I implement in SQL a solution to remove the duplicates linked to this issue. For query speed in this testing phase, we do not copy the full db, only a recent subset, into the table events_extract:
CREATE TABLE events_extract ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT,
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER,
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT,
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );
INSERT INTO events_extract
SELECT * FROM events where event_date>20190400;
-- The duplicates to fix
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned
-- Same, without showing the count
SELECT event_id, event_date --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- Full rows of duplicates
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
ORDER by event_date, event_id, source_file;
-- 4868 rows returned
-- we see that all duplicates are pairs coming from two files: -icews-events.tab and -icews-events-1.tab
-- From those duplicates, select only those loaded from -icews-events-1.tab
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab'
ORDER by event_date, event_id, source_file;
-- 2434 rows returned
-- DELETE the duplicates
DELETE FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab';
-- 2434 rows affected
-- Check there are no duplicates left
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1;
-- 0 rows returned
Run over the full events table, the DELETE SQL query takes 15 s on my laptop:
took 15465ms, 2434 rows affected
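For reference, the same DELETE can also be sent from R through DBI, where dbExecute() returns the number of rows affected; a sketch using the connect() helper:

```
library("DBI")
library("icews")

con = connect()

# same DELETE as above; dbExecute() returns the number of affected rows
n = dbExecute(con, "
  DELETE FROM events
  WHERE event_date || '-' || event_id IN (
    SELECT event_date || '-' || event_id
    FROM events
    GROUP BY event_id, event_date
    HAVING COUNT(*) > 1)
  AND source_file LIKE '%-icews-events-1.tab';")
n  # 2434 in the run reported above
```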
Running the update function now puts the dupes from the 20190409 file back:
# R code
> update_icews(dryrun = FALSE)
Ingesting records from '20190409-icews-events-1.tab'
Downloading '20190622-icews-events.zip'
Ingesting records from '20190622-icews-events.tab'
Cleaning up and optimizing database, this might take a while
-- SQL code
-- The duplicates
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned
I've changed the title to something more readable. I believe there are two options here:
Hi, given the recent changes in the ICEWS dataverse, I think this is not an issue anymore. Duplicate events are still a problem, but that should be taken care of when ingesting new data. Have you tried updating the data recently?
(It should work even with the previous data present, but might give some essentially ineffectual messages, https://github.com/andybega/icews/issues/54#issuecomment-637425359)
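E.g., a dry run first to see what would be downloaded and ingested, and then the actual update (sketch; dryrun = TRUE is presumed to only report the planned changes):

```
library("icews")

update_icews(dryrun = TRUE)   # report what would be downloaded/ingested
update_icews(dryrun = FALSE)  # actually download and ingest
```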
After a fresh install on Ubuntu 18.04, update_icews(dryrun = FALSE) fails with an error after downloading 151 files (73.1 MB). Launching it again and again does not solve the issue. The following (launched after the error) might help: