duckdb / sqlite_scanner

DuckDB extension to read and write to SQLite databases
MIT License
190 stars 19 forks source link

DuckDB sqlite extension cannot manage sqlite data file from Chrome Browser "History" #44

Closed orzom411 closed 8 months ago

orzom411 commented 1 year ago

The DuckDB sqlite extension is not able to access sqlite data files, Chrome Browser History example. I suspect I'm doing something incorrect, just missed a step or three. When I try to attach a sqlite data file to DuckDB, I get the following error message:

Error: Parser Error: zero-length delimited identifier at or near "" LINE 1: SELECT ""

I have verified that the sqlite data file is valid and can be opened by the sqlite3 command-line tool.

Steps to Reproduce: Using DuckDB CLI master (v0.7.1 b00b93f0b1)

First I copy "History" file to a temp folder. Mine was located (note users should be your user id): C:\Users\users\AppData\Local\Google\Chrome\User Data\Default\History

Attempt to load: D attach 'd:/temp/History' (type sqlite);

Attempt to pull a table list:

D .tables Error: Parser Error: zero-length delimited identifier at or near """" LINE 1: SELECT ""

Expected Behavior: I expect to be able to attach the sqlite data file to DuckDB and run queries against it.

Actual Behavior: I get the following error message when I try to run a query against the sqlite data file:

Error: Parser Error: zero-length delimited identifier at or near "" LINE 1: SELECT ""

Tested the same file via sqlite3.exe sqlite3.exe

SQLite version 3.41.2 2023-03-22 11:56:21 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database.

sqlite> .open 'd:/temp/History'

sqlite> .databases main: d:\temp\History r/w

sqlite> .tables cluster_keywords downloads segments cluster_visit_duplicates downloads_slices typed_url_sync_metadata clusters downloads_url_chains urls clusters_and_visits keyword_search_terms visit_source content_annotations meta visits context_annotations segment_usage

Environment: • OS: Windows 10 • DuckDB version master: v0.7.1 b00b93f0b1 • SQLite version: SQLite version 3.41.2 2023-03-22 11:56:21

In addition I tried to install and load the sqlite extension

D load sqlite
> ;

Error: Invalid Input Error: Initialization function "sqlite_scanner_init" from file "C:\Users\user.duckdb\extensions\v0.7.1\windows_amd64\sqlite_scanner.duckdb_extension" threw an exception: "Catalog Error: Table Function with name "sqlite_scan" already exists!"

Please note Chrome Browser has other sqlite data files and I think using DuckDB to analyze would be a good use case.

Mytherin commented 1 year ago

Thanks for the report! I have fixed the extension loading issue - but unfortunately I cannot reproduce the main issue with my Chrome history on MacOS:

$ > /opt/homebrew/bin/duckdb
v0.7.1 b00b93f0b1
D ATTACH '/Users/myth/Library/Application Support/Google/Chrome/Default/History' (TYPE SQLITE);
D .tables
clusters                 downloads_slices         typed_url_sync_metadata
clusters_and_visits      downloads_url_chains     urls                   
content_annotations      keyword_search_terms     visit_source           
context_annotations      meta                     visits                 
downloads                segment_usage          
downloads_reroute_info   segments               

It is possible there is a piece of data in the history that is specifically causing this issue here. It would be super helpful if you could send me the SQLite file so I could debug the issue (mark@duckdblabs.com) - although I fully understand if you would prefer not to do so for privacy concerns of course. As an alternative perhaps you could help triage it by trying to figure out which table is the problem using the sqlite_scan function, e.g.:

SELECT * FROM sqlite_scan('/Users/myth/Library/Application Support/Google/Chrome/Default/History', 'urls');
orzom411 commented 1 year ago

@Mytherin Thank you for getting back to me. Narrowing down the table(s) that might be an issue. As you might expect I can't forward a full datafile, I will see about mocking up one that has the issue identified, now that I have a better idea of what to look for and how to test it), and, by the way, thank you for giving me instructions to access the tables individually, very much appreciated.

Diving in, first noted our table list is slightly different, curious but I don't think that's relevant. Narrowed down the list to three tables that appear to have a sqlite_scan issue, so I tested from SQLite's prospective, via counts:

sqlite> SELECT count(*) as ct FROM clusters; ct 3396 sqlite> SELECT count(*) as ct FROM cluster_visit_duplicates; ct 19309 sqlite> SELECT count(*) as ct FROM clusters_and_visits; ct 14760

DuckDB, via the sqlite_scan, we snag on these tables in the following ways:

c:\tools\duckdb v0.7.2-dev14 c817201fa8 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. D LOAD sqlite_scanner;

D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'clusters' ) limit 10; Error: Parser Error: zero-length delimited identifier at or near """" LINE 1: SELECT "" ^

D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'cluster_visit_duplicates' ) limit 10; Error: Invalid Error: Failed to prepare query "SELECT ROWID FROM "cluster_visit_duplicates"": no such column: ROWID

D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'clusters_and_visits' ) limit 10; Error: Invalid Error: Failed to prepare query "SELECT ROWID FROM "clusters_and_visits"": no such column: ROWID

Confirmed no ROWID in either of the two latter tables:

sqlite> SELECT * FROM cluster_visit_duplicates limit 5; visit_id|duplicate_visit_id 260949|260948 260951|260950 260954|260953 260955|260952 260957|260956

sqlite> SELECT * FROM clusters_and_visits where url_for_display like 'youtube.com%' limit 1; cluster_id|visit_id|score|engagement_score|url_for_deduping|normalized_url|url_for_display 3455|260954|1|94.419075012207|https://youtube.com/|https://www.youtube.com/shorts/MRo1mUAS4lQ|youtube.com/shorts/MRo1mUAS4lQ

The "clusters" table I'm not sure what the issue is. I started with the thought it was a reserved word, clusters may need to be encapsulated but I couldn't confirm that. Here is example data from the target file:

sqlite> SELECT * FROM clusters where label like 'youtube.com' limit 5; cluster_id|should_show_on_prominent_ui_surfaces|label|raw_label|triggerability_calculated|originator_cache_guid|originator_cluster_id 2987|0|youtube.com|youtube.com|1||0 3268|0|youtube.com|youtube.com|1||0 3455|0|youtube.com|youtube.com|1||0 3476|0|youtube.com|youtube.com|1||0 3492|0|youtube.com|youtube.com|1||0

To complete the table list, here are my counts:

c:\tools\duckdb v0.7.2-dev14 c817201fa8 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database.

D .headers on
D .mode ascii -- csv
D .separator "|" "\r\n"
D ;
D .timer on

D LOAD sqlite_scanner; Run Time (s): real 0.063 user 0.046875 sys 0.015625 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'cluster_keywords' ) limit 10; ct 4819 Run Time (s): real 0.002 user 0.015625 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'downloads' ) limit 10; ct 4006 Run Time (s): real 0.004 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'segments' ) limit 10; ct 71 Run Time (s): real 0.001 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'downloads_slices' ) limit 10; ct 0 Run Time (s): real 0.001 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'typed_url_sync_metadata' ) limit 10; ct 71 Run Time (s): real 0.002 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'downloads_url_chains' ) limit 10; ct 6551 Run Time (s): real 0.003 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'urls' ) limit 10; ct 17188 Run Time (s): real 0.011 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'keyword_search_terms' ) limit 10; ct 2224 Run Time (s): real 0.002 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'visit_source' ) limit 10; ct 29 Run Time (s): real 0.002 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'content_annotations' ) limit 10; ct 32483 Run Time (s): real 0.007 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'meta' ) limit 10; ct 5 Run Time (s): real 0.001 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'visits' ) limit 10; ct 37623 Run Time (s): real 0.006 user 0.015625 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'context_annotations' ) limit 10; ct 33839 Run Time (s): real 0.006 user 0.000000 sys 0.000000 D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History', 'segment_usage' ) limit 10; ct 6035 Run Time (s): real 0.002 user 0.000000 sys 0.000000

I'll see if I can spot a data artifact in the clusters table, will be a little while before I can get back to you on this specifically (day job takes up so much time).

Thank you for your help, this is proving to be a worthy test.

Mytherin commented 1 year ago

Interestingly both clusters_and_visits and clusters are empty for me:

D SELECT * FROM clusters_and_visits limit 5;
┌────────────┬──────────┬────────┐
│ cluster_id │ visit_id │ score  │
│   int64    │  int64   │ double │
├────────────────────────────────┤
│             0 rows             │
└────────────────────────────────┘
D SELECT * FROM clusters;
┌────────────┬────────┐
│ cluster_id │ score  │
│   int64    │ double │
├─────────────────────┤
│       0 rows        │
└─────────────────────┘

That is likely why I am not encountering the issue.

Would you be open to stripping the other tables from the file and sending the file with only those tables over? Looking at the tables they seem to not have any private information in isolation.

orzom411 commented 1 year ago

To replicate the structure issue noted in the chrome browser SQLite "History" data file:

c:\tools\sqlite3 d:\temp\History.samplestructure.db

CREATE TABLE clusters_and_visits(cluster_id INTEGER NOT NULL,visit_id INTEGER NOT NULL,score NUMERIC NOT NULL,engagement_score NUMERIC NOT NULL,url_for_deduping LONGVARCHAR NOT NULL,normalized_url LONGVARCHAR NOT NULL,url_for_display LONGVARCHAR NOT NULL,PRIMARY KEY(cluster_id,visit_id))WITHOUT ROWID;

CREATE TABLE "clusters"(cluster_id INTEGER PRIMARY KEY AUTOINCREMENT,should_show_on_prominent_ui_surfaces BOOLEAN NOT NULL,label VARCHAR NOT NULL,raw_label VARCHAR NOT NULL,triggerability_calculated BOOLEAN NOT NULL,originator_cache_guid TEXT DEFAULT "" NOT NULL,originator_cluster_id INTEGER DEFAULT 0 NOT NULL);

(Please note I omitted some details, see earlier comments for version information if needed. I did keep it for DuckDB as I'm in the process of building and testing, it will change and I want to be sure this detail is known).

Shift over to DuckDB

c:\tools\duckdb v0.7.2-dev14 c817201fa8 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database.

D LOAD sqlite_scanner; D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History.recover.db', 'clusters_and_visits' ); Error: Invalid Error: Failed to prepare query "SELECT ROWID FROM "clusters_and_visits"": no such column: ROWID D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History.samplestructure.db', 'clusters' ) limit 10; Error: Parser Error: zero-length delimited identifier at or near """" LINE 1: SELECT ""

_* Please also note I didn't perform the "INSTALL sqlitescanner;" as I've already downloaded it, e.g. just load is needed.

The first issue, no ROWID, is clearly the structure states "WITHOUT ROWID". For now I think a simple mitigation, if this data is needed, would be to copy the data out via SQLite into a table that doesn't have this restriction. I think noting it as a limitation would help future users who land on the error message if it's something essential to how sqlite_scanner operates.

The second issue I believe comes from the two defaults, might actually be just the zero length string default, e.g. if I create a table without the two defaults:

CREATE TABLE "clusters_nodefaults"(
cluster_id INTEGER PRIMARY KEY AUTOINCREMENT
,should_show_on_prominent_ui_surfaces BOOLEAN NOT NULL
,label VARCHAR NOT NULL
,raw_label VARCHAR NOT NULL
,triggerability_calculated BOOLEAN NOT NULL
,originator_cache_guid TEXT
,originator_cluster_id INTEGER);

I can then read if from DuckDB's sqlite_scan:

c:\tools\duckdb v0.7.2-dev14 c817201fa8 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database.

D .headers on
D .mode ascii -- csv
D .separator "|" "\r\n"
D ;
D LOAD sqlite_scanner;
D SELECT count(*) as ct FROM sqlite_scan('d:/temp/History.samplestructure.db', 'clusters_nodefaults' ) limit 10;

ct 0

Thank you again for looking into this and giving me a viable workaround for the details, very much appreciated.

I should have noted, chrome browser on windows 10, Version 113.0.5672.63 (Official Build) (64-bit). I don't know if I simply have an old structure, e.g. did they change this, maybe OS or a privacy option distinction has anything to do with the issue. From the above it does look like defining the tables in the manner this History file does is legitimate from SQLite's prospective. Should also note I never dump history and it's been running for a number of years now, so I guess the structure may have changed over time but was never actually migrated. I will, next opportunity, check the structure from a fresh install and update this thread.