cozodb / cozo

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!
https://cozodb.org
Mozilla Public License 2.0

Importing/Indexing problem #174

Closed athanhat closed 1 year ago

athanhat commented 1 year ago

Hi, I tried to use indices in CozoScript; I am experimenting with KGTK knowledge graphs.

The Cozo database was built from this tsv file.

Here is the script used to import the data:

{:create AG {rowid: String => from: String, via: String, to: String, totype: String}}
SGraph[] <~ CsvReader(types: ['String', 'String', 'String', 'String', 'String'], 
                      url: "file://all.tsv", 
                      delimiter: '\t', 
                      has_headers: true)
?[rowid, from, via, to, totype] := SGraph[from, via, to, rowid, totype]
:put AG {rowid => from, via, to, totype}

# Wall time: In memory, 16.4 sec
# Wall time: On disk, 29.6 sec

::compact
# Wall time: 10sec
# DB Size: 154MB `arnold.db/`

NR[count(rowid)] := *AG{rowid}
?[nrows] := NR[nrows]

# nrows: 3505082
# Wall time: 3.11 s

# time wc -l all.tsv
# 3680804 all.tsv
# real   100msec

# Arnold aliases without indexing
?[from, via, to, totype] := *AG{from, via, to, totype}, from="Q2685", via="alias"
# Wall time: 3.23 s

Create an index on the (from, via) fields

::index create AG:idx {from, via}
# Wall time: 24.6 sec

::columns AG:idx

::compact
# Wall time: 10sec
# DB Size: 163MB

::explain {?[from, via, to, totype] := *AG{from, via, to, totype}, from="Q2685", via="alias"}

?[from, via, to, totype] := *AG{from, via, to, totype}, from="Q2685", via="alias"
# Wall time: 3.19 sec

Question 1: Why is the number of rows imported into RocksDB smaller than the number of lines in the file?

Note: it took more than 3 sec to count the rows in RocksDB, while bash's wc -l took only 100 msec.

Question 2: Why is there no significant difference in execution time after creating the index?

Is that because of the order in which the AG data are already stored in the database (tree)?

rowid, from, via, to, totype

Question 3:

The following query reads: find all triples (from, via, to) for Arnold

?[from, via, to, totype] := *AG[_, "Q2685", via, to, totype], from="arnold"

The set of rows I get when executing the query with the index is significantly smaller than the one I get without the index. Why is that happening?

For reference, compare the output of the following command with and without the index

athanhat commented 1 year ago

Source of the problem: importing data into a relation without a functional dependency

It seems that the problem I had with the index is related to an incorrect import of the data from the tsv file.

?[rowid, from, via, to, totype] := SGraph[from, via, to, rowid, totype]
:put AG {rowid => from, via, to, totype}

You would expect the SGraph relation read by CozoScript's CsvReader to be converted on the fly into the ? return relation with the columns in a different order. But this is not the case, and that is why the row count is wrong. My guess is that rows have been skipped because there are duplicates of (from, via, to). Is that correct?
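The collapse described above can be sketched in Python. This is only a toy model of last-write-wins semantics for a keyed relation, not Cozo's actual implementation, and the sample rows are made up:

```python
# Toy model: a relation keyed on rowid behaves like a dict keyed on rowid,
# so input rows that share a key collapse to one stored row (last write wins).
rows = [
    ("r1", "Q2685", "alias", "Arnold", "string"),
    ("r2", "Q2685", "P40", "Q123", "wikibase-item"),
    ("r1", "Q2685", "alias", "Arnie", "string"),  # duplicate key "r1"
]

relation = {}
for rowid, frm, via, to, totype in rows:
    relation[rowid] = (frm, via, to, totype)

print(len(rows), len(relation))  # 3 input rows, 2 stored rows
```

The same collapse happens with any choice of key: changing the key columns only changes which input rows count as duplicates of each other.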

Solution 1:

Perhaps the solution here is to use ReorderSort, which could be applied to the intermediate result SGraph to achieve the reordering. I am not sure about the syntax and cannot find an example. Can anyone help with this?

Solution 2: FAILED!

If you change the key to (from, via, to, rowid), you would expect it to produce a relation with all the rows, but it failed. Why?

{:create AKB {from: String, via: String, to: String, rowid: String => totype: String}}

SGraph[] <~ CsvReader(types: ['String', 'String', 'String', 'String', 'String'], 
                      url: "file://all.tsv", 
                      delimiter: '\t', 
                      has_headers: true)

?[from, via, to, rowid, totype] := SGraph[from, via, to, rowid, totype]
:put AKB {from, via, to, rowid => totype}

?[count(rowid)] := *AKB{rowid}
# 3505082

Solution 3: FAILED!

If you don't specify any functional dependency, then:

{:create AKB {from: String, via: String, to: String, rowid: String => totype: String}}
?[from, via, to, rowid, totype] := SGraph[from, via, to, rowid, totype]
:put AKB {from, via, to, rowid => totype}

?[count(rowid)] := *AKB{rowid}
# 3603898

# Wall time: In memory, 16.4 sec
# Wall time: On disk, 29.6 sec

But this count is also wrong; check against

wc -l all.tsv 
3680804
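One way to see which kind of deduplication accounts for the gap is to count distinct keys directly in the tsv. A sketch in Python; the sample data and KGTK-style header names are made up, and for the real check the StringIO would be replaced with `open("all.tsv", newline="")`:

```python
import csv
import io

# Count total data lines, distinct rowids, and distinct (from, via, to, rowid)
# keys in a KGTK-style tsv. Column order assumed from the CsvReader call above:
# from, via, to, rowid, totype.
sample = io.StringIO(
    "node1\tlabel\tnode2\tid\tnode2;wikidatatype\n"  # header (names assumed)
    "Q2685\talias\tArnold\tr1\tstring\n"
    "Q2685\talias\tArnie\tr1\tstring\n"              # duplicate rowid r1
    "Q2685\tP40\tQ123\tr2\twikibase-item\n"
)
reader = csv.reader(sample, delimiter="\t")
next(reader)  # skip the header: wc -l counts it, has_headers: true does not
total, rowids, keys = 0, set(), set()
for frm, via, to, rowid, totype in reader:
    total += 1
    rowids.add(rowid)
    keys.add((frm, via, to, rowid))

print(total, len(rowids), len(keys))  # 3 2 3
```

Comparing `total` against the two distinct counts shows how many rows each choice of key would collapse, and the skipped header line accounts for one more unit of the difference from wc -l.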

I think there is no point in discussing index creation here, e.g. ndx[from, via], since the order of the columns read from the file and the key created for the relation already covers this case.

athanhat commented 1 year ago

Query Results

Now, with this new (still incorrect) import, I had to rerun all the queries to measure performance. This time I ran my benchmarks both in memory and on disk; here are the results:

Point Query

?[from, via, to, rowid, totype] := *AKB[from, via, to, rowid, totype], rowid='Q4521676-alias-en-adfdbd-0'

In-Mem: 1.410 sec On-Dsk: 2.610 sec

Arnold Aliases

?[from, via, to, totype] := *AKB{from, via, to, totype}, from="Q2685", via="alias"

In-Mem: 1.29 sec On-Dsk: 2.54 sec

Arnold Children with their aliases

Children[from, via, to] := *AKB[_, "Q2685", via, to, totype], 
                           from="arnold", via="P40", totype="wikibase-item"
?[who, child, name] := Children[who, via, child], *AKB{from: child, via: "alias", to: name}

In-Mem: 2.100 sec On-Dsk: 4.230 sec

Question: I cannot see a big difference here in query execution time between the in-memory engine and the disk storage engine (RocksDB). Why?

What kind of Rust data structures are you using in CozoDB? Can you share a brief explanation of how data are read/transferred from the KV store into memory, and how joins are implemented under the hood?