It seems that the problem I had with the index is related to the wrong import of the data from the TSV file.
?[rowid, from, via, to, totype] := SGraph[from, via, to, rowid, totype]
:put AG {rowid => from, via, to, totype}
You would expect that the SGraph relation read from the CozoScript CsvReader is converted on the fly into the ? return relation, with its columns in a different order. But this is not the case, and that is why the count of rows is wrong. My guess is that there are rows that have been skipped because they are duplicates on (from, via, to). Is that correct?
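For what it's worth, rows that collide on the stored relation's key are collapsed by :put regardless of the column order in the rule head, so any loss would come from key duplicates rather than from the reordering itself. A minimal sketch with a hypothetical relation T (not part of the dataset):

{:create T {k: String => v: String}}
{
    ?[k, v] <- [['a', '1'], ['a', '2'], ['b', '3']]
    :put T {k => v}
}
{?[count(k)] := *T{k}}  # returns 2, not 3: the two rows sharing key 'a' collapse into one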
Perhaps the solution here is to use ReorderSort, which can be applied to the intermediate result SGraph in order to achieve the reordering. I am not sure about the syntax and cannot find an example. Can anyone help with this?
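For reference, here is a hedged sketch of the ReorderSort syntax based on its documented parameters (out lists the output expressions, sort_by the sort keys). As I read the docs, the result has an ordering number prepended as its first column, so one extra binding is needed when consuming it; worth double-checking against the manual for your version:

Reordered[] <~ ReorderSort(SGraph[from, via, to, rowid, totype],
                           out: [rowid, from, via, to, totype],
                           sort_by: [rowid])
?[rowid, from, via, to, totype] := Reordered[ord, rowid, from, via, to, totype]
:put AG {rowid => from, via, to, totype}

That said, reordering alone would not change the row count: the :put still dedupes on the key of AG.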
If you change the key to (from, via, to, rowid), you would expect that it would produce a relation with all the rows, but it still fails. Why?
{:create AKB {from: String, via: String, to: String, rowid: String => totype: String}}
SGraph[] <~ CsvReader(types: ['String', 'String', 'String', 'String', 'String'],
                      url: "file://all.tsv",
                      delimiter: '\t',
                      has_headers: true)
?[from, via, to, rowid, totype] := SGraph[from, via, to, rowid, totype]
:put AKB {from, via, to, rowid => totype}
?[count(rowid)] := *AKB{rowid}
3505082
If you don't specify any functional dependency, i.e. all the columns go into the key, then
{:create AKB {from: String, via: String, to: String, rowid: String, totype: String}}
?[from, via, to, rowid, totype] := SGraph[from, via, to, rowid, totype]
:put AKB {from, via, to, rowid, totype}
?[count(rowid)] := *AKB{rowid}
3603898
Wall time (in-mem): 16.4 sec
Wall time (on-disk): 29.6 sec
But this count is also wrong; check with
wc -l all.tsv
3680804
(Note that wc -l also counts the header line, so the file holds 3680803 data rows.)
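A hedged way to check where the missing lines go, reusing the CsvReader rule from the script above (count_unique is Cozo's distinct-count aggregation):

SGraph[] <~ CsvReader(types: ['String', 'String', 'String', 'String', 'String'],
                      url: "file://all.tsv",
                      delimiter: '\t',
                      has_headers: true)
?[count(rowid), count_unique(rowid)] := SGraph[from, via, to, rowid, totype]

Since a relation is a set, exactly duplicated lines are already collapsed in SGraph, so count(rowid) here is the number of distinct full rows, while count_unique(rowid) is the number of distinct rowid values. Comparing the two against wc -l separates exact duplicate lines from rows that merely share a rowid.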
I think there is no point in discussing the creation of an index here, e.g. ndx[from, via], since the order of the columns read from the file and the key created in the relation already cover that case.
Now, with this new (still wrong) import, I had to run all the queries again to measure the performance. This time I tried my benchmarks both in-memory and on disk; here are the results:
?[from, via, to, rowid, totype] := *AKB[from, via, to, rowid, totype], rowid='Q4521676-alias-en-adfdbd-0'
In-mem: 1.410 sec, on-disk: 2.610 sec
?[from, via, to, totype] := *AKB{from, via, to, totype}, from="Q2685", via="alias"
In-mem: 1.29 sec, on-disk: 2.54 sec
Children[from, via, to] := *AKB[from, via, to, _, totype],
                           from = "Q2685", via = "P40", totype = "wikibase-item"
?[who, child, name] := Children[who, via, child], *AKB{from: child, via: "alias", to: name}
In-mem: 2.100 sec, on-disk: 4.230 sec
What kind of Rust data structures are you using for CozoDB? Can you share a brief explanation of how data is read and transferred from the KV store into memory, and how joins are implemented under the hood?
Hi, I tried to use indexes in CozoScript; I am experimenting with KGTK knowledge graphs.
The Cozo database was built from this TSV file.
Here is the script to import the data:
Create an index based on the (from, via) fields:
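(The index-creation statement itself did not survive the formatting; assuming the stored relation is named AG and the index ndx, as referenced below, it would presumably look like the following, using Cozo's ::index system op.)

::index create AG:ndx {from, via}

The index can then be queried explicitly like a stored relation, e.g. *AG:ndx{from, via}.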
Question 1: The number of rows that were read into RocksDB is smaller than the number of lines in the file. Why?
Notice: it took more than 3 sec to count the rows in RocksDB, while with bash
wc -l
it took only 100 msec.
Question 2: There is no significant difference in execution time with and without the index. Why?
Is that because of the order in which the AG data are already stored in the database tree, i.e. (rowid, from, via, to, totype)?
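A hedged way to check what the engine actually does is the ::explain system op, which prints the evaluation plan of the braced query; the query below is just a placeholder over the assumed AG relation:

::explain {
    ?[from, via, to] := *AG{from: "Q2685", via: "P40", to}
}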
Question 3: The following query reads: find all triplets (from, via, to) of Arnold.
The set of rows I get from executing the query with the index is significantly smaller compared to the one I get without the index. Why is that happening?
For reference, compare the output of the following command with and without the index.
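(The command referred to here did not survive the formatting either. As a hypothetical stand-in, counting matches through the base relation and through the explicit index should expose the discrepancy; this assumes the index is AG:ndx {from, via} as sketched above and that, as I understand Cozo indices, it also carries the base relation's key column, so rowid is available in it.)

?[count(rowid)] := *AG{from: "Q2685", via: "P40", rowid}
?[count(rowid)] := *AG:ndx{from: "Q2685", via: "P40", rowid}

If the two counts differ, the index is out of sync with the base relation.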