brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License
61 stars 5 forks source link

publish sqlite DB in addition to CSV? #19

Open cldellow opened 9 months ago

cldellow commented 9 months ago

Hello, thank you for making qrank! It saves a lot of labour and pipeline fiddling.

Would it be possible to also publish qrank as a sqlite DB in addition to a CSV file?

The SQLite format has been stable since 2004. IMO it can serve a similar role as CSV, but with the benefit of having fast random-access lookups when the table is indexed.

The SQLite library is widely available on modern OSes, with bindings for all popular programming languages. Some environments, such as Python and Bun, bundle SQLite so that it can be immediately used without installing any additional packages.

If the DB is created with the full Wikidata ID (e.g., the string Q12345), a SQLite DB is approx 1.1GB - approx 3 times the size as the CSV.

If created with the ID converted to a number (e.g., the number 12345), a SQLite DB is approx 350MB -- about the same size as the CSV.

Creating the DB is straight-forward, e.g. this bash script:

do_the_thing() {
    cat <<EOF
.mode csv
.import qrank.csv qrank_raw
create table qrank(id integer not null primary key, rank integer not null);
insert into qrank(id, rank) select substr(Entity, 2), QRank FROM qrank_raw;
drop table qrank_raw;
vacuum;
EOF
}

rm -f qrank.db
do_the_thing | sqlite3 qrank.db

I understand if this is out of scope for the project -- but if you're open to serving an extra file, I suspect it'd eliminate one more step from many people's pipelines.

cldellow commented 8 months ago

If anyone else is interested in this, I've started republishing the QRank dataset as an SQLite artifact weekly. It can be fetched from https://github.com/hikeratlas/qrank/releases/download/latest/qrank.db

Hope that's OK - if not, let me know and I'll happily take it down.