Open grishy opened 3 years ago
I was thinking about the same thing. It would be interesting to compare if everything was offloaded to a single transaction script like this and see if offloading all the math to SQLite makes sense. While there are range restrictions on the values, there aren't distribution restrictions (normal, poisson, what have you) and the post already says that any regular PRNG is fine and crypto-strength not necessary, so SQLite random()
should be capable of meeting the constraints.
Having a quick scratch at the math, it would look something like:
Insert into user
with recursive usr(id, area, age, active) as (
values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
where id < 100000
)
select * from usr
@WorldMaker hey, could you send this code as a PR?
I ran this locally, this is quite slow. Do let me know if i am missing something.
#!/bin/sh
sqlite3 pure_sql.db <<'END_SQL'
insert into user
with recursive usr(id, area, age, active) as (
values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
where id < 100000000
)
select * from usr;
END_SQL
saved this in a script and ran it with time:
time ./sq.sh
./sq.sh 111.59s user 2.35s system 96% cpu 1:58.00 total
@avinassh It turns out - is it at least faster than PyPy?
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
Do they influence here?
I had tried with them too, but it did not make much difference:
#!/bin/sh
sqlite3 pure_sql.db <<'END_SQL'
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
insert into user
with recursive usr(id, area, age, active) as (
values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
where id < 100000000
)
select * from usr;
END_SQL
output:
$ time ./sq.sh
off
exclusive
./sq.sh 114.55s user 3.31s system 99% cpu 1:58.93 total
Interesting results. We're about at the limit of my personal experience with SQLite directly and a lot of the things you might try to speed things up in other SQL environments obviously don't apply to SQLite.
Reviewing the Stack Overflow posted above, the sqlite3
shell supports an extension named generate_series which the documentation says is faster, so you could try that too (since you are already using the sqlite3
shell):
#!/bin/sh
sqlite3 pure_sql.db <<'END_SQL'
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
END_SQL
generate_series
is definitely faster than a recursive query, about 2x in my test.
Using generate_series
:
-- load.sql
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
$ time sqlite3 t1.db '.read schema.sql' '.read load.sql'
real 0m55.747s
user 0m45.945s
sys 0m5.894s
Using recursive:
-- load-rec.sql
insert into user
with recursive usr(id, area, age, active) as (
values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
where id < 100000000
)
select * from usr;
$ time sqlite3 t2.db '.read schema.sql' '.read load-rec.sql'
real 1m42.659s
user 1m35.587s
sys 0m4.820s
Adding in the pragma statements didnt make a difference in my test, in fact is slowed it down a few seconds.
Interestingly, I tried without random()
calls and hardcoded some numbers, and it was even faster:
INSERT INTO user (area, age, active)
SELECT
1000000,
5,
1
FROM generate_series(1, 100000000, 1);
real 0m27.924s
user 0m17.080s
sys 0m6.985s
So generate_series
comes out at 55 seconds, still slower than the best rust method, but seemingly faster than the other ones. And there's about 25 seconds room for improvement if there's a faster random generator!
Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series
!
Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:
Someone else did a flamegraph, I think some good time is being spent in random (in rust solution)
Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!
I don't have a pure SQL solution, so I will be happy to accept a PR on this since is close to python one :)
Created a PR for this in #23! Based on of my machine, it appears to be faster than even the fastest rust solution, but would love to see a more official benchmark run
What about a using SQLite for test data generation?
Example:
https://paulbradley.org/sqlite-test-data/ https://stackoverflow.com/questions/17931320/how-to-insert-random-data-into-a-sqlite-table-using-only-queries