avinassh / fast-sqlite3-inserts

Some bunch of test scripts to generate a SQLite DB with 1B rows in fastest possible way
MIT License
381 stars 38 forks source link

Use SQL for generation #11

Open grishy opened 3 years ago

grishy commented 3 years ago

What about a using SQLite for test data generation?

Example:

CREATE TABLE test (
  id INTEGER PRIMARY KEY NOT NULL, 
  x REAL NOT NULL,
  y REAL NOT NULL,
  z REAL NOT NULL
);
INSERT INTO test
    WITH RECURSIVE
      cnt( id, x, y, z) AS (
      VALUES(1 , random(), random(), random()) UNION ALL 
      SELECT id+1,random(),random(), random() FROM cnt WHERE ID<1000)
    select * from cnt;

https://paulbradley.org/sqlite-test-data/ https://stackoverflow.com/questions/17931320/how-to-insert-random-data-into-a-sqlite-table-using-only-queries

WorldMaker commented 3 years ago

I was thinking about the same thing. It would be interesting to compare if everything was offloaded to a single transaction script like this and see if offloading all the math to SQLite makes sense. While there are range restrictions on the values, there aren't distribution restrictions (normal, poisson, what have you) and the post already says that any regular PRNG is fine and crypto-strength not necessary, so SQLite random() should be capable of meeting the constraints.

Having a quick scratch at the math, it would look something like:

Insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000
)
select * from usr
avinassh commented 3 years ago

@WorldMaker hey, could you send this code as a PR?

avinassh commented 3 years ago

I ran this locally, this is quite slow. Do let me know if i am missing something.

  1. First I manually created a DB, created a table.
  2. Then I ran this following script:
#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

saved this in a script and ran it with time:

time ./sq.sh
./sq.sh  111.59s user 2.35s system 96% cpu 1:58.00 total
grishy commented 3 years ago

@avinassh It turns out - is it at least faster than PyPy?

grishy commented 3 years ago
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;

Do they influence here?

avinassh commented 3 years ago

I had tried with them too, but it did not make much difference:

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

output:

$ time ./sq.sh

off
exclusive
./sq.sh  114.55s user 3.31s system 99% cpu 1:58.93 total
WorldMaker commented 3 years ago

Interesting results. We're about at the limit of my personal experience with SQLite directly and a lot of the things you might try to speed things up in other SQL environments obviously don't apply to SQLite.

Reviewing the Stack Overflow posted above, the sqlite3 shell supports an extension named generate_series which the documentation says is faster, so you could try that too (since you are already using the sqlite3 shell):

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
END_SQL
asg017 commented 3 years ago

generate_series is definitely faster than a recursive query, about 2x in my test.

Using generate_series:

-- load.sql
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
$ time sqlite3 t1.db '.read schema.sql' '.read load.sql'

real    0m55.747s
user    0m45.945s
sys 0m5.894s

Using recursive:

-- load-rec.sql
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
$ time sqlite3 t2.db '.read schema.sql' '.read load-rec.sql'

real    1m42.659s
user    1m35.587s
sys 0m4.820s

Adding in the pragma statements didnt make a difference in my test, in fact is slowed it down a few seconds.

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

INSERT INTO user (area, age, active) 
  SELECT 
    1000000, 
    5, 
    1 
  FROM generate_series(1, 100000000, 1);
real    0m27.924s
user    0m17.080s
sys 0m6.985s

So generate_series comes out at 55 seconds, still slower than the best rust method, but seemingly faster than the other ones. And there's about 25 seconds room for improvement if there's a faster random generator!

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

avinassh commented 3 years ago

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

Someone else did a flamegraph, I think some good time is being spent in random (in rust solution)

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

I don't have a pure SQL solution, so I will be happy to accept a PR on this since is close to python one :)

asg017 commented 2 years ago

Created a PR for this in #23! Based on of my machine, it appears to be faster than even the fastest rust solution, but would love to see a more official benchmark run