End Point Dev blog
https://www.endpointdev.com/blog/

Comments for Speeding Up Saving Millions of ORM Objects in PostgreSQL #961

Open phinjensen opened 6 years ago

phinjensen commented 6 years ago

Comments for https://www.endpointdev.com/blog/2014/04/speeding-up-saving-millions-of-orm/ By Szymon Lipiński

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
phinjensen commented 6 years ago
original author: Bartosz Radaczyński
date: 2014-04-11T14:37:46-04:00

Passwords can be provided by setting the environment variables PGUSER and PGPASSWORD. Arguably that is better than the ~/.pgpass file (http://www.postgresql.org/docs/9.1/static/libpq-envars.html)
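For example, a minimal sketch of relying on those variables from Python (the connection values here are hypothetical); psycopg2 goes through libpq, so it picks them up automatically:

```python
import os
import psycopg2

# Hypothetical credentials; libpq reads these variables, so no password
# needs to appear in the connection string or in ~/.pgpass.
os.environ["PGHOST"] = "localhost"
os.environ["PGDATABASE"] = "blog"
os.environ["PGUSER"] = "blog_user"
os.environ["PGPASSWORD"] = "secret"

# An empty conninfo string makes libpq fall back to the PG* variables.
conn = psycopg2.connect("")
```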

phinjensen commented 6 years ago
original author: Andreas Kostyrka
date: 2014-04-11T14:58:23-04:00

Well, you've forgotten two things:

1. psycopg2 has support for COPY.
2. If you have access to the server file system, you can write the data out to a temporary file and use a COPY command with a filename.
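For illustration, a minimal sketch of both approaches (the table and column names are assumptions; the second form resolves the path on the database server and needs the corresponding file permissions there):

```python
import io
import psycopg2

conn = psycopg2.connect("")  # assumes PG* environment variables are set

# 1) psycopg2's COPY support: stream tab-separated rows from a file-like object.
buf = io.StringIO("title 1\tbody 1\tpayload 1\n")
with conn.cursor() as cur:
    cur.copy_from(buf, "posts", columns=("title", "body", "payload"))

# 2) With access to the server's file system: COPY straight from a file,
#    skipping the client-side data transfer entirely.
with conn.cursor() as cur:
    cur.execute("COPY posts (title, body, payload) FROM '/tmp/posts.tsv'")

conn.commit()
```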

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2014-04-11T17:22:05-04:00

I think Szymon meant ~/.pgpass for the password file.

Arguably, there's no significant difference in security between environment variables and ~/.pgpass, because Postgres requires that ~/.pgpass be readable only by the owner. Any user that can read that can also read (e.g. on Linux) /proc/$pid/environ and see your environment variables.

phinjensen commented 6 years ago
original author: mike bayer
date: 2014-04-12T18:31:56-04:00

I can give you a trick that will probably get that 25s down to more like 10 or less: When you make your random objects, pre-assign the primary key integer value as well. The SQLAlchemy ORM will then insert the records using a single executemany() call to psycopg2, rather than calling execute() for individual INSERT statements, as it no longer needs to fetch newly generated SERIAL values. See http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 for a detailed discussion/demo of this.
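A rough sketch of that trick against a hypothetical posts model (the DSN and column names are assumptions, not from the original benchmark):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    body = Column(String)
    payload = Column(String)

engine = create_engine("postgresql://user:secret@localhost/blog")
session = sessionmaker(bind=engine)()

# Pre-assigning id means the ORM has no SERIAL value to fetch back,
# so it can batch all the INSERTs into a single executemany() call.
session.add_all(
    Post(id=i, title="title %d" % i, body="body %d" % i, payload="payload %d" % i)
    for i in range(1, 100001)
)
session.commit()
```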

phinjensen commented 6 years ago
original author: veidelis
date: 2014-04-16T04:20:21-04:00

A great post!

phinjensen commented 6 years ago
original author: Szymon Guz
date: 2014-04-16T04:31:29-04:00

@Bartosz: Yep, however that adds complexity to the project, and I wanted to keep it as simple as possible.

That way I would need to store the passwords in a couple of places after deploying the application: the application needs its own config with passwords, plus another copy in ~/.pgpass.

Setting env vars is not better: you need a script for starting the app which sets the proper variables first. I'd rather have all of them in ~/.pgpass, which allows setting different passwords for different users, databases, and servers at the same time.

What's more, ~/.pgpass is not universal: the PostgreSQL JDBC driver, for example, uses neither it nor the env variables, and you need to provide a proper connection string.
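For reference, each line of ~/.pgpass uses the colon-separated format below, and * acts as a wildcard in any of the first four fields (the second line is a hypothetical example):

```
hostname:port:database:username:password
localhost:5432:blog:blog_user:secret
```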

phinjensen commented 6 years ago
original author: Szymon Guz
date: 2014-04-16T04:32:24-04:00

@Andreas: I wrote about psycopg2's support for COPY (see the last two paragraphs).

Saving the csv/tsv to a file is not an option when working with a really huge number of rows. The file can be quite big, and the overhead of writing the file, then reading it, and writing the data again is just too big (yes, I've checked that a couple of times).

Using pipes between processes is usually much faster, as it uses much less memory and, ideally, no disk at all.
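A sketch of the pipe idea in Python (hypothetical table and data): psycopg2's copy_expert() reads from any file-like object, so it can consume the read end of an OS pipe while a thread produces the rows, and nothing touches the disk:

```python
import os
import threading
import psycopg2

def produce(write_fd):
    # Writer side: generate rows and stream them into the pipe.
    with os.fdopen(write_fd, "w") as w:
        for i in range(1000000):
            w.write("title %d\tbody %d\tpayload %d\n" % (i, i, i))

conn = psycopg2.connect("")  # assumes PG* environment variables are set
read_fd, write_fd = os.pipe()
producer = threading.Thread(target=produce, args=(write_fd,))
producer.start()

with conn.cursor() as cur, os.fdopen(read_fd) as reader:
    # COPY consumes the rows as they are produced, so memory use stays flat.
    cur.copy_expert("COPY posts (title, body, payload) FROM STDIN", reader)
conn.commit()
producer.join()
```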

phinjensen commented 6 years ago
original author: Szymon Guz
date: 2014-04-16T04:34:03-04:00

@Jon: I read through some documentation, and it looks like sometimes ~/.pgpass may not be accessible. Heroku doesn't give you access to it, and you need to use just the env variables. The PostgreSQL JDBC driver understands only JDBC connection strings; of course those can be built from env variables, but usually they are not.

phinjensen commented 6 years ago
original author: Szymon Guz
date: 2014-04-16T05:06:32-04:00

@Mike: Thanks for your comment.

I'm not sure this will speed things up. Currently a query like this is generated:

INSERT INTO posts (title, body, payload) VALUES ('a', 'b', 'c') RETURNING posts.id

This lets the database assign the id to the inserted row and then return it to SQLAlchemy.

What you propose is generating the id in the application. What about another application running at the same time which wants to insert data into the same table? How should the applications synchronize so they don't use the same ids? The serial column is very good in this case, as the nextval() call uses a lock internally, and only one process can call it at a time.

However, you could get a bunch of numbers from the sequence used by the serial column and set them directly on the SQLAlchemy objects. Unfortunately the only way to get 100 numbers from the sequence is:

SELECT nextval('posts_id_seq') FROM generate_series(1,100)

so you need to call the sequence 100 times. And you need to get the 100 numbers from the database into SQLAlchemy, set one on each object, and send it back to the database in the generated query. This way each number is sent twice instead of just once, as in the first case.
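In code, grabbing such a block of ids might look like this (a sketch; it assumes the posts table from the post and a connection via the PG* environment variables):

```python
import psycopg2

conn = psycopg2.connect("")
with conn.cursor() as cur:
    # One round trip, but still 100 nextval() calls on the server side.
    cur.execute("SELECT nextval('posts_id_seq') FROM generate_series(1, 100)")
    ids = [row[0] for row in cur.fetchall()]

# ids can now be pre-assigned to the ORM objects before a bulk insert.
```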

The problem with the sequence is that it is not transactional (which is quite good: when one transaction asks it for the next value, no other transaction needs to wait for the first one to finish). However, due to this non-transactional behaviour, you cannot run queries like:

SELECT setval('posts_id_seq', nextval('posts_id_seq')+1000);

because there could be another nextval() call just between the nextval() and setval() in the above query.

If just one user were using the database, then you'd be right, this could be faster. However, I usually assume working on a real system, and in that case I cannot set the primary key outside the database.