dwyl / learn-postgresql

🐘 Learn how to use PostgreSQL and Structured Query Language (SQL) to store and query your relational data. 🔍
212 stars 23 forks source link

How much space do null values take up in PostgreSQL? #49

Open nelsonic opened 5 years ago

nelsonic commented 5 years ago

Many people chose to use a non-relational ("NoSQL") database (such as MongoDB or Cassandra) from the start of their project/app because they want the "flexibility" of not having a "rigid" schema. From experience we have observed that the vast majority of Apps can actually hugely benefit from developers thinking about schema up-front rather than "dumping data into mongo" and dealing with out to query and relate it later.

One of the common questions when using PostgreSQL is around having "empty" fields in a table. Imagine that you have a schema for a person (a human being that uses your SaaS product). That person might have a wide range of data associated with them. see: https://github.com/dwyl/fields/issues/4 Let's consider a typical record:

As you can see from this basic "person" record, Alex does not have an instagram or facebook account. Both columns in the table will be null:

inserted_at name email (PK) facebook mobile instagram twitter website
1541609554 Alex alex@gmail.com null +44789654321 null alex hialex.com
1541609876 Jo jo@dwyl.com null +1212456789 jopix joblogs iamjo.net

Question: Does having null values take up a lot of Disk Space?

Answer: No, Each null value only uses one bit on disk.

For argument's sake, let's represent the data this way "on disk":

1541609554|Alex|alex@gmail.com||+44789654321||alex|hialex.com|

So in the above example, the Alex's row of data in the people table is 62 bytes or 496 bits:

image

In the scheme of things, having a few null columns will not have a major impact on the disk usage. If your database table has 30 such default null fields, it would still only be 30 bits per row. In general, we will only add a fields to a table/schema that has a high likelihood of containing data. We aren't going to have a field for myspace ...

Obviously it would be even better if null data took up zero bits, but sadly that is not realistic. Because the underlying data storage needs a way of "representing" the null value.

Conclusion: having structured relational data queries is "worth it" for a few bits of null data.

I feel that the "trade off" of having structured relational data (when appropriate) is worthwhile for the sane querying and having a few bits of null data in the rows, is a "fair price".

We should re-visit this hypothesis in a few months if we notice any "slow queries".

Relevant Stack Overflow / Databases Q/A



Note: some of NoSQL databases have good use-cases for specific types of Apps, data & querying. e.g: Neo4J https://github.com/dwyl/learn-neo4j for Graph data or Elasticsearch for Full-text Search: https://github.com/dwyl/learn-elasticsearch That is not in question here and is a topic for another issue/discussion. Most people who are starting out with building apps should learn how to store data in a structured way to ensure they understand how to query/analyse it and avoid confusion about related data.

chapmanjacobd commented 4 years ago

null = 1 bit variable/feature/column = 8 bits in ASCII_land or 8 to 48 bits in UTF-8land per character_

NoSQL with a pure implementation takes up way more space.

Let's say for example you have this sparse dataset:

[ key1 int, key2 text, key3 int, key4 text, key5 int ]

Each column name would be repeated within the document storage if the value is present.

In best case scenario there is only one value per document:

{ key1: 23 }

because the column name is 4 characters long it is (4 * 8) = 32 bits to specify the column name. That means you could have 32 empty columns in Postgres before the row would equal the same bits to specify which key the value belongs to.

In the worst case scenario you would specify all columns:

{ key1: 23, key2: "Bobby drop_tables)eliteHaxx", key3: 42, key4: "1040z", key5: 199 }

In this scenario (4 5) = 20 characters are used to spell out the column names. That's 20 8 bits! You could have 160 null columns in a row before the size would equal the unoptimized NoSQL implementation.

But once you reach 800 or so columns (up to 1,600 simple columns) in Postgres then you will run into the max block size of 8KB. So it's better to store sparse data as JSONB data type.

Disclaimer: I'm not sure how variables are stored and processed in an actual implementation but I'm assuming MongoDB must not actually store raw JSON. If so then my previous statement has less weight. Postgres and other relational OLTP systems can often store JSON too (I think JSONB is a lot better than storing many sparse columns). I'm pretty sure the actual implementation is optimized fairly well