YUHEE1984 / CatsOneday

1 stars 0 forks source link

11 - Intro to Databases and SQL #12

Closed TheBeege closed 1 month ago

TheBeege commented 3 months ago

Goal

Learn database modeling and SQL to provide a way to store our cat data.

This one is going to be a bigger, tough one. It will likely take a few days. Be aware that the code you write will not go into your project. This one is purely for your understanding and education, not as much for your portfolio. Don't go too deeply.

Context

First, I should give you a warning. I love databases. I consider this area my specialty, and this works well in today's age of big data and machine learning. I'm going to share a lot of context and concepts, possibly more than is strictly needed, but I hope I can cultivate a love of databases in you, too. I might also be flexing a little bit, in which case, I apologize.

What's a Database?

The word "database" is not well-defined and is used differently in different contexts. Theoretically, an Excel spreadsheet is a database. On the other hand, a globally distributed file storage system is also a database. The PostgreSQL Docker container you created last time is also a database. So let's clarify what we mean by "database" when we're talking about it today.

Let's constrain the idea of database to a software that stores and manages data. It manages the data on disk. It provides some way for other programs to interact with that data. This eliminates Excel, as Excel doesn't provide a way for other programs to interact with it (that I know of). But let's narrow things down further.

Within this definition of database, there are two main categories: relational (often called SQL) and non-relational (often called NoSQL). Relational databases include PostgreSQL, MariaDB, MySQL, SQL Server, Oracle, and CockroachDB. NoSQL databases include MongoDB, CouchDB, Cassandra, ElasticSearch, and neo4j. Both of these are certainly not comprehensive - there are many, many more databases out there. Today, we'll be focusing on relational databases.

Let's continue to narrow our focus. Even within relational databases, there are another two main categories: OLTP and OLAP. OLTP is online transaction processing, and OLAP is online analytics processing. OLTP is useful for web applications, where data requests are small in size but happen at a very high rate. OLAP is useful for analytics and reporting, where data quests are very large but happen less frequently. The SQL databases I mentioned above are OLTP. Some OLAP databases include DuckDB, Vertica, MonetDB, Apache Druid, Apache Pinot, and ClickHouse. There are also databases that don't fit well into either category, like the InfluxDB time series database. Since CatsOneday is a web application, we'll focus on relational OLTP databases.

So why am I explaining all of this to you? Mostly, it's to avoid other people confusing you when talking about databases and to help you understand the breadth of the domain. This level of complexity is why there are data engineers - people specialized in choosing, configuring, and managing various data systems.

Why did we choose PostgreSQL? PostgreSQL can do everything. It can run in OLTP and OLAP modes. It can support JSON queries like a non-relational database. It can run in a distributed deployment. It supports time series. It supports geolocation. Many people are starting to call PostgreSQL a "data platform" rather than just a "database." The best part? It's still free and open source.

Data Structure and Modeling

Tables

In relational databases, data is stored in tables. If you've ever used Microsoft Excel or Google Sheets, you've worked with tables; however, spreadsheet tables are much more flexible than relational database tables.

A table consists of columns and rows. Columns are often called "fields." Rows are often called "records." We'll use these interchangeably, and this is very common in day-to-day work. Fields are generally fixed with a table, where records are manipulated during the normal operation of the application working with the database. If you hear of someone saying SQL databases are inflexible compared to NoSQL database, they're talking about the fixed nature of the columns in database tables, among other things.

Relationships

A key difference between spreadsheets and relational databases is the relationships between tables. In spreadsheets, relationships between sheets/tables is possible, but it's not always the most important aspect of the software. (My target audience here is not finance people. The dark arts of finance spreadsheets are not the focus here, though they are respected.) In relational databases, relationships between tables are critical to the basic functioning of the application. When it comes to defining the structure of data, the relationships between data is the main focus.

One table roughly correlates to one "entity." An entity represents the concept of some kind of object in the application. In our case, a cat would be an entity, so we would make a table for cats. Once we've determined what entities are needed, we then decide what attributes they have. A cat has a name, a fur pattern, a sex, and more. We would define these as columns in our table. So far, this isn't very different from a spreadsheet table, right?

In our application, we'll likely have another entity: a cat owner. We should have another table, maybe called owners. Here is the key part: how are owners and cats connected? Many beginner engineers working with databases will initially think they should store the owner's data in the cat table. This may not have occurred to you because we started with the idea of "entities" from the beginning, but it is a very common mistake among learners. The ideal way (in OLTP databases) is to have cat and owner data in separate tables, then create a relationship between them. This is why they're called relational databases!

Many-to-One Relationships

How do we do this? It's not very intuitive, but it is quite simple once explained. Standard data modeling practice dictates that every table should have some sort of identifier - an ID. Except in the largest scale applications, this is usually just a number that counts up for each new record. So the first row would have an ID of 1. The second row would have an ID of 2. These IDs should be unique for every record. If we have a way to uniquely identify any record, we can reliably reference them from elsewhere. For our cats and owners, we could store the owner's ID (maybe as owner_id) in the cats table. In this way, a cats record can reliably reference an owner. Additionally, many records in the cats table could have the same owner_id, allowing many cats to share an owner. This is called a many-to-one relationship, and this is a critical concept in relational database modeling. The key thing to remember is that the *_id field goes on the many side of the relationship. There are many cats to one owner, so we store the owner's ID in the cats table.

Many-to-Many Relationships

This raises another problem. What is a cat has multiple owners? Maybe the cat belongs to a family, and we could say that an owner has many cats and a cat has many owners. This would be called a many-to-many relationship. It is a little more complicated to model in our tables. If we want this many-to-many relationship between cats and owners, we need a third table. This is called a "linking table." It links the cats and owners table in a many-to-many relationship. A linking table often consists of just two columns: the ID of the two entities it's connecting. In our case, it would just have cat_id and owner_id columns. That's it! In this way, we can connect cats and owners through this linking table. Often, linking table names are just a combination of the entities they're relating, like cats_owners. Some prefer to include some prefix or suffix to distinguish linking tables from entities, but that's on a per-team basis.

More on Relationships

Many-to-one and many-to-many are the most common relationships, but there is one further: one-to-one. This is often used in the case where an entity has a very large amount of detailed information but is not used often. The detailed information would be placed in a separate table, while the often-used information would remain in the main table for the entity. This is quite rare, but it's still good to know.

Designing relationships is a core skill in data modeling. There is a formal process of planning this design, called normalization (Korean). There are various stages of normalization, called "normal forms." The "Example of a step-by-step normalization" section does a good job walking through normalization (though it's missing primary keys, which is... a choice). Note that you will not normalize your tables for all normal forms. For example, 6NF is only useful for "cube"-style OLAP systems. Over time, you'll develop a natural feeling for what is and is not sufficient normalization. Honestly, I never learned the formal normalization definitions; I just learned what performs well in various situations and went with that. However, if I had learned formal normalization, I may have been able to avoid learning the more intricate details of how a database engine works. For you, if you prefer more academic, formalized knowledge, I recommend learning the normal forms. If you want to just figure out what does and doesn't work well, learn about database performance, instead.

Primary and Foreign Keys

There is the idea of a primary key. This is a unique value used to identify a record in a table. This sounds just like our justification for creating IDs in our tables, right? Primary keys are usually the ID of the entity. In linking tables, have a separate ID for each record doesn't make as much sense. In these cases, you can use a compound primary key, that is, a primary key consisting of more than one column.

In addition to primary keys, there are foreign keys. Foreign keys ensure integrity of data between tables. What if you try to delete an owner record that has cat records pointing to it? Defining a foreign key allows you to ensure that links like this are not broken. The absolute largest applications don't use foreign keys, as they incur a tiny performance cost, but for 99% of applications, the data integrity is more important than the infinitesimal performance gain. If you try to delete a record that would break relationships, the database will either throw an error or ensure that the dependent records are also deleted, depending on how you defined the key.

Diagramming

This may make sense now, but you may be wondering how to plan and track all of this for more complex applications. It's not easy. Often, engineers will visualize their data schema using a diagram called an ERD, or entity-relationship diagram. (Sorry, the Wikipedia article for this sucks, but this article is excellent.) The name is very descriptive: it diagrams entities and the relationships between them. Engineers usually spend more time trying to lay out these diagrams properly than actually writing the database code. I've found throughout my career that the more senior you become, the more time you spend at a whiteboard, and the less time you spend writing code.

Now that you have a very basic understanding of how data is structured in a database, let's talk about querying.

What's SQL?

SQL is structured query language. It is a standardized language, and the standardized version is often called ANSI SQL. ANSI is the American National Standards Institute. The current standard SQL definition is published by the ISO (International Standards Organization) under ISO 9075, and it updates quite frequently with the most recent version being released in 2023. Every database system adds functionalities on top of standard SQL, so SQL is not directly portable between database systems. However, the effort to port one database's SQL to another database is minimal compared to porting other types of software.

SQL is only used to retrieve and manipulate data. In addition to SQL, there is DDL, or data definition language. DDL defines the structure of the data. DDL has no standard and varies widely from database to database, though the concepts are generally the same. Today, you'll work with both DDL and SQL, but for the rest of this section, we'll focus on SQL.

SQL Verbs

SQL defines 4 main verbs for working with data: SELECT, INSERT, UPDATE, and DELETE. You'll often see these written in all capital letters. SQL itself is case insensitive for keywords like these, but old database people like me tend to use the all-capital way of writing. You can do it however you please.

Often, learners gain proficiency in these verbs in the order I listed above. This is due to the "destructiveness" of each verb. SELECT has the least potential for harm, while DELETE used improperly can bankrupt a company. Let's go over each one.

SELECT

SELECT is used to request data from the database. This is by far the most common query verb. Here's a quick example query:

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats;

This query would retrieve the name, fur_pattern, and weight_in_kg fields for every record in the cats table. You can also use * to specify every column, like SELECT * FROM cats;, but this is usually only used when exploring data.

INSERT

INSERT is used to add new records to the table. It looks like this:

INSERT INTO cats (name, fur_pattern, weight_in_kg) 
VALUES
    ('Spot', 'solid', 4.5),
    ('Simba', 'striped', 5);

The first set of parenthesis specifies which columns you are setting data for. This is important, as usually, we define tables to automatically set the id when we set a new record and don't want to specify the id ourselves. The remaining sets of parenthesis define new records that are being inserted. In this case, we're inserting two records. We can specify as many records as we want this way.

There's also another way to insert data based on a SELECT query's output. This is less common but still useful. This is a contrived, bad example in terms of real-world use, but it demonstrates the idea:

INSERT INTO export_cats (name, fur_pattern, weight_in_kg, country_of_origin)
SELECT
    name,
    fur_pattern,
    weight_in_kg,
    'South Korea'
from cats;

This would insert all records from cats into the export_cats table but would set the country_of_origin as a fixed value, 'South Korea'.

Filtering

Before we get into the next SQL verb, you should learn about filtering. Filtering is how you tell the database to only use an action on a subset of a table's data. We do this with the keyword WHERE. Once you see it, you'll realize it's quite simple.

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern = 'solid';

How will this query differ from our original SELECT example? If you guessed, "it will only show cats with a 'solid' fur pattern," then you're right! We can do slightly more complicated things, too.

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE weight_in_kg > 4.5;
SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern = 'solid' AND weight_in_kg > 4.5;

These two make some sense, right? The first will only show cats that are heavier than 4.5kg. The second will only show cats that are both heavier than 4.5kg and have a solid fur pattern.

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE (fur_pattern = 'solid' AND name = 'Spot') OR weight_in_kg > 4.5;

This one is a little more complicated, as you need to use more logic. This will return any cats that both have a solid fur pattern and named spot, and it will return any cats heavier than 4.5kg. The parenthesis "groups" filters together. What if the filter just looked like this?

WHERE fur_pattern = 'solid' AND name = 'Spot' OR weight_in_kg > 4.5;

Would it return what I described above? Would it return cats that have a solid fur pattern and either have the name spot or are heavier than 4.5kg? There is a standard rule, but I don't know it. "But Beege, you said you're good at this! Why don't you know?" This is bad code. It's not clear. Adding parentheses makes it clearer. When you have AND combined with OR, always use parentheses.

Another filter thing:

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern = 'solid' AND name <> 'Spot';

What the hell is <>? Well, SQL is not perfect. Do you remember != from Java? It's the same. <> is how we say "not equals" in SQL.

You could also write this a different way:

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern = 'solid' AND NOT name = 'Spot';

This is the same. You might be thinking that NOT ... = would be more readable. I disagree. It's possible to miss NOT. While <> is not immediately understandable for people unfamiliar with SQL, it forces you to lookup what it means. This is better than the ambiguity of OR without parentheses. When reading quickly, it's much easier to miss NOT than it is to miss <>, so readers are less likely to make a mistake with <> than NOT.

Another filtering trick:

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern = 'solid' AND name LIKE 'Spot%';

The LIKE operator is a text match with a wildcard. If you're not familiar, a "wildcard" means that it can match any value. In this case, the filter will match Spot, Spotty, Spotticus, or anything starting with Spot. Note that the % wildcard does not work with =! If you want to use a wildcard, you need to use LIKE.

Okay, last one for filtering:

SELECT
    name,
    fur_pattern,
    weight_in_kg
FROM cats
WHERE fur_pattern IN ('solid', 'spotted');

Maybe you can guess what this one means. Cna you guess another way to write it? An equivalent way to write this filter would be like below:

WHERE fur_pattern = 'solid' OR fur_pattern = 'spotted';

These are exactly the same. We usually prefer IN (...) because it's shorter, avoids the need for paretheses around conditions, and is still easy to read.

That's it for the basics of filtering! There are certainly more advanced things, but you can learn those on the job.

UPDATE

This command is the first dangerous one. If you don't use this properly, you can accidentally overwrite all of your data. Be careful.

The UPDATE verb overwrites data for existing records in the database. This is useful when some status changes or maybe some user data changes. For your safety, I'm going to teach you a trick I use to prevent making a terrible mistake. We're actually going to write a wrong, broken UPDATE first.

UPDATE cats
WHERE id = 2;

The above is not a valid update statement! I've only entered the table name and filter, which are not enough for an actual update. I do this because specifying the filter first prevents me from accidentally running the update without a filter. Running an update without a filter will update every single record in the table. This is usually not what you want. Here's an actual, complete update statement:

UPDATE cats
SET weight_in_kg = 4.2
WHERE id = 2;

This would actually run. This would update any records in the cats table that have an id of 2 by setting their weight_in_kg to 4.2.

You can also make multiple updates in a single statement:

UPDATE cats
SET
    weight_in_kg = 4.2,
    name = 'Skip'
WHERE id = 2;

This would update both the weight_in_kg field and name field for any records where the id is 2 (which would usually just be a single record).

That's it for the basics of UPDATE. While dangerous, it is relatively simple.

DELETE

This is the big, scary one: DELETE. It's even simpler:

DELETE FROM cats
WHERE id = 2;

That's it. Delete any records from the table matching the filters. Be very careful. Double-check any DELETE statements you run. If you accidentally miss the filter, you will delete all data in the table.

But that's pretty much it.

Joining

At the beginning of this issue, we made a big deal about relationships between entities in a database. JOIN is how we actually use these relationships. JOIN allows you to select records from multiple tables and combine them into records containing data from both tables. For example, if we had a cats table and an owners table, we can use JOIN to see what cats belong to what owners.

There are four kinds of JOIN:

With the exception of CROSS JOIN, they all look nearly the same. Let's see what joining a cats table to an owners table might look like.

INNER JOIN

SELECT
    cats.name,
    owners.name as owner
FROM cats
JOIN owners ON cats.owner_id = owners.id

A quick aside: notice the as owner. We can rename fields for display using the as <name> clause, where <name> is whatever you want to rename the field as. This renaming is only for the results of this individual query. It doesn't persist at all.

Remember when we were talking about many-to-one relationships? We're assuming here that many cats belong to one owner, so the cats table has a reference to the owners table's ID. But back to JOIN.

This is an INNER JOIN. Since it is the most common, simple JOIN is an alias for INNER JOIN. An INNER JOIN will match records from the two tables to each other based on the ON condition. LEFT JOIN/RIGHT JOIN and FULL OUTER JOIN will also do this. However, if a record from either table doesn't have a match, it will be removed from results. In this case, if there is a record in cats where the owner_id doesn't match the id of a record in the owners table, it will be filtered. Alternatively, if a record in the owners table isn't referenced by the owner_id of any records in the cats table, it will be filtered, as well. This way, only matches in both directions are selected.

LEFT JOIN / RIGHT JOIN

Let's take a look at the directional joins: LEFT JOIN and RIGHT JOIN.

SELECT
    cats.name,
    owners.name as owner
FROM cats
LEFT JOIN owners ON cats.owner_id = owners.id
SELECT
    cats.name,
    owners.name as owner
FROM cats
RIGHT JOIN owners ON cats.owner_id = owners.id

Left and right joins preserve all records on one side of the join, regardless of whether or not there's a matching record in the other table. Which is left, and which is right? It matches the direction we read in: we read from left to right. So the first tables/joins are the "left," and the joined table is the "right." In the above example, cats is left, and owners is right. In the first query, all cats records would be returned, and if there was no matching owner, the owners fields would be returned as null. For the right join, all owners would be returned, and any mismatched cats records would be null. If this isn't very clear right now, don't worry. It will make much more sense when you try it.

FULL OUTER JOIN

FULL OUTER JOIN is essentially a combination of LEFT JOIN and RIGHT JOIN. All records from both tables are returned. If there's a mismatch, the missing table record's fields are just null.

CROSS JOIN

This is the crazy one. CROSS JOIN doesn't use an ON clause. There is no condition necessary. CROSS JOIN joins every record of the first table to every record of the second table. Do not do this with large tables. This is a very expensive operation. The only time I've used these is when I need to generate "slots" of data, like displaying every hour of every day regardless of there being any data. We won't spend time on this because it's so rare, but it's good for you to know that it exists.

Aggregation

Lastly, let's cover aggregations. Often, you'll want to see some kind of summary of your data. How many cats are there? How many cats does each owner have? What is the average weight of a cat? Aggregations allow us to answer these questions.

SQL provides a number of aggregation functions, including but not limited to:

However, these functions must be combined with GROUP BY whenever you want to "bucket" your aggregates according to some field. Let's take a look at how to check the number of cats per owner:

SELECT
    owners.name as owner_name,
    COUNT(cats.id) as count_of_owned_cats
FROM owners
LEFT JOIN cats ON cats.owner_id = owners.id
GROUP BY
    owners.name;

This will show us all owners and the number of cats they own. Note the LEFT JOIN! Why do we do a left join? We want to make sure that we get all owners, even if they have no cats.

What if you wanted to count records even if there were no cats? You could use COUNT(*). This will count if a record exists, regardless of any nulls. Our query above ignores records where cats.id is null, when there is not corresponding record in cats for the current owner.

What if we want the average weight of all cats?

SELECT
    AVG(weight_in_kg)
FROM cats;

Notice that there's no GROUP BY. We're not bucketing by some field. We're just looking for the aggregate across the whole table, so GROUP BY is not necessary.

Indexes

Let's talk about database performance! When your tables have thousands of records, things can get a little slow. How do we make things fast? Indexes! "But Beege, what the hell is an index?"

You know books. In the back of a book, they often have an index. The index lists keywords in order and tells you what pages you can find them on. A database index is largely the same. You can define an index for one or more fields of a table. For example, let's say that we'll often want to find cats based on their name. We can put an index on the name field of the cats table. Internally, the database will create a list of every single value in the name field of the cats table. It will track a reference to each record containing that name value. For example, if there was a name like "Spot," then "Spot" would appear in the index along with pointers to where the records containing the name as "Spot" are located in the table. This allows the database to search the index when filtering by name, which is much faster than filtering the table. Once the database has found the value in the index, it can directly pull the corresponding records out of the table without existing work. This is what causes relational databases to be really fast.

It's also worth noting that one index can track multiple fields. This is called a "compound index." So let's say that you'll only every check the color of a cat based on its fur pattern. In that case, you'd want to create an index on both the fur pattern and color. Note that the order matters! If the index is first fur pattern then color, it can be used when filtering both fields or when filtering just the fur pattern. If you want to filter based on color alone, the database won't use this index!

Keys

Last part! Keys are like special indexes. There are two main types of keys: primary keys and foreign keys.

Every table should have exactly one primary key. A primary key is much like an index, but it's meant to serve as the main identifier in the table. You might be thinking, "like the id field?" And you'd be correct! Usually, we use an id field in the table as the primary key. In the case of linking tables (remember many-to-many relationships?), the primary key will consist of two fields. Just like compound indexes, order is important for keys when you choose multiple fields. In the case of linking tables, you'll often define a compound index on the two fields in the opposite order of the primary key. This would allow you to search from both "directions" with equal performance.

Foreign keys enforce relationships between tables. Foreign keys allow you to specify that a column in one table references a column in another table. In the example we've used here, we would have a foreign key on the owner_id column of the cats table referencing the id column of the owners table. If we tried to delete a record in owners that's referenced by a record in cats, the database would give us an error. However, you can specify that any upstream records that depend on the record to be deleted should also be deleted. This isn't very common, but you can look this up if you need it. The keyword is CASCADE, whenever you need to look it up.

Conclusion of SQL

You've learned a lot here. Don't expect to memorize all of it immediately. In fact, you'll rarely use this directly if you're working in Spring. Spring will do most of this for you. It's just important that you understand the logic of what's happening and how it can be constructed.

Enough reading. Let's get you coding!

Steps

Setup

  1. Go to https://www.db-fiddle.com/. This is a website for experimenting with and sharing database work.
  2. At the top left, select PostgreSQL v15.
  3. In the right pane, type SELECT 1;.
  4. At the top left, hit Run.
  5. At the bottom, you should see some table results, where the field name is ?column? and the value is 1. Great! You've run your first SQL query!

Creating Our Tables

  1. In the left pane, "Schema SQL," enter the below:

    CREATE TABLE cats (
        id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
        name VARCHAR NO NULL,
        fur_pattern VARCHAR NOT NULL,
        weight_in_kg NUMERIC(3, 2) NOT NULL
    );
    
    INSERT INTO cats (name, fur_pattern, weight_in_kg)
    VALUES
        ('Bo', 'solid', 3.8),
        ('Tigger', 'striped', 4.3),
        ('Spot', 'spotted', 3.6);
  2. Enter SELECT * FROM cats; in the right pane.
  3. Hit "Run."
  4. See the results in the bottom of the page.
    • What the hell did we just do?
      • Well, CREATE TABLE does what you can guess it does. The parts in parenthesis are the fields in the table.
      • GENERATED BY DEFAULT AS IDENTITY is used for IDs in PostgreSQL. It automatically counts up from 1, and you don't need to insert a value. The BY DEFAULT part allows statements to set a value if desired. You can disable this to allow only PostgreSQL to set the value using ALWAYS in place of BY DEFAULT.
      • The PRIMARY KEY tells the database that this is our primary key.
      • Each field is separated by a comma ,. We add the newline for readability.
      • VARCHAR is short for variable-length character. In this context, "character" is characters in a string. We specify variable length instead of simply characters because simple CHAR will fill any empty space with null characters. You generally always want to use VARCHAR.
      • NUMERIC is PostgreSQL's precision decimal data type. The first number specifies the maximum number of total digits. The second number specifies how many decimals to track. So this can track from -9.99 up to 9.99. That's three total digits where two of them are after the decimal.
      • NOT NULL indicates that we can't leave this field empty. Note that PRIMARY KEY fields cannot be null by default.
    • Note that the language to define tables, DDL, is very different from database system to database system. The above would not work in MySQL or MS SQL Server!
  5. Let's create the owners table as below and populate it with some data. For later ease, put this before the cats table!

    CREATE TABLE owners (
        id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
        name VARCHAR NOT NULL
    );
    
    INSERT INTO owners (name)
    VALUES
        ('Yuhee'),
        ('Beege');
  6. In the right pane, add the statement SELECT * FROM owners;. The order doesn't matter - we're just testing.
  7. Now, let's create the relationship between cats and owners. In your CREATE TABLE cats statement, add a new column like this:
    owner_id INT NOT NULL
  8. Then add a foreign key like this (don't forget commas in the right places):
    FOREIGN KEY (owner_id) REFERENCES owners (id)
  9. Lastly, add references to owners in our cat data. Update your insert to look like below:
    INSERT INTO cats (name, fur_pattern, weight_in_kg, owner_id)
    VALUES
        ('Bo', 'solid', 3.8, 1),
        ('Tigger', 'striped', 4.3, 1),
        ('Spot', 'spotted', 3.6, 2);

Great! Now, you have some tables to work with. We just skimmed over the DDL to define the tables, but people usually just look it up when they need it. Usually, Spring will help you with setting up tables, so you don't need to worry about it too much. In any case, well done! Now you have some tables to store cats in.

Querying Data

Now that we've stored some data, let's retrieve it! Feel free to clear the queries from the right pane if you prefer to keep things clean. Here, I'm not going to tell you exactly how to write these queries. If you get stuck, you can look towards the end to see answers. Don't forget to be efficient! If a step takes you more than 10-20 minutes, just check the answer. Being right isn't important now; learning is.

  1. Write a query to get the name and fur pattern of all cats in your cats table. Run it, and check the results!
  2. Change that query to include the name of each cat's owner. Run it, and check the results.
    • Note that the name field will be ambiguous. You'll need to specify which column it belongs to. We did it in our examples.
  3. Add a filter to only show cats owned by 'Yuhee'. Run it, and check the results.
  4. Write another query that shows the number of cats each owner has.
  5. Add another query that shows the average weight of each owner's cats.

Nicely done! This shows that you have an idea of how data is stored in tables and how to retrieve it. Let's do some work with our other SQL verbs, too.

Changing Our Data

  1. Beege is allergic and can't keep his cat. Before your queries, add an UPDATE statement that changes ownership of all of Beege's cats to Yuhee. Run the queries, and make sure the data changed.
  2. Andrei is taking Tigger, and he doesn't want to be part of this system. Between your UPDATE and SELECT queries, add a query to delete Tigger. Run the queries, and make sure the new data makes sense.

Great job! You've learned the basics of SQL! There is much, much deeper knowledge, but you don't need that, yet. We kind of rushed through this because Spring JPA will do most of this work for you, but it's important to understand why and how Spring does things. Next issue, we'll get back to working on your project directly.

Query Answers

Querying Data 1

SELECT
    name,
    fur_pattern
FROM cats;

Querying Data 2

SELECT
    cats.name,
    fur_pattern,
    owners.name as owner_name
FROM cats
JOIN owners ON cats.owner_id = owners.id;

Querying Data 3

SELECT
    cats.name,
    fur_pattern,
    owners.name as owner_name
FROM cats
JOIN owners ON cats.owner_id = owners.id
WHERE owners.name = 'Yuhee';

Querying Data 4

SELECT
    owners.name,
    count(cats.id)
FROM cats
JOIN owners ON cats.owner_id = owners.id
GROUP BY owners.name;

Querying Data 5

SELECT
    owners.name,
    avg(cats.weight_in_kg)
FROM cats
JOIN owners ON cats.owner_id = owners.id
GROUP BY owners.name;

Changing Our Data 1

UPDATE cats
SET owner_id = 1
WHERE owner_id = 2;

Changing Our Data 2

DELETE FROM cats
WHERE name = 'Tigger';

Other Notes

On NoSQL Databases...

If anyone ever says, "Let's use MongoDB because it's flexible," stop them! This is a very common misconception. It is true that changing the structure of data stored in MongoDB is very easy. However, MongoDB requires that you understand how you'll use the data before you start storing it. This is because you should structure the data according to how you'll read it. With relational databases, the structure of the data is fixed, but usage is flexible such that you don't need to know in detail how you'll read it. My team is currently going through the pain resulting from this mistake. This is a very common mistake in startups, so be aware in order to prevent it.

YUHEE1984 commented 1 month ago

I'm not going to give 'Tigger' to Andrei.

YUHEE1984 commented 16 hours ago

Review completed