TheOdinProject / curriculum

The open curriculum for learning web development
https://www.theodinproject.com/
Other
9.8k stars 13.15k forks source link

Javascript: HashMap Data Structure: Illogical Content Order and Phrasing #27174

Open columk1 opened 8 months ago

columk1 commented 8 months ago

Checks

Describe your suggestion

I found a number of phrases in this lesson to be ambiguous or misleading.

1. Use of nodes before linked list is mentioned.

To get a value using a key, we put each entry inside a bucket as a Node item, which holds both the key and the value. To retrieve the value, we hash the key and calculate its bucket number. If the bucket is not empty, then we go to that bucket and compare if the Node’s key is the same key that is already in the bucket. If it is, then we can return the Node’s value. Otherwise, we return null.

It's not clear what a "Node item" is here since at this point we haven't been told about the linked list. It's also unclear why the key would be stored in the node if there is only one value per key. The whole process described here seemed out of place to me. It's hard to understand what a hashmap actually is when the descriptions are already factoring in collision resolution strategies before the idea of collisions has been presented.

Suggestion I think this whole "Buckets" section is ambiguous and that all references to nodes before the collisions section should be removed. The core concept of a hashmap should be presented first, i.e. describe it as an array where the index is the hashed key and the value at that index is just the value, not a node. The value gets overwritten if a new key is added with the same hash.

If we really want to introduce the idea of storing the key and value together before the idea of the linked list, we can include it as a potential way to protect against overwriting a value when a key produces the same hash as another key. However, I think it should either be moved to the next section or mentioned as a brief introduction to collisions with a caveat that a much better method will be described in the next section.

Following this, in the Collisions section, we can introduce the idea of collisions and collision resolution. Once the idea of the linked list is presented, we can say that now the array will hold pointers to nodes instead of the values themselves, and that each node will include both the key and the value. Then we can describe checking the keys, checking the next node, returning null if not found etc.

2. Confusing syntax.

We are advised to check if the "Node's key is the same key that is already in the bucket" and to return null if it is not. This is also confusing to me. Is there a key in the bucket aside from the one in the node? Presumably the intended meaning is to check if the node's key is the same key that was used for the retrieval. Or the same key that was hashed to get the index of the bucket.

3. Capitalization of node and nodes.

These words appear 12 times and are capitalized 8 times, with no clear pattern. I would assume they should all be lower case unless in the context of specifically saying to create a Node class or factory function and then referring to it. The following section is the clearest example:

As we continue to add nodes into our buckets, collisions get more and more likely. Eventually, however, there will be more nodes than there are buckets, which guarantees a collision (check the additional resources section for an explanation of this fact if you’re curious).

Remember we don’t want collisions. In a perfect world each bucket will either have 0 or 1 Node only, so we grow our buckets to have more chance that our Nodes will spread and not stack up in the same buckets. To grow our buckets, we create a new buckets list that is double the size of the old buckets list, then we copy all nodes over to the new buckets.

The first issue is broad and will probably require input from a maintainer if it is considered, but I can create a PR for these last two issues. Let me know if everything should be capitalized or lowercase, or if there is a heuristic to use for this. Likewise, please advise if there are any other suggestions for the phrasing in the second issue.

Path

Node / JS

Lesson Url

https://www.theodinproject.com/lessons/javascript-hashmap-data-structure

(Optional) Discord Name

columk

(Optional) Additional Comments

No response

JoshDevHub commented 8 months ago

Thank you for making this issue and including thorough descriptions of your problems + proposing solutions. Awesome stuff @columk1 :raised_hands:

I think I agree on all points. You're correct that your first item is going to be the trickiest to get through. I'm going to reread the full lesson, think about your suggestions, and then we can collaborate on a plan for that.

In the meantime, if you want to go ahead and put in a PR for those other two things, feel free to do so. Just two points regarding that:

  1. Under the Issue heading in the PR template, say Related to #27174 instead of Closes #27174. This will prevent this issue from being automatically closed when your PR is merged. We still want to keep this issue live for dealing with your first point.
  2. This lesson also exists in the Ruby course and has mostly the exact same content (just with code examples given in Ruby). It has the same issues with "node" capitalization and the confusing bit about comparing keys. Would you be okay with fixing those issues in the Ruby lesson as well? You shouldn't have to know any Ruby to do this.
silenceinspace commented 8 months ago

Hi there. Since a bit older relevant issue #27103 (also improvements for the lesson-project combo) did not get any response later in the discussion, I would like to mention it here once again. I hope you don't mind me doing it. I personally believe it may be helpful to handle issues for TOP in that way so as not to end up with 90-ish ones :D (I do not have any experience with being a regular contributor with write access on a large project like Odin. So the number of issues looks a bit overwhelming to me. Would you say it has a lot to do with the scarcity of competent people?)

mark-P0 commented 8 months ago

Use of nodes before linked list is mentioned.

It's not clear what a "Node item" is here since at this point we haven't been told about the linked list.

About these points specifically... the concepts of Nodes and Linked Lists were explored in the lesson and project just before the Hash Map project. So I guess it assumes the reader has gone through those and is already familiar with them. Whether those assumptions should be made, I do not have an opinion on.

columk1 commented 8 months ago

the reader has gone through those and is already familiar with them

I didn't mean awareness of linked lists in general. I mean that students aren't expecting a linked list here. They won't know that using a node in this case is a specific implementation of a hashmap that is used to prevent collisions, different from a regular hashmap. My suggestion is to describe a regular hashmap first, then we can present this implementation of a linked hashmap after the issues with the original are discussed along with the ideas of collisions and collision resolution.

github-actions[bot] commented 7 months ago

This issue is stale because it has had no activity for the last 30 days.

JoshDevHub commented 6 months ago

Alright sorry for my speed on getting back to you about all of this, but I have some bandwidth now to work and assist with this task.

Are you still able to work on improving this lesson @columk1 ?

columk1 commented 5 months ago

Yes, I can work on this. I'll look back over the lesson and make some notes. Let me know how we can collaborate.

JoshDevHub commented 5 months ago

Great to hear @columk1

I'll wait for your notes and give some of my own thoughts. Broadly, I think the idea you lay out in the opening post is a good one. Terminology around nodes and linked lists should await the section on collisions. I would endorse a structure that does something like:

... Omitting earlier content

Buckets

Mention hashing to generate an index position for the hash map's internal array structure.

(the insertion order note can probably just be moved to the project, since it's more relevant there)

Collisions

What is a collision and why is it a problem (different keys can hash to the same position, causing unpredictable overwrites)

Resolving Collisions

Can introduce linked list and node stuff here as a strategy for managing collisions. Can also draw attention to how collisions reduce efficiency of reads/writes to the hash map. This reinforces the importance of good hashing algorithms that achieve high variability among the possible range of values (can maybe move this "Sara"/"raSa" example to this section). This also leads into the next section...

Growing the Hash Map

Explore why this is a good idea and how it is done. #27682 raises a nice point about the way this information is currently given.

columk1 commented 5 months ago

Sounds good. I would make a few changes to the first few sections too. The key does not have to be stored or checked until the Linked List is introduced.

Introduction

A hash map takes in a key value pair, produces a hash code ~from the key~, and stores the ~pair~ value in a bucket.

Also include some more high level description. Something like "A hash table supports various operations including insertion, retrieval and removal." Also something like "It can be a highly efficient data structure. The time complexity of a hash table lookup is O(1) i.e. constant time."

Use Cases

This number will serve as the index to the bucket that will store the ~key value pair~ value. More on buckets in the next section.

Buckets

Likewise, remove mentions of nodes and storing/checking keys.

The hash function returns a number that serves as the index of the array at which we store ~this specific key value pair~ the value that corresponds to this key.

The examples for insertion and retrieval should simply be hashing the key and then either returning the value at that index or replacing it with a new value.

Also, the three paragraphs after the examples don't really make sense:

  1. Maybe you are wondering, why are we comparing the keys if we already found the index of that bucket? Remember, a hash code is just the location. Different keys might generate the same hash code. We need to make sure the key is the same by comparing both keys that are inside the bucket.

This is referring to collisions and should be moved to the next section.

  1. This is it, making this will result in a hash table with has, set and get.

This sentence is a bit strange. Also, these specific functions are discussed in the project. Two pseudo code functions (insertion and retrieval) were discussed in the examples above.

  1. What if we found the hash code, but also the key value is the same as what we already have in the bucket? We check if it’s the same item by comparing the keys, then we overwrite the value with our new value. This is how we can only have unique values inside a Set. A Set is similar to a hash map but the key difference (pun intended) is that a Set will have nodes with only keys and no values.

Can move this to the collision section. It could be replaced with a brief note about values being overwritten when inserting at an existing index.

I think the note on insertion order is fine here.

Collisions

We have another problem that we need to deal with: collisions. A collision occurs when two different keys generate the exact same hash code. Because they have the same hash code, they will land in the same bucket.

This intro doesn't make sense since this has already been discussed above but if we move some of those paragraphs from the buckets section to this section it should be fine. I would add that even if two different keys don't generate the same hash code they can still generate the same index and cause a collision.

Main change: Introduce the idea of using a Linked List in the 'Dealing with Collisions' section or a under a new heading, and only then say that we now need to store the original key along with the value. Mention the reason - so that we can tell which node to stop at when we search through the list.

I agree with Josh about mentioning the reduced performance of a linked list compared to a hash table. A hash map with a few large buckets will be much slower than a hash map with lots of small buckets. We could mention the time complexities. One bucket with a long linked list would be O(n), not O(1). Could also mention some features of a good hash function (Should use all input data, should uniformly distribute data across the entire set of possible hash values, etc...).

Include the paragraph about the Set object from above at the end of this section.

Growth of a Hash Table

To grow our buckets, we create a new buckets list that is double the size of the old buckets list, then we copy all nodes over to the new buckets.

To address #27682, This could be changed to "then, using the insert operation, we copy all of our nodes over to the new buckets" or "then we insert each node in to the new list". Something to make it explicit that the buckets won't have the same index as before.

Mclilzee commented 5 months ago

I don't believe we should tie the word Node with a linked list. In CS the word Node is broadly used to provide information to the reader that this is a struct kind of an object or class that holds data. Take the DOM as an example, the word Node is used extensively to talk about HTML elements. Another example is the runtime Node.js is not tied to a Linked list. People had to learn what is a node during Linked List too, it didn't come with it.

As of moving things around to be after mention of collision that is a bit of a problematic approach. We cannot really talk about collision without understand buckets, and checking for keys when we look at nodes is important concept for a bucket and not tied to collision. Collision detection just safe guard us into overwriting and fitting more data into our hashmap. We still need to check if the key were the right one saved in our bucket even if we don't have collision yet.

a Hashmap is a data structure that is understood as a whole, it is a bit difficult to understand some parts without seeing the full picture, but at what point would be simplifying it too much make us start losing important information.

columk1 commented 5 months ago

Looking through other educational resources I only see nodes mentioned when separate chaining is introduced. Likewise I mostly see mention of keys being stored at that point too. Lots of diagrams showing this change - first storing the value, then the key and value together for separate chaining. en.wikipedia.org/wiki/Hash_table for example. That's why I was so confused when I first read the lesson. However, I understand if others think it's simpler to start with something closer to what is taught at the end of the lesson. Certainly most hashmaps in the wild store the key and value together.

Regarding collisions, is checking keys not tied to collision? If no two keys could have the same hash then it wouldn't be necessary to check. From the buckets section in the lesson:

Maybe you are wondering, why are we comparing the keys if we already found the index of that bucket? Remember, a hash code is just the location. Different keys might generate the same hash code. We need to make sure the key is the same by comparing both keys that are inside the bucket.

And then the same idea is repeated later as an introduction to collisions:

We have another problem that we need to deal with: collisions. A collision occurs when two different keys generate the exact same hash code. Because they have the same hash code, they will land in the same bucket.

Mclilzee commented 5 months ago

Regarding collisions, is checking keys not tied to collision? If no two keys could have the same hash then it wouldn't be necessary to check. It is a yes and no. A hash map is not complete without collision handling, but likewise a partially complete hash map can work "although none functional data structure". And it will depend on your rules to what to do with that hash map buckets. We could have multiple options, even without collision we might still want to see if this map has a specific key, if I don't store the key to compare it then I might get a false positive that my bucket contains a key but what is happening is the new key hashing to the same value without being equal to the old key.

You even might want to overwrite the old key only when it's the same key, or only if it's a different key all these operations can only work by comparing both keys. Remember that a hash code is just the bucket location, it doesn't explain anything about the key, value pairs being stored there. Removing the key from the equation would just make the whole structure seem to be just an array. Where you're writing a hash function to find an array index to insert your things into which is a different data structure.

In some languages this would be called a Dictionary so Let me illustrate my point using a dictionary. If you want to translate a word, you look up the first letter of the word, and then you find out on the first page which page that letter is on (simulating hashing the word to a location). You go to that page, and you find out that this page is empty, and it only has 1 translation in it. Would you be comfortable only looking at the definition? Definition of what exactly? What if the word that this dictionary holds is not the same word you want to translate although they both hash to the same value and the page has no collisions in it, you still want to make sure it's your word.

TL;DR A collision solution is just meant to allow us to store multiple values per bucket, so that we are not limiting to overwriting and locking down our buckets when there are multiple keys with the same hash. Doesn't mean our buckets can't work without collision solutions.

columk1 commented 5 months ago

Where you're writing a hash function to find an array index to insert your things into which is a different data structure.

That's what a hashmap is, an associative array that uses a hashing function to determine the index. It's just a hack on an array to allow flexible keys instead of sequential indices. Each individual implementation expands on it from there. Of course, collisions must be handled in almost all cases since there's no perfect hashing function and rarely a scenario that guarantees no collisions.

What if the word that this dictionary holds is not the same word you want to translate although they both hash to the same value and the page has no collisions in it, you still want to make sure it's your word.

This is a collision.

TL;DR A collision solution is just meant to allow us to store multiple values per bucket

I think you mean multiple keys that produce the same hash. There are many other collision resolution strategies based on open addressing that only store one value per bucket.

Anyway, regardless of any particular implementation, my main point is that the core idea should be introduced first, as is done elsewhere. Then we can mention that some keys will produce the same hash and introduce storing and checking keys, using linked lists etc. to account for this.

Looking back over the lesson, I think if something like "We also store the keys so that we can check them later, given that two unique keys could produce the same hash code." was inserted at the beginning of the buckets section, it would be fine to leave the pseudo code examples and mentions of storing keys as they are currently minus the mention of nodes. So instead of pushing everything to the collisions section, we could introduce collisions even sooner so that the pseudo code examples make sense as they are.

JoshDevHub commented 5 months ago

I can kind of go either way on this. Even without collision concerns, storing keys would be useful. If you don't, then things like Object.keys() or for...in can't work, and those have important utility that's completely unrelated to resolving collisions.

But I also see a lot of value in simplifying the upfront explanation. The pseudo code for accessing a value can be placed after collision resolution has been discussed if we want, this way, it's fully accurate to the actual implementation.

github-actions[bot] commented 4 months ago

This issue is stale because it has had no activity for the last 30 days.