erikamaker / pickaxe

Study notes and exercises from Programming Ruby 3.2

Chapter 4 Notes Feedback #6

Closed ianterrell closed 1 year ago

ianterrell commented 1 year ago

The text explains that, naturally, most programs handle collections of data.

There's a web dev joke that all software is just about transforming one form of JSON to another form of JSON.

Each object reference holds a position in the array, and it's identified by an integer index. I'm not sure how deep this goes, but I'm noticing a similarity with how items are indexed in an array with how objects are held in memory.

This is why I think you're likely to become a good developer. This is insightful and goes about as deep as possible.

In a lower level language like C, arrays are implemented directly as contiguous blocks of memory. Ruby arrays at least used to do so also, and I expect they still do, albeit depending on the implementation and with many optimizations/etc.

What this means is that since you know how big the item in your array is, you only have to store the memory location of the first item in order to know where all of them are!

Simplified, given x = [5,4,3,2,1], you know that x[0] is at the memory location pointed to by x, and x[2] is at x + 2 * sizeof(integer).

This is important because it means you can find your object in constant time. Eventually you'll get to algorithmic complexity, but this is effectively as fast as possible: O(1).

That behavior is not true for all data structures (everything has tradeoffs).

I remember being confused for the longest time about two-period ranges including the end position whereas the three-period range doesn't (I know it's arbitrary, but in 2020 I thought they should be reversed--I've since grown used to it).

Other languages do it differently and perhaps more intuitively. Here's Rust: https://doc.rust-lang.org/reference/expressions/range-expr.html

They use a..b to be exclusive, and a..=b to be inclusive.

Which language uses which syntax is something I have to remind myself of all the time. Knowing that both exist and that you have access to both is the most important part; the exact syntax can always be looked up.
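For reference, here's what the two Ruby forms do in irb:

irb(main):001:0> (1..5).to_a
=> [1, 2, 3, 4, 5]
irb(main):002:0> (1...5).to_a
=> [1, 2, 3, 4]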

I learned that the [] pairing is the bracket operator. ... The [] operator has a corresponding []= operator, ...

Can you implement a type that has its own versions of these methods? Perhaps something that delegates to an array internally?

Like, something where this would work:

irb(main):039:0> hello = Word.new "hello"
=> hello
irb(main):040:0> hello[0]
=> "h"
irb(main):041:0> hello[0] = 'y'
=> "y"
irb(main):042:0> hello
=> yello

(The only other method that powers that is inspect.)
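One hypothetical way a Word class like that could be written (the thread never shows the actual implementation) is to delegate to an internal array of characters:

class Word
  def initialize(string)
    @letters = string.chars     # store the word as an array of characters
  end

  def [](index)
    @letters[index]             # delegate reads to the internal array
  end

  def []=(index, value)
    @letters[index] = value     # delegate writes, too
  end

  def inspect
    @letters.join               # so irb displays the word itself
  end
end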

I can use a %w ...

Also always helpful to type out in a REPL (irb) to see what it looks like directly:

irb(main):043:0> %w[ant bee cat dog elk]
=> ["ant", "bee", "cat", "dog", "elk"]
irb(main):044:0> %i[ant bee cat dog elk]
=> [:ant, :bee, :cat, :dog, :elk]

Arrays are useful in many ways. I can use them as stacks, sets, queues, dequeues, and FIFO queues.

This is getting into data structures, and is a useful line of inquiry as you keep learning. If you're into it, it's worth trying to build wrapper classes for each data structure listed. e.g.

irb(main):056:0> stack = Stack.new
=> #<Stack:0x000000010398abf8 @data=[]>
irb(main):057:0> stack.push 1
=> [1]
irb(main):058:0> stack.push 2
=> [1, 2]
irb(main):059:0> stack.push 3
=> [1, 2, 3]
irb(main):061:0> stack.pop
=> 3
irb(main):062:0> stack
=> #<Stack:0x000000010398abf8 @data=[1, 2]>
irb(main):078:0> x = MySet.new
=> #<MySet:0x000000010054b8b0 @data=[]>
irb(main):079:0> x.add "a"
=> nil
irb(main):080:0> x.add "b"
=> nil
irb(main):081:0> x.members
=> ["a", "b"]
irb(main):082:0> x.add "b"
=> ["a", "b"]
irb(main):083:0> x.members
=> ["a", "b"]

It's not something you really do much of in Ruby, but it's worth thinking about and understanding.
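For reference, here are hypothetical implementations consistent with the transcripts above (the thread never shows the originals):

# A stack that only exposes push/pop, delegating to an internal array.
class Stack
  def initialize
    @data = []
  end

  def push(item)
    @data.push(item)
  end

  def pop
    @data.pop
  end
end

# A tiny set: members stay unique. New items are stored (returning nil),
# and adding an existing member just returns the current members.
class MySet
  def initialize
    @data = []
  end

  def add(item)
    return @data if @data.include?(item)   # duplicate: nothing to add
    @data.push(item)
    nil
  end

  def members
    @data
  end
end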

Hashes are advantageous in that they can use ANY object as an index. Ruby remembers the order in which an item is added to a hash. When you iterate over the entries, Ruby will return them in that order.

This surprised me-- don't arrays behave the same way? Don't they remember the order in which something is added to the array? If something is pushed to an array, it's always pushed to the end. What other way would it order them (besides by index, starting from the beginning)?

The author wrote that because it's definitely different from other languages. Different languages make different guarantees about how they order associative arrays/hashes/dictionaries:

The key insight here is that arrays are naturally ordered, as they're indexed by position. x[1] is after x[0], both obviously and because they're stored in memory that way.

But which is first in a hash, x[:foo] or x[:bar]? Uh, what does "first" even mean in a hash?

In this case, they might not be stored in memory in that order; :foo's hash function might return 2982782635 and :bar's might return 2983785918, and then who knows how those numbers get mapped to memory locations.

Real world analog: If I gave you a bunch of labeled folders and asked you to sort them, you'd have to ask me... "sort them by what?" Alphabetical by label? Alphabetical by the contents of the folder? By how many sheets of paper are in each folder?

And then if I replied, "No no no, just the natural way" that probably wouldn't do much to clear it up. "Well obviously the natural way is the order in which I handed them to you."

Maybe it's natural, maybe it's not natural; it's a matter of opinion. Different languages take different opinions. If a language makes guarantees about it, you can write your code to take advantage of them.

erikamaker commented 1 year ago

> There's a web dev joke that all software is just about transforming one form of JSON to another form of JSON.

Looks like I have some more to learn. I've seen the term JSON floating around, and know it refers to data and JavaScript, but not much else.

This is insightful and goes about as deep as possible.

It's really rewarding to realize those connections, thank you for pointing out that they're happening.

Simplified, given x = [5,4,3,2,1], you know that x[0] is at the memory location pointed to by x, and x[2] is at x + 2 * sizeof(integer). ... That behavior is not true for all data structures (everything has tradeoffs).

Would this be the case with structs? I was just learning about those in chapter 4.

If you're into it, it's worth trying to build wrapper classes for each data structure listed. e.g.

I poked around to try and learn what a wrapper class is and how to implement one, but a couple sources said you can't? I tried to replicate what you did in IRB but I must be misunderstanding something because it's saying the classes don't exist. Instead, I went and learned a little more about each data structure to see if that helped bridge the gap.

Maybe it's natural, maybe it's not natural; it's a matter of opinion. Different languages take different opinions. If a language makes guarantees about it, you can write your code to take advantage of them.

In that article Paul Ndemo wrote, he defined the data structure classes himself. Are these structure classes just something to abstractly conceptualize, but ultimately tangibly define myself? The main takeaway seems to be that data structures are ways of organizing stored data, where its type determines how items are added or removed. I'm just a little confused by the wrapper classes concept!

Thank you for your feedback :-)

ianterrell commented 1 year ago

Looks like I have some more to learn. I've seen the term JSON floating around, and know it refers to data and JavaScript, but not much else.

Always more to learn! For myself, too.

JSON originates in JavaScript ("JavaScript Object Notation") but is not really about the language anymore. It's simply one way of representing data.

You might communicate data about people in XML like:

<people>
  <person>
    <name>Alice</name>
    <age>30</age>
  </person>
  <person>
    <name>Bob</name>
    <age>40</age>
  </person>
</people>

You might do the same in JSON as:

{
  "people": [
    { "name": "Alice", "age": 30 },
    { "name": "Bob", "age": 40 },
  ]
}

In the modern web, most remote APIs communicate via JSON. Most of your mobile apps are talking to their servers in JSON, GitHub's API uses JSON, etc, etc.

It strikes a decent balance between being human readable, concise, compressible, and, most importantly, widely supported by tooling. It's the lingua franca of data transfer on the web these days.

But there are plenty of other competing things that have their own niches for data representation: XML is prevalent in big companies and legacy software and has some super powerful capabilities, YAML is used a lot for non-transferred representations of data in the Ruby world (e.g. Rails config), binary formats like protobuf for high-speed data transfer, TOML for configuration, on and on.

I'm noticing a similarity with how items are indexed in an array with how objects are held in memory.

This is insightful and goes about as deep as possible.

It's really rewarding to realize those connections, thank you for pointing out that they're happening.

Here's another fun question for you that relates to the above and all the rest of the data structure discussion as well.

Why is removing from the front of an array called shift and removing from the back called pop? Which one is faster and why? What about adding to front vs back?

Simplified, given x = [5,4,3,2,1], you know that x[0] is at the memory location pointed to by x, and x[2] is at x + 2 * sizeof(integer). ... That behavior is not true for all data structures (everything has tradeoffs).

Would this be the case with structs? I was just learning about those in chapter 4.

These are related things, but not exactly what I meant.

I don't actually know how Ruby's Struct type is implemented under the hood; Ruby is very high level. But assuming you mean something like:

X = Struct.new :foo, :bar
y = X.new 1, 2
# => #<struct X foo=1, bar=2>

Generally in programming languages, if you define a type (a struct, a class, etc.), the compiler or interpreter knows where the type's data members live relative to the start of each instance. So (hypothetically), Ruby easily and quickly knows that y.foo is located in memory at, say, "the address of y" and that y.bar is located at "the address of y + sizeof(foo)".

The problem here is that Ruby is just super dynamic and super high level and I don't actually know what it's doing here. In a "simpler" language such as C, it would be the case that based on the struct definition the compiler knows exactly how to find each data member in memory.

But generally what I meant is about finding the same sort of thing in a collection of things. In an array you know exactly where in memory a member is based on the array index. In a hash you (mostly) know where a value is based on the hash of its key (mostly because what if two keys have the same hashcode?). Whereas in something like a tree structure there is no easy index or key to identify where an item is.

I poked around to try and learn what a wrapper class is and how to implement one, but a couple sources said you can't? I tried to replicate what you did in IRB but I must be misunderstanding something because it's saying the classes don't exist. Instead, I went and learned a little more about each data structure to see if that helped bridge the gap.

It was a hastily written set of instructions.

By "wrapper class" I meant "A class that uses an array internally but exposes its own methods to manage it."

If you do something like stack = Array.new then you can stack.push and stack.pop to use it like a stack! But... you can also stack.shift and stack.unshift, so nothing enforces stack-like behavior. Whereas if you write your own Stack class, you can write push and pop methods but omit other methods that would use your data structure "incorrectly."

Ruby is — as always — a funny one to learn first because it's so powerful and so high level that exercises like this might not really make sense. It's more important to know how to do these things in other languages.

That said, in other ecosystems data structures will use the words push and pop to not just mean "to and from the back." e.g. a FIFO queue will often also use pop to mean "next item", even if it comes from "the front".

So in the examples I wrote small classes `Word`, `Stack`, and `MySet`. The idea is that, if you wanted, you could try to write your own classes that would work that way.

I do see in an example Paul Ndemo wrote on medium:

I just skimmed that and it looks decent! What he did there is exactly the exercise I was suggesting.

Stacks are linked lists that allow us to store and retrieve data in LIFO order.

They can be implemented with linked lists, but don't have to be! Eventually you may want to learn a little C and write some linked lists and all of these data structures by hand.

Sets are a collection of unordered data that's similar, unique, and can't be duplicated.

I don't know your math background, but think mathematical set. An item is either a member of a set or not a member of a set, but cannot be "twice a member" of a set. The number 1 is in the set of all integers, but the set of all integers does not contain the number 1 twice.

You can build this out in code and it can be useful because you can do mathematical set operations: union, intersection, difference, etc.
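Ruby's standard library Set (which comes up again below) supports those operations directly, e.g.:

require "set"

a = Set[1, 2, 3]
b = Set[3, 4, 5]

a | b      # union        => #<Set: {1, 2, 3, 4, 5}>
a & b      # intersection => #<Set: {3}>
a - b      # difference   => #<Set: {1, 2}>
a.add(3)   # adding an existing member is a no-op
a          # => #<Set: {1, 2, 3}>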

Would this be good for usernames, or maybe passwords? Or, would hashes be better for login credentials (username key / password value)? And, if hash, how do you encrypt something like that?

Oooh, fun questions.

Usernames: logically, a set would be a good data structure to store them. Pragmatically, we would put them in a database table and use a database constraint to ensure that every item in that column is unique.

Passwords: logically, two people can have the same password so a set would not work here. Pragmatically, we store password hashes and do not store the raw passwords. How do you encrypt the hashes? You don't; or, the hash is its own encrypted value.

Hashes have the property of being one way only. Let's take a really shitty hash function: count the number of characters. I can hash "password" into the value 8, but I cannot recover the value "password" from the number 8 (it could also be "abcdefgh" or anything else).

So for logging in we generally do this: get the user's username and password, look up the person via the username (unique), and then hash the password and compare it to the stored hash. (Our hash function above is unsuitable for this use case for a variety of reasons, but among them it would authenticate the user with any 8 character password.)
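As a rough sketch of that login flow (hypothetical code using a plain SHA-256 digest from the standard library; real systems use a slow, salted hash such as bcrypt):

require "digest"

# Hypothetical store: unique username => stored password digest (never the raw password)
users_by_username = {
  "erika" => Digest::SHA256.hexdigest("horsesocks1")
}

def authenticate(users, username, password)
  stored_digest = users[username]                      # look the person up by unique username
  return false unless stored_digest
  Digest::SHA256.hexdigest(password) == stored_digest  # hash the attempt and compare digests
end

authenticate(users_by_username, "erika", "horsesocks1") # => true
authenticate(users_by_username, "erika", "wrong")       # => false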

There are a lot of nuances here and vectors of attack to defend against.

I couldn't find Set in ruby-doc.org, but to be fair, the word "set" comes up a lot. Am I maybe using the wrong source?

It's part of the standard library but as a separate module (which is apparently now auto-included as of 3.2).

https://ruby-doc.org/3.2.1/stdlibs/set/Set.html

To see it from the main page https://ruby-doc.org/3.2.1/ you have to scroll down really far and you can see it on the left-hand menu. The on-page search doesn't seem to search through the whole of the standard library, unfortunately, and your browser search is useless 'cause, as you say, the word comes up a lot.

I found it via Google in this case. site:ruby-doc.org class set

In that article Paul Ndemo wrote, he defined the data structure classes himself. Are these structure classes just something to abstractly conceptualize, but ultimately tangibly define myself?

In your average Ruby development, it probably doesn't matter much. :) Arrays and hashes will get you there, as is evidenced by the fact that the article's types are all just leveraging array functionality.

In some use cases and in some other domains and in some other programming languages, it's important to know what these terms mean and what these data structures are called and when and why they might be used and their performance characteristics. For most senior roles it would be considered required knowledge.

From a skills perspective and to reinforce learning, they're also useful to write. That said, in most real world use cases there are highly optimized library implementations of everything you need, so it comes down to knowing how to pick what to use.

But I also find them fun. :)

The main takeaway seems to be that data structures are ways of organizing stored data, where its type determines how items are added or removed.

You have two use cases there — adding and removing — but there are more. The biggest one is: how do you find an item?

If I have a data structure full of User instances (each with a username, a password hash digest, a legal name, an email address, a phone number, an address, an age, etc, etc...) how do I find the one I want given some subset of data? e.g. this powers the use case of "Here is my username and password, please log me in." Now I have to find the user via the username so that I can find the password digest to see if it matches the password they gave me.

If the users were stored in an Array, you'd have to look through the items one by one to find the user. If the users were stored in a hash keyed by username, you just look them up via the username.

Then there are the performance questions in both time and space. Looking up the user in the array is slow in time, but stores the least amount of data possible. Looking up the user in the hash is fast in time, but now you have to store additional data for the keys.
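A tiny hypothetical sketch of those two lookups:

# Hypothetical User type with just the fields we need for the example
User = Struct.new(:username, :password_digest)

users_array = [User.new("alice", "..."), User.new("bob", "...")]
users_by_username = users_array.to_h { |u| [u.username, u] }

# Array: scan items one by one -- time grows with the number of users
users_array.find { |u| u.username == "bob" }

# Hash: jump straight to the entry via its key -- roughly constant time,
# at the cost of storing the extra key data
users_by_username["bob"]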

And the same is true for all of the operations beyond finding: how fast or slow is it to insert, to remove, to sort?

So in the general case about data structures the questions come down to: what kind of data am I storing? How much of it? How am I going to be using it on average? What kind of performance can I tolerate?

But in the Ruby on Rails world the answer to the final question is usually, "eh, it's fine" so you don't think about it that much.

There are just other worlds. In AAA game engine development for instance, this is critical stuff.

I'm just a little confused by the wrapper classes concept!

I just meant like the article did, but on your own. :) An exercise to get used to doing so, and thinking about it, and learning Array and Hash methods.

erikamaker commented 1 year ago

>In the modern web, most remote APIs communicate via JSON. Most of your mobile apps are talking to their servers in JSON, GitHub's API uses JSON, etc, etc. It strikes a decent balance between human readable, concise, compressible, and most importantly widely supported by tooling. It's the lingua franca of data transfer on the web these days.

It looks kind of like embedded hashes. Is that a coincidence?

Why is removing from the front of an array called shift and removing from the back called pop? Which one is faster and why? What about adding to front vs back?

Is it because removing something from the front of an array shifts each remaining item's place in line, whereas popping only removes the end? Would the latter be faster for that reason, because there would be fewer items to re-index?

The problem here is that Ruby is just super dynamic and super high level and I don't actually know what it's doing here. In a "simpler" language such as C, it would be the case that based on the struct definition the compiler knows exactly how to find each data member in memory.

That makes sense. The higher the level of language, the easier it is to read (as a human), but problems can occur if you're not careful. Like with human language, we have miscommunications all the time, but it's that wiggle room that lets us communicate a little more flexibly, if responsibly.

I don't know your math background, but think mathematical set. An item is either a member of a set or not a member of a set, but cannot be "twice a member" of a set. The number 1 is in the set of all integers, but the set of all integers does not contain the number 1 twice.

I only ever got as far as statistics in college. I want to learn discrete mathematics and someday calculus. Also, it makes sense, I see what you're saying.

Then there are the performance questions in both time and space. Looking up the user in the array is slow in time, but stores the least amount of data possible. Looking up the user in the hash is fast in time, but now you have to store additional data for the keys.

Which could be suitable, if the alternative was iterating through the entirety of a gigantic collection rather than looking up that single key (right?). I imagine there's a kind of art to determining what's going to work best, and that there might be "tipping points" to point a developer in one direction over another.

Hashes have the property of being one way only. Let's take a really shitty hash function: count the number of characters. I can hash "password" into the value 8, but I cannot recover the value "password" from the number 8 (it could also be "abcdefgh" or anything else).

To me it sounds like, if someone knew the key, they could get the password. Or, are you saying it's highly unlikely someone would have that key? I think I don't understand how plain text --> hash function --> encryption works.

erikamaker@erikas-computer:~$ irb
irb(main):001:0> h = {:password => "horsesocks1"}
=> {:password=>"horsesocks1"}
irb(main):002:0> h[:password]
=> "horsesocks1"
irb(main):003:0> 

This passage, taken from a couple of paragraphs of the article you shared (below), feels related:

It's easy and practical to compute the hash, but "difficult or impossible to re-generate the original input if only the hash value is known." ... a cryptographic hash function... must be pre-image resistant. So if an attacker knows a hash, it is computationally infeasible to find any input that hashes to that given output.

Rails has built-in support for doing it the right way: https://api.rubyonrails.org/classes/ActiveModel/SecurePassword/ClassMethods.html#method-i-has_secure_password

I see. So we don't reinvent the wheel every time we implement a system for passwords. That makes a ton of sense, and I don't know why I assumed it was something that needed to be written from the ground up every time. They must be good at what they do, if they're so widely available to learn about and utilize.

You have two use cases there — adding and removing — but there are more. The biggest one is: how do you find an item?

I've used SQL a little bit in school to search for and manage data. I also know that, for things like a Ruby array, you can use methods like find to literally find the item you're searching for, and call a block on it (I think I'm saying that right, but let me know if I'm not). So, in some circumstances, you can alter data, in addition to removing or adding it.

In your average Ruby development, it probably doesn't matter much. :) Arrays and hashes will get you there, as is evidenced by the fact that the article's types are all just leveraging array functionality.

I was looking at Dragon Ruby and noticed they not only take advantage of arrays, but celebrate them.

I just meant like the article did, but on your own. :) An exercise to get used to doing so, and thinking about it, and learning Array and Hash methods.

I see, I totally overthought that. Well, the article kind of did more or less what I would have done anyway. But, I can definitely see what's happening in them. For the Queue model, I can see that we're leveraging the array type (to steal your phrasing) to make it behave in a queue-like fashion. Let me know if I'm off about this:

class Queue                  # Define the data collection type
  attr_reader :items         # Make a readable instance variable @items

  def initialize
    @items = []              # At the instance's construction, assign an empty array to @items
  end

  def enqueue(item)          # Accepts the item the user intends to add
    @items.unshift(item)     # Places that item at the front of the @items array
  end

  def dequeue
    @items.pop               # Removes the last item from the array (the oldest one enqueued)
  end

  def size
    @items.length            # Returns the number of items in the array
  end
end

Thanks for your feedback as always :-)

ianterrell commented 1 year ago

JSON

It looks kind of like embedded hashes. Is that a coincidence?

Not a coincidence, really! JavaScript is a bit different than other languages, but JavaScript "objects" are also its implementation of a hash map. So the "object notation" is basically serializing hashes.
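For example, Ruby's standard library will parse the JSON example from earlier straight into nested hashes and arrays:

require "json"

json = '{ "people": [ { "name": "Alice", "age": 30 }, { "name": "Bob", "age": 40 } ] }'
JSON.parse(json)
# => {"people"=>[{"name"=>"Alice", "age"=>30}, {"name"=>"Bob", "age"=>40}]}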

Why is removing from the front of an array called shift and removing from the back called pop? Which one is faster and why? What about adding to front vs back?

Is it because removing something from the front of an array shifts each remaining item's place in line, whereas popping only removes the end? Would the latter be faster for that reason, because there would be fewer items to re-index?

Yes — it's simple but there are some important lessons here that might only become obvious if you pursue your studies through the topic of algorithmic complexity.

Suppose your array is implemented so that indices correspond to memory addresses, as in a C array or a simple implementation, and your items look like this in memory:

| Memory Location | Index | Value |
|-----------------|-------|-------|
| 0               | 0     | a     |
| 4               | 1     | b     |
| 8               | 2     | c     |
| 12              | 3     | d     |

In that example we can say that each item in the array takes 4 bytes, and the memory address is counted in bytes. So we know the memory location is start + 4 * index.

If you take out a then you have to "shift" the rest: copy b down, copy c down, copy d down. If you take out d, you don't have to copy any data around. Ditto for adding: adding to the front has to copy a, b, c, d up one first, to make room. Adding to the back doesn't have to copy anything, it just writes the new value at the end.

So in this case adding or removing to the front has to effectively copy the entire array! That takes an amount of time that scales with the size of the array. Copying 3 items is probably fast; copying 3 trillion items is probably slow.

Adding or removing from the back doesn't have to do any additional work. The workload is constant, independent of the size of the array.
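Here's a rough Ruby illustration of what the naive versions have to do (MRI's real Array is written in C and optimizes some of this, so treat it as a sketch of the idea rather than a description of the implementation):

# Naive "remove from the front": every remaining element gets copied down one slot.
def naive_shift(items)
  first = items[0]
  (1...items.length).each { |i| items[i - 1] = items[i] }  # copy b down, copy c down, ...
  items.delete_at(items.length - 1)                        # drop the now-duplicated last slot
  first
end

# Naive "remove from the back": no copying at all, just drop the last slot.
def naive_pop(items)
  items.delete_at(items.length - 1)
end

a = ["a", "b", "c", "d"]
naive_shift(a)  # => "a"; took length - 1 copies, a is now ["b", "c", "d"]
naive_pop(a)    # => "d"; took no copies, a is now ["b", "c"]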

Which could be suitable, if the alternative was iterating through the entirety of a gigantic collection rather than looking up that single key (right?). I imagine there's a kind of art to determining what's going to work best, and that there might be "tipping points" to point a developer in one direction over another.

It's definitely suitable to trade time and space! Different applications, and different parts of the same application, have different needs.

To me it sounds like, if someone knew the key, they could get the password. Or, are you saying it's highly unlikely someone would have that key? I think I don't understand how plain text --> hash function --> encryption works.

h = {:password => "horsesocks1"}

I think there might be a disconnect on the term "hash", which does a lot of lifting:

Ruby Hash.new or {} is called a hash because it's short for "hash map." That data structure is called a hash map because it uses a "hash function" to map from keys to values. A "hash function" takes an input and (verb) hashes it into a "hash value."

A hash function can be anything that maps things into values, really. We usually use it to mean one of... let's split a hair and say three things:

  1. An integer hash value that's useful for mapping and checking (in)equality (this is used under the covers in hash maps)
  2. A non-secure hash that uses more bits and is more useful for comparing contents
  3. A cryptographically secure hash that is considered resilient against attacks to find out the inputs

The first one is used everywhere, and is on every object in Ruby:

Object.new.hash #=> -4372797538721863374

1.hash     #=> 2297626919438411437
(2-1).hash #=> 2297626919438411437

"hello".hash #=> 935407967835860898
"hello".hash #=> 935407967835860898
("he" + "llo").hash #=> 935407967835860898

We define rules for that kind of hash: the same logically equal values must have the same hashcode. What can we tell from them? Well, if two objects have different hashes, then they cannot be equal. (If they have the same hashcode we don't know if they're equal or not, just that they could be.)

This property is useful for narrowing down lists of things very quickly: want to see if two ebook files are the same? It's way faster to compare a hash of their contents than it is to compare all of the words in them.

Are you familiar with the modulo operation? That's probably the next step toward understanding how a hash map uses that to store and retrieve data. It can be an exercise for later; it's fun.
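As a teaser for that exercise, here's a hedged sketch of the idea: a toy hash map that uses the key's .hash value, modulo a small bucket count, to decide where each key/value pair lives (nothing like the real, heavily optimized implementation):

# Toy hash map: key.hash % BUCKETS picks the bucket; eql? settles ties within it.
class ToyHashMap
  BUCKETS = 8

  def initialize
    @buckets = Array.new(BUCKETS) { [] }
  end

  def []=(key, value)
    bucket = @buckets[key.hash % BUCKETS]
    pair = bucket.find { |k, _| k.eql?(key) }
    if pair
      pair[1] = value          # key already present: overwrite its value
    else
      bucket << [key, value]   # otherwise store a new pair in this bucket
    end
  end

  def [](key)
    bucket = @buckets[key.hash % BUCKETS]   # jump straight to the right bucket
    pair = bucket.find { |k, _| k.eql?(key) }
    pair && pair[1]
  end
end

map = ToyHashMap.new
map[:foo] = 1
map[:bar] = 2
map[:foo]  # => 1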

A checksum digest hash function like md5 is used for seeing if data is what we expect, or has been corrupted, or comparing contents equality again. That's actually what would be used for the ebook example; not the integer hash. It's the same idea just with many more bits of information so that a "hash collision" is less likely.

require 'digest'
"hello".hash #=> 935407967835860898D
Digest::MD5.hexdigest "hello" #=> "5d41402abc4b2a76b9719d911017c592"

And then finally there are the cryptographically secure ones, that are just like the above but considered secure against people trying to figure out what the input was (e.g. a password).

But a hash function can be user defined, too. You can write your own, and will eventually for some use cases. The example I gave before was "number of characters".

irb(main):001:1* def bad_hash(x)
irb(main):002:1*   x.length
irb(main):003:0> end
=> :bad_hash
irb(main):004:0> bad_hash "password"
=> 8
irb(main):005:0> bad_hash "foobar"
=> 6
irb(main):006:0> bad_hash "abcdefgh"
=> 8

You can see that "password" hashes to 8. You can also see that "foobar" hashes to 6 — so we know that the two inputs were not equal! But if you have an 8 as a value you don't know that "password" was the input! It could have been "abcdefgh".

erikamaker commented 1 year ago

A hash function can be anything that maps things into values, really.

I see! So, hashing involves mapping a key to a value, and a Ruby hash is a data collection type that hashes. Right?

We define rules for that kind of hash: the same logically equal values must have the same hashcode. What can we tell from them? Well, if two objects have different hashes, then they cannot be equal. (If they have the same hashcode we don't know if they're equal or not, just that they could be.)

Meaning that two separate hashes, containing the same key / value pairs, are held in different spots in memory. So even though one hash can be equal to another in value, it is distinctly a different hashing of its key to that value?

Have I totally butchered that concept? 😅

Are you familiar with the modulo operation? That's probably the next step toward understanding how a hash map uses that to store and retrieve data. It can be an exercise for later; it's fun.

A bit, yes! I recently used it for a fruit harvest schedule in Bonecrawl. Looking forward to working with it more.

But a hash function can be user defined, too. You can write your own, and will eventually for some use cases. The example I gave before was "number of characters".

This helped the concept click a little further, I think. It's like data's wearing a costume, and the zipper's on the inside. 😄

ianterrell commented 1 year ago

I see! So, hashing involves mapping a key to a value, and a Ruby hash is a data collection type that hashes. Right?

I hope I haven't hopelessly confused you. I would drop the idea of keys and values here and simply replace them with input and output. Hash functions take an input from a large domain (e.g. all possible values of everything in the universe) and output a result in a smaller domain (e.g. exactly 128 bits).

Meaning that two separate hashes, containing the same key / value pairs, are held in different spots in memory. So even though one hash can be equal to another in value, it is distinctly a different hashing of its key to that value?

Have I totally butchered that concept? 😅

I may have hopelessly confused you. The term is overloaded in Ruby. Hash the {} datatype is not what I meant in the quoted section. I meant the hash function .hash that is available on Object (and therefore everything else): https://ruby-doc.org/3.2.1/Object.html#method-i-hash

The equality notes are repeated in that documentation:

This function must have the property that a.eql?(b) implies a.hash == b.hash.
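A small hypothetical example of honoring that property, so instances of your own class can be used as Hash keys:

# Logically equal points must report the same hash, per the rule above.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  def hash
    [x, y].hash    # equal coordinates => equal hash value
  end
end

locations = { Point.new(1, 2) => "home" }
locations[Point.new(1, 2)]  # => "home", because hash and eql? agree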

erikamaker commented 1 year ago

I took the last few days to study HTML & CSS to work on my landing page (Odin Project). Alongside this, I pulled open the Ruby documentation and YouTube tutorials, and ran examples of hashes to make sure I understood the foundations for hashing algorithms and the Ruby data collection type {}.

  1. Hashing is a process of transforming a key into a different value, for fast and secure one-way information retrieval.

    a. A very simple example might be {:Michigan => "MI"} (key => value)
    b. There are more secure implementations of assigning a fixed output for encryption purposes
    c. For instance, password hashing often employs a hex output (hexes are readable and easily mapped to individual bits).
    d. Hashes can "collide" when two pieces of data in a hash table share the same value.

A hash collision occurs when two pieces of data in a hash table share the same value. This happens when data collection becomes so large that it's more statistically likely to happen naturally, but sometimes it happens intentionally / maliciously to take advantage of bad algorithms. I also learned a little history of MD5 and the race between processing power and encryption. A hash algorithm should be fast (e.g. for processing large files), but not too fast (to avoid intentional hash collision that can result in stolen data or malicious file transmission).

  2. A hash object in Ruby is a dictionary-esque data collection type that pairs a key with a value.

    a. This data collection type in Ruby is a simple representation of the hashing concept
    b. However, a hashing algorithm can be written without ever using this type (see below, guest starring modulo operator).
    c. (Note that this example is very insecure and shouldn't be in production code.)
def my_hash(string, max)   # named my_hash to match the irb call below
  string.sum % max
end

irb(main):035:0> my_hash("hello there",85555)
=> 1100

There is a LOT more to cover in respect to hashes and how to write proper hashing algorithms for various scenarios. Let me know if this demonstrates better understanding or if I should dive deeper for the purposes of getting through the textbook! If I'm ready to move on, I'm going to, but I didn't want to prematurely pass this concept.

ianterrell commented 1 year ago

If I'm ready to move on, I'm going to, but I didn't want to prematurely pass this concept.

You've got more of the gist of it than you really need to move on, I promise you. Perhaps too much! There are drawbacks of chatting with me, among them I will chat about everything and point out lots, even things that aren't strictly necessary to know at this stage of your learning. (Also, I might occasionally get things wrong!)

You really do have it, so just a few super minor clarifications — don't let them slow you down!

hex output (hexes are readable and easily mapped to individual bits)

You probably know all of this, but it's worth reviewing just in case you don't.

In this case, hex stands for hexadecimal. Like binary is base-2 and decimal is base-10, hexadecimal is base-16. Like binary uses 0 and 1, decimal 0-9, hexadecimal uses 16 characters 0-9 and a-f. Base-16 happens to be a power of two, so it's convenient to go back and forth to binary. It takes 4 bits to store 16 values in binary but one hexadecimal character (2^4=16). e.g. 0000 = 0, 0001 = 1, 1111 = f.

Again, you probably know all of this; if any of that doesn't make immediate sense, it will eventually once you play with it sometime. Just make sure you eventually understand why in binary 111 + 1 = 1000, why in hex a + 1 = b and f + 1 = 10, and why 1010 in binary = a in hex. :)
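Ruby's integer literals and conversion methods are handy for playing with this:

0b1010          # => 10  (binary literal)
0xa             # => 10  (hex literal for the same value)
0b111 + 1       # => 8,  i.e. 0b1000
255.to_s(2)     # => "11111111"
255.to_s(16)    # => "ff"
"ff".to_i(16)   # => 255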

A hash algorithm should be fast (e.g. for processing large files), but not too fast (to avoid intentional hash collision that can result in stolen data or malicious file transmission).

What is "best" in programming (and life?) is always dependent on the use case. How will it be used?

For password hashing, we want it to be slow. Sometimes up to 1s or more on modern devices. A user won't mind a 1s delay logging in (to hash the password sent), but 1s is sufficiently slow that brute force attacks quickly become infeasible.

On the other hand, maybe we don't actually care about collisions! Modulo is a perfectly fine hash function for a variety of use cases. If the Hash datatype doesn't

A hash collision occurs when two pieces of data in a hash table share the same value.

Although there is a really neat parallel between tables and functions, you should think of this in terms of functions. A collision is when two inputs to a hash function have the same outputs.

This data collection type in Ruby is a simple representation of the hashing concept

The data type Hash is not called Hash because it maps inputs to outputs. It's called Hash because it uses the hash value of the keys to map to values. Under the covers it's calling something like key.hash to get an integer value, and then using that integer value to map to a bucket to store a reference to the value.


The only other thing you didn't explicitly touch on (though implicitly you did) is that you need to know that two objects that are logically equal have the same hash, but two objects with the same hash may not be equal (that's a collision). A good thought exercise is making sure you understand why.


Again, don't let this slow you down! Barrel on!

erikamaker commented 1 year ago

Awesome! Thank you for that feedback. I'm going to play around with hex values a bit this week. I'm feeling confident enough to move forward at any rate. Closing this issue :-)