amyjko / foundations-of-information

A book to support the INFO 200 Intellectual Foundations of Information course.

Incorrect definition of Shannon Information / Kolmogorov complexity in Ch3 #24

Closed michaelmunie closed 2 years ago

michaelmunie commented 3 years ago

The following quoted section, as it's currently written, confuses entropy with Kolmogorov complexity. https://en.wikipedia.org/wiki/Kolmogorov_complexity

"For example, the sequence of letters aaaaa has low entropy, conceptually, as it is the same letter appearing multiple times; it could be abbreviated to just 5 a’s. In contrast, the sequence of letters bkawe has high entropy, with no apparent pattern, and no apparent way to abbreviate it without losing its content. Shannon’s view of information was thus as an amount of information, measured by the compressibility of some data.

Another way to think about Shannon’s entropic idea of information is through probability: if we were to observe each character in the sequences above, and make a prediction about the likelihood of the next character, the first sequence would result in increasingly high confidence of seeing another a. In contrast, in the second sequence, the probability of seeing any particular letter is quite low. The implication of these ideas is that the more rare “events” or “observations” in some phenomenon, the more information that is required to represent it."

Specifically, when you say the sequence "could be abbreviated to just 5 a's", that's most definitely using Kolmogorov complexity, which perhaps would merit inclusion in the book too. The reason this example must use Kolmogorov complexity is that you don't make any reference to a random variable that describes the generation of those sequences. Without a prior distribution you must appeal to Kolmogorov complexity; with a prior distribution, the entropy is well defined and thus provides a bound on compressibility.
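
To make that last point concrete, here's a minimal sketch of the "entropy bounds compressibility" idea (the two letter distributions are my own made-up examples, not from the book):

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 26] * 26            # every letter equally likely
skewed = [0.75] + [0.01] * 25      # 'a' very likely, the other 25 letters rare

print(f"uniform source: H = {entropy_bits(uniform):.2f} bits/letter")  # ~4.70
print(f"skewed source:  H = {entropy_bits(skewed):.2f} bits/letter")   # ~1.97
# Once a source distribution is fixed, H is a lower bound on the average
# number of bits per letter any lossless code can achieve: the skewed source
# compresses to about 2 bits/letter, the uniform source not at all.
```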

Also, I don't believe the 2nd paragraph I quoted above is correct either. To follow the reasoning as it's currently written, you need at least some modeled hyperparameter governing the generation of the next letter in the sequence, with a distribution over the values of that hyperparameter itself. If you don't want to complicate the example, one might think we could adjust the 2nd paragraph to say that if a letter sequence were generated by drawing letters uniformly at random, the sequence 'aaaaa' would have low entropy and 'bkawe' would have high entropy since it seems random -- but that's not true! Under that model the two sequences are equally likely, so they carry exactly the same amount of information!
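
Here's a quick numeric check of that claim (my own sketch, assuming an i.i.d. uniform model over the 26 lowercase letters):

```python
from math import log2

def self_information_bits(sequence, alphabet_size=26):
    """-log2 P(sequence) under a uniform, independent letter model."""
    prob = (1 / alphabet_size) ** len(sequence)
    return -log2(prob)

for s in ("aaaaa", "bkawe"):
    print(s, round(self_information_bits(s), 1), "bits")
# Both print 5 * log2(26) ≈ 23.5 bits: to a uniform model, neither sequence
# is any more surprising, or any more informative, than the other.
```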

Perhaps better would be to use an example where you toss a coin x times and count how many heads and tails (like this example: https://courses.lumenlearning.com/physics/chapter/15-7-statistical-interpretation-of-entropy-and-the-second-law-of-thermodynamics-the-underlying-explanation/). Then a set of tosses that is all heads would have very low entropy, and a set of tosses with about equal heads and tails would have high entropy. The key distinction is whether you view it as a sequence or as a set. I think H/T are easier to think of as a set.
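
A sketch of what I have in mind, treating the tosses as a set (a count of heads) rather than a sequence; the numbers are my own illustration:

```python
from math import comb, log2

n = 10  # number of tosses
for heads in (0, 5, 10):
    ways = comb(n, heads)  # how many orderings produce this head count
    print(f"{heads} heads out of {n}: {ways} arrangements (log2 = {log2(ways):.2f} bits)")
# All heads (or all tails) can happen only 1 way, while 5 heads and 5 tails
# can happen 252 ways; the counts, not any particular ordering, carry the
# low-entropy vs. high-entropy contrast.
```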

"Elements of Information Theory" by Cover and Thomas is a book I like on Shannon information

Anyhow, thanks for putting this book up. I've been quite enjoying it so far; just thought I could help on this little detail.

amyjko commented 3 years ago

Thank you for the comment Michael! There's definitely a lot more room for nuance in this section. I'll have to think about what level of nuance is appropriate here, and how to convey it. The book is definitely not trying to teach deep theoretical knowledge of information theory; rather, it's offering concepts that are sufficiently sound for everyday literacy, assuming minimal prior knowledge in mathematics and probability. If you have ideas on how to explain the above distinction without digressing into Kolmogorov complexity, hyperparameters, and prior distributions, I'd love to hear them.

michaelmunie commented 3 years ago

If you wanted to stick to something close to the current form:

"In 1948, well before Buckland and Bateson were theorizing about information conceptually, Claude Shannon was trying to understand information from an engineering perspective. Working in signal processing at Bell Labs, he was trying to find ways of transmitting telephone calls more reliably. In his seminal work, he linked information to the concept of entropy from thermodynamics. Entropy is a measurable physical property of a state of disorder, randomness, or uncertainty. For example, if you tossed a coins four times and every toss came up heads this situation would have low entropy, conceptually, as there's not any disorder in the results. In contrast, if the tosses came up half heads and half tails this would have high entropy, with no apparent pattern and maximum disorder. Shannon’s view of information was thus as an amount of information, measured by the disorder of the results.

Another way to think about Shannon’s entropic idea of information is through probability: in the first example there's only one way to toss four heads, so the probability is low. In contrast, in the second example, you have a high probability of getting half heads and half tails, because you could get those flips in many different ways: two heads then two tails, or one head followed by two tails followed by one head, and so on. The implication of these ideas is that the more rare the “events” or “observations” in some phenomenon, the more information there is."
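
For completeness, a small sketch (my own illustration) that enumerates all 2^4 = 16 outcomes of four tosses and groups them by head count, verifying the counts used above:

```python
from itertools import product
from collections import Counter

outcomes = ["".join(t) for t in product("HT", repeat=4)]  # all 16 toss sequences
by_heads = Counter(seq.count("H") for seq in outcomes)

for heads in sorted(by_heads):
    print(f"{heads} heads: {by_heads[heads]} of 16 outcomes")
# Four heads occurs in exactly 1 of 16 ways, while two heads and two tails
# occurs in 6 of 16 ways (HHTT, HTHT, HTTH, THHT, THTH, TTHH).
```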

michaelmunie commented 3 years ago

You might think about adding a paragraph, or replacing the 2nd paragraph, with an example that illustrates channel capacity. I was thinking something like: a stoplight has a low channel capacity but is perfect for the situation, because when driving you just need to know whether to push the gas or the brakes. Compare that to a street sign, which can have any road name on it and is slower to read, but conveys more information. I couldn't figure out an elegant way to write it, though.
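
Here's a back-of-the-envelope version of that comparison (my own sketch; the count of possible road names is a made-up assumption):

```python
from math import log2

stoplight_states = 3     # red, yellow, green
street_names = 50_000    # hypothetical number of distinct road names a sign could show

print(f"stoplight:   {log2(stoplight_states):.2f} bits per observation")  # ~1.58
print(f"street sign: {log2(street_names):.2f} bits per observation")      # ~15.61
# The sign can carry roughly ten times as many bits, but the stoplight's
# three states are exactly enough for the gas-or-brakes decision and are
# much faster to read.
```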

orcmid commented 3 years ago

I also think compressibility, which Shannon also addressed, is valuable in this context. I don't quite know how it works with something such as pseudo-random generators, whose output sequence is predictable if you know the state.

michaelmunie commented 3 years ago

On pseudorandom number generation, the short answer is that the generator can only encode as much entropy as is passed to it by the seed itself. David Zuckerman at UT Austin is, I think, the expert on this topic if you want to know more.


orcmid commented 3 years ago

The TLDR: I'm uncertain about the statement that "information is entropy." I think Shannon is careful to speak of "the entropy of an information source" and does not collapse entropy and information. He says the entropy is a measure of information, not the information per se.

Further Considerations

I also think compressibility, which Shannon also addressed, is valuable in this context. I don't quite know how it works with something such as pseudo-random generators, whose output sequence is predictable if you know the state.

@mathisonian, yes the seed is a way to know the initial state of the PRNG. On reflection, it is more serious if, with the algorithm known, the state can be inferred from knowledge of the stream. I believe that is why "whitening" of the output is done, usually by some hash process, along with periodic additional seeding based on a seriously-random source. And of course, knowing the seed or any subsequent post-seeded state usually means the PRNG has been penetrated and there are far more things to worry about.

Getting back to information.md, I see that compressibility is mentioned by Amy in conjunction with Shannon. I think a valuable aspect to point out is that entropy is a logarithmically growing value, where 0 means completely predictable; it can be usefully described as freedom of choice. Also, relative entropy is commonly used; it goes from 0 to 1 and represents the degree of choice available relative to a completely random selection on a particular channel of communication. The smaller that value (usually designated H), the more compressible the messages on that channel are.
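
For what it's worth, a small numeric sketch of the relative-entropy idea (the four-symbol distribution is made up for illustration):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

probs = [0.7, 0.1, 0.1, 0.1]         # a skewed four-symbol source
H = entropy(probs)                   # entropy of the source
H_max = log2(len(probs))             # entropy of a completely random choice among 4
relative = H / H_max                 # 0 = completely predictable, 1 = completely random

print(f"H = {H:.2f} bits, H_max = {H_max:.2f} bits, relative entropy = {relative:.2f}")
# Redundancy is 1 - relative entropy; the lower the relative entropy of a
# source, the more compressible its messages are.
```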

It may also be useful to point out that the use of "information" in this context is distinct from and not to be confused with meaning. There's an informative Section 2.2 and footnotes in Warren Weaver's Introductory Notes to the 1963 book, "The Mathematical Theory of Communication" by Shannon and Weaver and containing Shannon's original paper (also available as a Kindle book). Shannon speaks of discrete information sources as Markoff Processes all over the place.

michaelmunie commented 3 years ago

I think the accepted use of the terminology in Shannon's theory is that a random variable can be measured in terms of "entropy", and an outcome of that random variable can be measured in terms of "self-information" (i.e., the self-information of the random variable X is the same as the mutual information between X and X, or in other words the reduction in uncertainty of X due to knowledge of X; another way of thinking of this is that entropy is the expected self-information of a 'message').

https://en.wikipedia.org/wiki/Information_content
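
A tiny sketch of that terminology, using a biased coin as a made-up example:

```python
from math import log2

p = {"heads": 0.9, "tails": 0.1}   # distribution of a biased coin X

self_info = {x: -log2(px) for x, px in p.items()}   # I(x) = -log2 p(x)
H = sum(px * self_info[x] for x, px in p.items())   # H(X) = expected value of I(X)

for x, i in self_info.items():
    print(f"I({x}) = {i:.2f} bits")
print(f"H(X) = expected self-information = {H:.2f} bits")
# The rare outcome (tails, 3.32 bits) carries more self-information than the
# common one (heads, 0.15 bits); entropy averages them: about 0.47 bits.
```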

amyjko commented 3 years ago

Thank you all for the robust discussion! Keep contributing as you see fit. I'll return to this issue in the coming weeks, once I get through my backlog of other polish on chapters 14-19.