Make an index - Githubissues

RobertHarper commented 11 years ago

i can do this or help to do it, but it has to be close to the last thing, otherwise it's a mess to get right. regardless of who does it, we will have to discuss the level of detail and how to manage cross-indexing. i found both of these hard to decide on for my own recently published book, and i could imagine that we may not agree on this.

cangiuli commented 11 years ago

Yes, we decided that one or two people should take full responsibility for this, and only after the rest of the text is finalized.

andrejbauer commented 11 years ago

I've started working on the index. The first thing I made is a little helper script other/index-helper.py. If you run by

other/index-helper.py | less

it will show you all occurrences of all words, file files and character locations. For example, one entry looks like this:

======== adjoint [0]
   ...ume $f$ to be a half adjoint equivalence and sho... [equivalences.tex @ 18299]
   ... \item $F$ is a left adjoint.\label{item:ct:ar1}... [categories.tex @ 49607]
   ...type ``$F$ is a left adjoint'' is a mere proposi... [categories.tex @ 53581]

It is showing you the word "adjoint", its three occurrences with excerpts, files and character locations within the files. The results are listed in order of increasing frequency of the word in the English language (displayed in brackets next to the word). It knows about the 10000 most frequent words. Any word not known is considered to have frequency 0 and will be displayed at the top (and those will be alphabetized).

You can specify regular expressions on the command line to filter the output. Suppose you want to know where we use "invertible", "bi-invertible", "invert", etc. You run:

other/index-helper.py 'invert'

and you get:

======== bi-invertible [0]

   ...f:A\to B$ is \define{bi-invertible} if it has both a l... [equivalences.tex @ 19054]

======== non-inverted [0]

   ...ds that start with a non-inverted element, alternate ... [homotopy.tex @ 105666]
   ...between inverted and non-inverted elements, and end w... [homotopy.tex @ 105719]

Interesting, why did we define 'bi-invertible' and then never use it again? Anyhow, have fun.

mikeshulman commented 11 years ago

Wow, thanks! What is the next step? Should we divide up the things to index between us somehow?

(The reason for defining bi-invertible and then never using it again is explained, I think, in the introduction to chapter 4.)

andrejbauer commented 11 years ago

I am imagining the process as follows:

With a bit of Emacs magic and using some macros, handle \define{...} occurrences to take care of index entries that are definitions.
Create the output with my script.
Manually delete from the output all entries and exceprts which are not to be indexed.
Use Emacs magic to move through the entries that are left, and index them.

But we need to sleep on this. For example, the current output format is not the best for performing Emacs magic.

andrejbauer commented 11 years ago

Mind you, even the first step cannot be automatic. Something like "Dedekind real" will get indexed as "Dedekind real" and "real, Dedekind", etc.

guillaumebrunerie commented 11 years ago

The first thing I made is a little helper script other/index-helper.py.

I can’t see it, did you forget to push?

mikeshulman commented 11 years ago

So what would be the most useful thing for the unadorned macro \define to do with regard to indexing? Just index whatever phrase is given to it? Index each word in the phrase? On May 15, 2013 2:04 PM, "Andrej Bauer" notifications@github.com wrote:

Mind you, even the first step cannot be automatic. Something like "Dedekind real" will get indexed as "Dedekind real" and "real, Dedekind", etc.

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-17966116 .

andrejbauer commented 11 years ago

I created enough of an index to expose any problems we might face. Now is time to look at it and see what sort of strategy we're going to use. I have gone through a bunch of \define's and produces index entries for them.

In general we wish to add a composite phrase in several places. For example, "surjective function" would go under both "surjective -> function" and "function -> surjective".

mikeshulman commented 11 years ago

Yikes! I thought we were going to redefine \define to include an indexing entry, so that we wouldn't have to write all those indexing commands by hand.

andrejbauer commented 11 years ago

If you look at the typical indexing next to a \define I think you will quickly realize that it just has to be done by hand. The benfit of having some \define's do their indexing automagically is close to zero. For example, very often you have to inculde more than one entry.

RobertHarper commented 11 years ago

yeah, unfortunately indexing has to be done by hand. i tried various forms of automation, and quickly realized that there is no alternative to manual labor. on the bright side i found it useful to do it, because it exposed some structural problems that i wouldn't have noticed, particularly when you start to think hard about index subitems and cross-referencing.

mikeshulman commented 11 years ago

Looking through the index entries you've added so far, it appears to me that a significant fraction of the time, one of the entries to be indexed is exactly the term inside the \define{...}, so that some work would be saved by having one form of \define index its argument automatically. But I will bow to those with more (= some) experience creating indices.

mikeshulman commented 11 years ago

It looks like \indexdef does not produce hyperlinked index entries, whereas ordinary \index does. Can that be remedied?

mikeshulman commented 11 years ago

Hmm, also, shouldn't index entries come at the beginning of defn environments rather than the end? If the definition ends up split across pages, we want the index entry to point to its beginning.

mikeshulman commented 11 years ago

A couple more questions for people with more experience creating indices.

What should the criteria be for deciding whether an entry is an \index or an \indexdef? Some of the existing \indexdefs didn't look to me like they should be that way; my inclination would be to only do that for important mathematical definitions, not for say the induction principle of cartesian products or the introduction of the word "context" in chapter 1, even though the latter are in \defines.
What should the criteria be for deciding how much to duplicate index entries and when to do a cross-reference? For instance, we have both "eta-conversion" and "conversion, eta-" pointing to the same page, rather than one cross-referencing to the other. We also had the same with "induction principle" and "eliminator", and likewise "recursion principle", "recursor", and "non-dependent eliminator", but I figured those were too much (since they each contain many sub-entries) and make the latter ones cross-reference the former. Is it reasonable to cross-reference when there are subentries but not otherwise? Should we always cross-reference and force people who look up "eta-conversion" to flip over to "conversion, eta-" before getting a page reference?

andrejbauer commented 11 years ago

I would say, first of all, that the index is not a mathematical construct and it does not have to be consistent. Of course it shouldn't be a mess either.

When indexing a term such as "weak equivalence" I generally ask myself "Where do people exepect to find this in the index?" And then I place it everywhere that they might look for, i.e., it is silly to "save space" in the index, or to not include entries on the grounds of "consistency". People don't care about that, they just want to find whatever they're looking for, and they won't "study the system" by which we constructed the index.

For \indexdef entries, I put those in whenever there was a \define, but maybe that wasn't the best strategy. I think in general you put in an \indexdef for the main entry, whatever that is (definition, where the discussion can be found). Not every entry has an \indexdef.

mikeshulman commented 11 years ago

I think it would be bad if, say, all types had an index entry under 'induction principle' but only 75% of them had an identical entry under 'eliminator'. That's what I meant by 'consistency'. On May 21, 2013 11:14 PM, "Andrej Bauer" notifications@github.com wrote:

I would say, first of all, that the index is not a mathematical construct and it does not have to be consistent. Of course it shouldn't be a mess either.

When indexing a term such as "weak equivalence" I generally ask myself "Where do people exepect to find this in the index?" And then I place it everywhere that they might look for, i.e., it is silly to "save space" in the index, or to not include entries on the grounds of "consistency". People don't care about that, they just want to find whatever they're looking for, and they won't "study the system" by which we constructed the index.

For \indexdef entries, I put those in whenever there was a \define, but maybe that wasn't the best strategy. I think in general you put in an \indexdef for the main entry, whatever that is (definition, where the discussion can be found). Not every entry has an \indexdef.

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-18259319 .

mikeshulman commented 11 years ago

I want to work on the index, because it needs to get done in the next week or so and no one seems to be working on it. So far I have been fixing things that seem wrong, and thereby getting drawn into doing other things out of order, but I should start being more systematic. I guess I will start with your step 1 by looking for all \defines, and then have a look at your helper script.

mikeshulman commented 11 years ago

Why is a phrase like "2-category" alphabetized as ".category" rather than "2-category"?

awodey commented 11 years ago

Could be like:

Category, 2-, bi-, pre-,

etc.

Sent from my iPhone

On May 22, 2013, at 1:18 PM, Mike Shulman notifications@github.com wrote:

Why is a phrase like "2-category" alphabetized as ".category" rather than "2-category"?

— Reply to this email directly or view it on GitHub.

mikeshulman commented 11 years ago

I indexed a bunch of definitions: 69001947

mikeshulman commented 11 years ago

It also looks odd for the alphabetization to go \dagger-category, \modal-connected, \dagger-precategory, \modal-truncated.

andrejbauer commented 11 years ago

I think the idea is that symbols all get bunched together at the beginning of the index, so they're all aphabetized as if they were a character that comes before all letters. If we put $\dagger$-category under one letter and $\moda$-category, under another, nobody will ever find them.

mikeshulman commented 11 years ago

I agree that all symbols should come before all letters; I think that's fairly standard. I was saying I think it's weird to treat all symbols as equivalent -- I would expect \dagger-category and \dagger-precategory to be next to each other. I think I would also expect numbers to come in between symbols and letters (in numerical order). On May 22, 2013 9:03 PM, "Andrej Bauer" notifications@github.com wrote:

I think the idea is that symbols all get bunched together at the beginning of the index, so they're all aphabetized as if they were a character that comes before all letters. If we put $\dagger$-category under one letter and $\moda$-category, under another, nobody will ever find them.

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-18322319 .

mikeshulman commented 11 years ago

I enhanced the index-helper script a bit: e92a432. In particular, it was missing a lot of words because the 20-character-on-each-side context was part of the regexp match, and finditer doesn't allow overlapping matches.

Also, I'm curious: exactly which word frequency list on wiktionary did this come from? I'm puzzled by the "words" like mr|mr, u, blockquote, dr, and sn.

mikeshulman commented 11 years ago

I did the 'A's: cdec2aa302.

awodey commented 11 years ago

looks great! can you give us some instructions for how you did it ? if we each take few letters, we could be done soon. maybe we should "check out" the letters while we're working on them, so we don't overlap too much?

On May 24, 2013, at 2:25 AM, Mike Shulman notifications@github.com wrote:

I did the 'A's: cdec2aa.

— Reply to this email directly or view it on GitHub.

mikeshulman commented 11 years ago

I pushed a little Emacs script that I've been using, with instructions: e0c0604. I'm not sure whether or not it's actually helpful over just bouncing through the files by hand with M-x tags-search, but you can try it out if you like.

Checking out letters before starting to work on them seems like a good idea.

mikeshulman commented 11 years ago

I'm going to start on the common words (the un-alphabetized ones that come after "zy" in the output of Andrej's script).

andrejbauer commented 11 years ago

Whoa, your emacs script is really cool. I will try letter B now. How are we going to avoid merge problems?

mikeshulman commented 11 years ago

Cross our fingers and hope, and commit and pull often, especially before and after making more global sorts of changes? On May 24, 2013 9:56 AM, "Andrej Bauer" notifications@github.com wrote:

Whoa, your emacs script is really cool. I will try letter B now. How are we going to avoid merge problems?

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-18417053 .

DanGrayson commented 11 years ago

... and edit individual lines of text without rewrapping entire paragraphs ...

On Fri, May 24, 2013 at 3:04 PM, Mike Shulman notifications@github.comwrote:

Cross our fingers and hope, and commit and pull often, especially before and after making more global sorts of changes? On May 24, 2013 9:56 AM, "Andrej Bauer" notifications@github.com wrote:

Whoa, your emacs script is really cool. I will try letter B now. How are we going to avoid merge problems?

— Reply to this email directly or view it on GitHub< https://github.com/HoTT/book/issues/94#issuecomment-18417053> .

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-18423103 .

mikeshulman commented 11 years ago

Should Greek letters be indexed as symbols (i.e. at the beginning, before roman letters) or according to the spelling of the letter (i.e. '\lambda-abstraction' in between 'kernel' and 'law')?

mikeshulman commented 11 years ago

And why does \indexdef produce a different index entry than \index when given the same argument? See \lambda-abstraction, for instance.

mikeshulman commented 11 years ago

An arbitrary sampling of a dozen or so math books on my bookshelf suggests that many don't try to index terms that start with Greek letters at all, but of those that do, a majority index them according to the spelling of the letter. I only found two which indexed the Greek letters with the symbols at the beginning -- and I found one which ignored the Greek letter entirely, indexing '\lambda-abstraction' as if it were simply 'abstraction'!

mikeshulman commented 11 years ago

I recently discovered that you can index ranges as well as individual pages. In case anyone else was unaware of this, type \index{thing|(} at the beginning of the range and \inde{thing|)} at the end, and the index entry will span the appropriate pages. If you want to get the effect of an \indexdef also with this construct then you have to type explicitly \index{thing|(defstyle}.

Also, it seems that for now, the best way to avoid duplicate entries for entries containing symbols is to index them with defstyle explicitly, e.g. write \index{lambda-abstraction@$\lambda$-abstraction|defstyle} rather than \indexdef{lambda-abstraction@$\lambda$-abstraction}. There aren't that many such entries, so this shouldn't be so much of a problem.

mikeshulman commented 11 years ago

I am done with the unalphabetized common words. Tomorrow I'll check out some more letters and get to work on them. Tomorrow is also basically the only day I have left to work on this before our self-imposed deadline, so it would be nice to have some help working on the index...

DanGrayson commented 11 years ago

I'm not sure I have any time to spare, but, if I do, how would I check out a letter?

mikeshulman commented 11 years ago

Just write in a comment here that you're going to do it.

mikeshulman commented 11 years ago

Andrej (and everyone else), can you please check out the letters you're currently working on (just post a comment here before you start working on them) so that no one else tries to do them at the same time?

DanGrayson commented 11 years ago

I see commits from Andrej for T U V W X Y Z. One can surmise he's working on S now.

I'll take L.

andrejbauer commented 11 years ago

I am not working on S. I am playing with \hbox \vbox magic to make the lulu.com cover. I am failing (see 57fb7909cdb93990f846b36148c280beac8b8bd7).

mikeshulman commented 11 years ago

Everyone, please before you index something, have a look at how it is already indexed. If there is an \indexsee{A}{B} command somewhere, then in general, other occurrences of A should be indexed as B, not as A. I've just fixed this for cartesian products (1c4296f).

mikeshulman commented 11 years ago

I have a few minutes so I'm going to try indexing G.

andrejbauer commented 11 years ago

Who is going to index the F words :-) ?

mikeshulman commented 11 years ago

Now that you've said that, I think you have to do it.

I'm going to try H.

DanGrayson commented 11 years ago

:-)

That reminds me: what jokes are we going to insert into our index? Does anyone have any suggestions?

On Thu, May 30, 2013 at 5:16 PM, Andrej Bauer notifications@github.comwrote:

Who is going to index the F words :-) ?

— Reply to this email directly or view it on GitHubhttps://github.com/HoTT/book/issues/94#issuecomment-18709045 .

mikeshulman commented 11 years ago

The best one I've been able to think of so far is to have entries for denial, anger, bargaining, depression, and acceptance pointing to appropriate sections.

We could also try putting in unintentional type theory, although I'm not sure what to make it refer to.

mikeshulman commented 11 years ago

H is done, 3d6c80a5

HoTT / book

Make an index #94