RFC: Proposal for a New N-Gram Indexer

crantila commented 9 years ago

I have two goals in mind: to use pandas in the best way possible; and to build musical/statistical flexibility/credibility. We can accomplish this by doing two things: fully realizing the "k-part" aspect, and separating n-gram finding from n-gram formatting. First I'll explain the problem a bit more, then my proposed solution.

The idea of "k-part anything-grams" stems back to Jamie, who saw that an n-gram-finding algorithm could be used for much more than intervals alone. Our commonest n-grams so far are 2-part: they "chain together" a Series of "vertical intervals" with a Series of "horizontal intervals." Sometimes we've looked at 5-part n-grams, chaining four Series of vertical intervals with a Series of horizontal intervals. But the "horizontal" and "vertical" terminology is (or at least was) artificially limiting how the team's musicologists were thinking of how to analyze musical patterns, which in turn artificially limits (or at least used to limit) the programmers, which in turn limits the musicologists, and on and on. With pandas-powered VIS, and especially with VIS 2, the NGramIndexer can and should be detached from the directions completely.

To put it another way, the directional aspect of VIS's interval n-grams is a point of human interpretation that I wrongly made into a technical limitation. The fact that we talk of musical n-grams as fundamentally different from those in any other field is harmful to our credibility, to our ability to reuse existing research, and more importantly to our ability to show off our "multi-dimensional n-grams," which may in fact represent an important innovation in n-gram-based research. The "horizontal" bit fits into this: the 7 1 6 -2 8 3-gram is really grouped by the NGramIndexer like [7 1] [6 -2] [8 END], which is a proper and complete 2-part 3-gram, and it's only in the formatting stage that the END token is omitted.

More than simply building credibility, thinking of all musical n-grams as proper n-grams opens additional doors, especially combined with an arbitrary number of dimensions. We could make a 3-part 3-gram by including the pitch class of the lower voice: [7 E 1] [6 E -2] [8 D END]. This is already sort of possible, but we would have to tell the NGramIndexer that the input Series with pitch classes is a "vertical" Series. This hurts most when it comes to formatting, since the current NGramIndexer can only distinguish between two dimensions. The 3-part 3-gram above could be printed out as [7 E] 1 [6 E] -2 [8 D] or 7 E 1 6 E -2 8 D but not 7 (E) 1 6 (E) -2 8 (D) or 7 1 (E) 6 -2 (E) 8 (D) or anything else that might more helpfully indicate that there are three dimensions of meaning here: vertical intervals, horizontal intervals, and pitch classes.
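To make the formatting point concrete, here is a minimal sketch of a formatter that treats every part the same way and leaves the layout to per-part templates. The function name `render_ngram` and its parameters are hypothetical, not part of VIS.

```python
def render_ngram(moments, templates, order=None, end_token="END"):
    """Render a k-part n-gram as a string.

    moments   -- list of tuples, one tuple of k values per "moment"
    templates -- one format string per part, e.g. "({})" for parentheses
    order     -- optional sequence of part positions giving the print order
    """
    order = range(len(templates)) if order is None else order
    tokens = []
    for moment in moments:
        for position in order:
            value = moment[position]
            if value != end_token:  # the END token is only dropped here
                tokens.append(templates[position].format(value))
    return " ".join(tokens)


# The 3-part 3-gram from the text: (vertical, pitch class, horizontal).
moments = [("7", "E", "1"), ("6", "E", "-2"), ("8", "D", "END")]
templates = ["{}", "({})", "{}"]

pitch_class_second = render_ngram(moments, templates)
pitch_class_last = render_ngram(moments, templates, order=(0, 2, 1))
```

Because the layout lives entirely in `templates` and `order`, adding a fourth dimension would need no change to the function itself.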

The thing is, we don't really know which musical elements are the most important, and it would serve us well to be thinking with multi-dimensional feature sets, as audio-centric MIR researchers do. This is much easier with a proper indexer for k-part n-grams. Consider this still-rudimentary example, where I have a DataFrame called ngrams that's full of 3-part 3-grams where each moment has a vertical interval, the pitch classes of the upper and lower voices involved, and a horizontal interval. Hypothetically, I can access the 42nd n-gram in the piece, as a whole, with ngrams.iloc[41]. I can access the second "moment" in that n-gram with ngrams.iloc[41].iloc[1]. I can access the pitch class of the higher voice at that "moment" with ngrams.iloc[41].iloc[1].loc[('NoteRestIndexer', '0')].
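One way such a DataFrame might be laid out is sketched below. The three-level column index (moment, indexer name, part label), the `HorizontalIntervalIndexer` name, and the sample values are all assumptions made for illustration. The sketch also selects a moment by label with `xs` rather than by position as in the text.

```python
import pandas as pd

# Hypothetical features recorded at every "moment" of an n-gram.
features = [
    ("IntervalIndexer", "0,1"),          # vertical interval
    ("NoteRestIndexer", "0"),            # pitch class, upper voice
    ("NoteRestIndexer", "1"),            # pitch class, lower voice
    ("HorizontalIntervalIndexer", "1"),  # horizontal interval, lower voice
]
columns = pd.MultiIndex.from_tuples(
    [(moment, name, part) for moment in range(3) for name, part in features],
    names=["moment", "indexer", "part"],
)

# One row per n-gram; the values are invented sample data.
rows = [
    ["7", "D", "E", "1", "6", "C", "E", "-2", "8", "D", "D", "END"],
    ["6", "C", "E", "-2", "8", "D", "D", "1", "3", "F", "D", "END"],
]
ngrams = pd.DataFrame(rows, columns=columns).sort_index(axis=1)

whole = ngrams.iloc[0]                      # the first n-gram, all moments
second_moment = whole.xs(1, level="moment") # its second "moment"
upper_pc = second_moment.loc[("NoteRestIndexer", "0")]
```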

This opens the door to a wide range of new indexers and experimenters because it's possible to recover information from previous indexers that would currently need some clever coordination between multiple indices. I don't think this sort of functionality would be used right away, but this seems like one of those situations where we have to add functionality before anyone knows how to use it. Moreover, because this feels (to me) like a better strategy for gathering n-grams anyway, there's no reason we shouldn't be outputting this complicated DataFrame straight from the NGramIndexer just in case someone else can think of a good way to use it.

You might be thinking it's bad to separate the NGramIndexer and its formatter, because it means more computation time. While I concede that's a disadvantage, I think the benefits are worth it. We can access the "raw" n-gram data for new types of experiments; we can separate two fundamentally different tasks for better abstraction; and we can make a wide variety of formatters much more easily.

One such formatter, which Jon has requested numerous times, could be called VocabularyIndexer. It would map every n-gram onto a set of characters, so that 7 1 6 -2 8 might become a and 3 2 3 -2 3 might become b, for example. We could figure out a way to feed it an existing vocabulary from a previous run too. Maybe it could remove n-grams that don't meet certain criteria (e.g., removing interval 3-grams that don't have a dissonance in the second "moment") or that aren't in the existing vocabulary.
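A rough sketch of the vocabulary idea, assuming the n-grams arrive as a Series of already-formatted strings. `build_vocabulary` and the `w0`, `w1` label scheme are hypothetical choices, not an existing VIS API.

```python
from itertools import count

import pandas as pd


def build_vocabulary(ngrams, existing=None):
    """Map each distinct n-gram to a short label.

    Labels from `existing` are reused. New labels continue the numbering,
    which assumes `existing` was produced by this same function.
    """
    vocab = dict(existing or {})
    next_id = count(len(vocab))
    for gram in ngrams:
        if gram not in vocab:
            vocab[gram] = f"w{next(next_id)}"
    return vocab


first_run = pd.Series(["7 1 6 -2 8", "3 2 3 -2 3", "7 1 6 -2 8"])
vocab = build_vocabulary(first_run)
encoded = first_run.map(vocab)

# A later run can reuse the vocabulary, or keep only n-grams already in it.
second_run = pd.Series(["3 2 3 -2 3", "6 -2 8 1 8"])
known_only = second_run[second_run.isin(list(vocab))]
extended = build_vocabulary(second_run, existing=vocab)
```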

Another possible formatter would produce proper figured bass signatures. I feel like one of the biggest limitations in Prof Rusch's "chord n-gram" research was that the NGramIndexer effectively produces a voice-leading n-gram, rather than the figured-bass n-gram she wanted to use. Now don't get me wrong: voice leading is a very important phenomenon, and Prof Rusch used VIS to great effect for this purpose, but what she really wanted all along was figured bass signatures.

Both of these formatters are technically possible already, but they would have to parse the current NGramIndexer output into its components before reformatting it. These formatters would be much easier to write if the NGramIndexer instead produced "raw n-grams."

crantila commented 9 years ago

@alexandermorgan

alexandermorgan commented 9 years ago

You've got a lot of great ideas. I'd like to add a few points:

1) The rewrite should avoid successive DataFrame indexing like the plague. By this I mean doing things like df[0][2][3] or df.iloc[2].iloc[4].loc[('offsetIndexer', '2')]. Instead we should pass tuples to cut through MultiIndexes in one move, and use .at[] and .iat[] instead of .loc[] and .iloc[] whenever possible.

2) When building and formatting ngrams, at each step in the loop we should add the first unit's vertical items and the first unit's horizontal items linking it to the second unit. Then at the end, if necessary, add the singleton vertical event. So with the famous 7 1 6 -2 8 3-gram, it would get built like this:

7 1
7 1 6 -2
7 1 6 -2 8

This would facilitate the creation of "continuous" n-grams and would allow us to stop using horizontal indexer results that were calculated with horiz_attach_later = True.

3) I like the idea of allowing for an arbitrary number of dimensions, though if I'm not mistaken, even if we have 3 or more dimensions, each one of those dimensions will necessarily consist of either vertical or horizontal events. To take up your 3-dimensional example from above that combined vertical intervals, horizontal connecting intervals, and pitch classes, the added dimension of pitch classes is another vertical type of dimension. So we should be able to add arbitrary numbers of vertical and horizontal dimensions, but we should recognize that all possible dimensions in music analysis will be of either a vertical or a horizontal nature. An example of another horizontal dimension that would be of interest would be tracking the horizontal motions of a second voice, or the results of the (currently imaginary) text indexer, which would say whether the syllable (or word) of the first unit of an ngram is continued to the next unit, or if it changes.
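As an illustration of the first point, the sketch below contrasts chained indexing with a single scalar lookup. The column labels and values are invented for the example.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("IntervalIndexer", "0,1"), ("NoteRestIndexer", "0")]
)
df = pd.DataFrame([["7", "D"], ["6", "C"], ["8", "D"]], columns=columns)

# Chained: builds an intermediate Series for the row, then indexes again.
chained = df.iloc[1].loc[("NoteRestIndexer", "0")]

# One move: a row label plus the full column tuple, as a scalar lookup.
by_label = df.at[1, ("NoteRestIndexer", "0")]

# The positional equivalent: row 1, column 1.
by_position = df.iat[1, 1]
```

All three expressions return the same value; the second and third skip the intermediate object.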

crantila commented 9 years ago

It sounds like you're not quite grasping my primary goal here, so I'll give an example.

My point is that musicological concerns shouldn't even enter the discussion about how to find n-grams (formatting is a different matter). I would prefer to phrase your musicological assertion like this: all possible inputs to the NGramIndexer will either have the last observation automatically set to NaN or they won't. Everything else can be left to the formatting step.

The way I suggest building n-grams in the NGramIndexer is approximately like this:

A | B
=====
7 | 1

A | B
=====
7 | 1
6 | -2

A | B
=====
7 | 1
6 | -2
8 | 5

A | B
=====
7 | 1
6 | -2
8 | NaN

It's only the formatter that needs to know "put A before B and omit the NaN."
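The split between finding and formatting might look roughly like the sketch below. It assumes the input is a DataFrame with one column per part. `find_ngrams`, `format_ngram`, and the `open_ended` parameter are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd


def find_ngrams(observations, n, open_ended=()):
    """Return every window of n consecutive rows as a raw n-gram.

    Columns listed in `open_ended` get NaN as their last observation.
    Nothing here knows whether a column is "vertical" or "horizontal".
    """
    raw = []
    for start in range(len(observations) - n + 1):
        window = observations.iloc[start:start + n].reset_index(drop=True)
        if open_ended:
            window.loc[n - 1, list(open_ended)] = np.nan
        raw.append(window)
    return raw


def format_ngram(raw, order=("A", "B")):
    """Put the columns in `order` within each row and omit any NaN."""
    tokens = [
        str(raw.at[row, column])
        for row in raw.index
        for column in order
        if pd.notna(raw.at[row, column])
    ]
    return " ".join(tokens)


observations = pd.DataFrame(
    {"A": ["7", "6", "8", "3"], "B": ["1", "-2", "5", "2"]}
)
raw_ngrams = find_ngrams(observations, 3, open_ended=("B",))
formatted = [format_ngram(raw) for raw in raw_ngrams]
```

Only `format_ngram` carries the "A before B, drop the NaN" rule, so a different formatter can consume the same `raw_ngrams` unchanged.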

It's entirely possible that we'll run into an obstacle that prevents us from doing this. However, my guess is that the "we are chaining together different musical dimensions" perspective is costly both algorithmically and for the social reasons described above.

alexandermorgan commented 9 years ago

I don't see the benefit of adding NaNs just to take them away. We could have the inner loop (the one that loops through each n-gram) loop over range(n-1). In your example above that would leave you with 7 1 6 -2. Then you add the last vertical observation and you're done if you want "standard" n-grams. If you want "continuous" n-grams you loop over range(n-1+1), i.e. range(n), and then you don't add a final vertical observation, but the loop itself remains unchanged, which I think is nice. I'm a little unsure of the idea of separating finding ngrams from formatting ngrams, because that seems to unnecessarily elevate the concept of an ngram. I think ngrams are what we format them to be.
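The loop described above could be sketched like this, assuming plain lists of string observations. `build_ngram` and its arguments are hypothetical, not the eventual indexer code.

```python
def build_ngram(vertical, horizontal, start, n, continuous=False):
    """Build one n-gram string beginning at position `start`.

    Standard n-grams loop over range(n - 1) and then append the final
    vertical observation. Continuous n-grams loop over range(n) and
    append nothing afterwards.
    """
    steps = n if continuous else n - 1
    parts = []
    for offset in range(steps):
        parts.append(vertical[start + offset])
        parts.append(horizontal[start + offset])
    if not continuous:
        parts.append(vertical[start + n - 1])
    return " ".join(parts)


vertical = ["7", "6", "8"]
horizontal = ["1", "-2", "5"]

standard = build_ngram(vertical, horizontal, 0, 3)
continuous = build_ngram(vertical, horizontal, 0, 3, continuous=True)
```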

alexandermorgan commented 8 years ago

The new ngram indexer should be able to distinguish between asking for multiple analyses to be combined (e.g. 'all pairs' in a four-voice piece) and asking for different numbers of voices (i.e. 2, 3, 4) in the analysis. Both of these concern the "vertical" component, but at different stages of the analysis. Right now it is unclear how to make requests for multiple combinations at once, so I have to depend on the WorkflowManager magic, but this isn't as flexible as I need it to be.

crantila commented 8 years ago

As it exists now, you have to use the WorkflowManager to get multiple part combinations of n-grams because that functionality doesn't exist in the NGramIndexer. If you want to customize the part-combination selection, the method to copy and modify is this one.

In the future, should/could the NGramIndexer grow the functionality to produce multiple part-combinations of n-grams at once? Sure. I'm a little concerned about making that indexer (and its settings) even more complicated, but this might help to simplify other parts of the program.

But remember the cardinal rule of indexers: take one piece's data as input, do some analysis, add a single new index of data as output.

(If you're running an indexer on multiple pieces, use AggregatedPieces, which calls the indexer with one piece at a time. If you're combining results from multiple pieces into a single DataFrame, the only analyzer that currently does this is the ColumnAggregator. Yes, it needs to be documented better.)

alexandermorgan commented 8 years ago

Release 2.4.1 provides the new_ngram indexer and deprecates the previous one. The old ngram indexer will be removed in version 3. Also @mborsodi is making progress with the windexer which has the main functionality that @crantila was mentioning above.

ELVIS-Project / vis-framework

RFC: Proposal for a New N-Gram Indexer #360