ELVIS-Project / vis-framework

Thoroughly modern symbolic musical data analysis suite.
http://elvisproject.ca/
31 stars 6 forks source link

Send multiple input dfs as a list of dfs, not one big concatenated df. #364

Closed alexandermorgan closed 8 years ago

alexandermorgan commented 8 years ago

The concerned indexers are ngram, dissonance, and figuredbass. For the ngram indexer, this can wait until its pending rewrite, referenced in this issue #360 . This will help the implementation of VIS in Rodan and in some cases will avoid the propagation of NaN values when observations don't occur at the same offsets. In addition, in many cases this may improve performance as indexing gets dramatically simplified. If you're accessing information in an input df that has one column of analysis per part (like the horizontal interval indexer results), then the column index is the part number so integer-based indexing could be used more than label-based indexing. In the interest of predictability and uniformity I recommend putting the inputs dfs in the list in alphabetical order based on the indexer they were produced by.

alexandermorgan commented 8 years ago

@mrbannon, I assigned myself because you can't assign two people, but I was just planning on doing ngram and dissonance, so perhaps you could make this change for the figuredbass indexer. NB: this is relatively low priority at the moment.

crantila commented 8 years ago

I don't understand exactly what this implies, but it sounds a lot like "undo the changes made for VIS 2, but retain the complicated MultiIndex for columns."

Could you provide two tables to illustrate current input vs desired input?

alexandermorgan commented 8 years ago

Sure. Imagine a 3-voice piece that begins with faux bourdon passage. It's Horizontal indexer df of results would look something like this (let's call it horiz_df): print horiz_df

Indexer 'int.HorizIndexer' 'int.HorizIndexer' 'int.HorizIndexer'
Parts '0' '1' '2'
0.0 2 2 2
1.0 2 2 2
2.0 2 2 2

Its vertical interval indexer results would look like this (let's call it vert_df) print vert_df

Indexer 'int.IntIndexer' 'int.IntIndexer' 'int.IntIndexer'
Parts '0,1' '0,2' '1,2'
0.0 'A4' 'M6' 'm3'
1.0 'P4' 'm6' 'm3'
2.0 'P4' 'M6' 'M3'

Right now to use the ngram indexer we would have to pass a "big" df of both of these concatenated: comb_df = pandas.concat([horiz_df, vert_df], axis=1) What I'm suggesting is that instead of combining the individual dfs keep them separate but in a list: list_of_dfs = [horiz_df, vert_df] Either way, the multiIndexes are unchanged so I don't think this is undoing any of the VIS 2.0 progress.

alexandermorgan commented 8 years ago

Christopher, in retrospect I think I understand your point a little better. You're concerned that a "list of dataframes" is not one of the score types permitted in the init method of the Indexer class in indexer.py, right? I think we would just have to add one more type. Is there something I'm missing?

crantila commented 8 years ago

I really didn't understand that at all, but now I think I see the problem you're trying to solve.

My initial plan (years ago) was for indexer output to be cumulative, which is why analyzer names are in the labels. It would look like this:

... and so on.

I had in mind, but never bothered to say or write down anywhere, that the indexers needed to be modified to return DataFrames as above. I recognize that you really hate Nans, but I do think this approach will be less error-prone overall. Allowing lists of DFs as input for an indexer is admittedly a solution that's easier to program into the Indexer classes. But if the Indexer superclass can deal with selecting the appropriate indices from a single inputted DataFrame, that's one more task that indexer subclasses won't get the chance to mess up, and it gets us one step closer to a place where casual programmers can write their own indexers.

If NaNs are a significant motivating factor for this idea, maybe it's time to consider a real solution for that, independently of what's decided here.

alexandermorgan commented 8 years ago

NaN's aren't an issue per se, but I don't think we should generate tons of them for no particular reason. I'm not sure why you would want indexer output to be cumulative. You are aware that this would copy out and increasing amount of data at every step, right? Also, I thought we were going for more modular and segmented architecture. The main reasons for this issue are compatibility with Rodan, simplicity, and to a lesser extent performance.

crantila commented 8 years ago

Oh, I see now. This is really a Rodan-oriented change. It would reduce analyzer complexity at the cost of greater complexity for whatever is using analyzers---but Rodan has already solved that problem. Is Rodan even using InexedPiece?

mrbannon commented 8 years ago

Yes On Sep 16, 2015 4:19 PM, "Christopher Antila" notifications@github.com wrote:

Oh, I see now. This is really a Rodan-oriented change. It would reduce analyzer complexity at the cost of greater complexity for whatever is using analyzers---but Rodan has already solved that problem. Is Rodan even using InexedPiece?

— Reply to this email directly or view it on GitHub https://github.com/ELVIS-Project/vis-framework/issues/364#issuecomment-140874977 .

crantila commented 8 years ago

Cool, good.

I suggest this change belongs primarily in the Indexer superclass, to fight off the situation that was The Disaster of VIS 1, where every indexer had different input and output formats. The Indexer.__init__() method already converts from DataFrame to Series on the fly. All you really need to do is expand the score argument's docs so it accepts a list of DataFrame, and amend __init__() so it can convert list-of-DF to the required_score_type requested by the Indexer subclass.

After that, Rodan can pass a list of DF to any indexers, so allowing list-of-DF and list-of-Series as required_score_type, and revising the indexers to actually use list-of-DF/S, can be done at any later point. Until that happens, admittedly, you're just moving the "mash everything into a single DataFrame" step.

Does this make sense? I honestly can't tell.


If this turns out well, the WorkflowManager could even be rewritten so it's actually easy enough for someone else to modify!

alexandermorgan commented 8 years ago

Yeah, this makes sense and I agree we should just add a list of dfs to the acceptable types in indexer.py.

alexandermorgan commented 8 years ago

This was accomplished with release 2.3.1 (or maybe it was 2.4.1). Now if a list is passed but the score type is still a dataframe, indexer.py checks all the elements of the list to make sure they're all dataframes.