Multiindex necessary? - Githubissues

musicnerd commented 7 years ago

I was talking with Alex M. today about some annoyances that arise seemingly out of VIS's use of the multiindexed dataframes. Although Alex explained how the multiindex can be used, it seems that nobody is really making use of it. That is, everything most users do, they could do with just simple (non-multiindexed) pandas dataframes.

Getting rid of the extra "level" would make a lot of the indexing and renaming much simpler, no? Also, by removing the multiindex level, newer users to VIS would have an easier time manipulating objects and using the tools since there are a lot of tutorials, etc. for using pandas dataframes (but some of the manipulations as shown in those tutorials don't work within VIS because of the multiindexed dataframe). Thoughts?

musicnerd commented 7 years ago

Well for now, yes all 5 users. Do you use the multiindex? Simpler is always better exactly! So get rid of the multiindex! I'm not suggesting that things can't be done, it's just confusing for new users (something that will presumably be an issue in the future if we ever want other people to use it) and re-naming is a pain. I'm just trying to figure out why it was set up like this in the first place and why it remains necessary (and/or how hard it would be to get rid of it and keep VIS's functionality)

alexandermorgan commented 7 years ago

Perhaps @crantila could speak to the logic behind the multi-index in its original conception. The utility of labelling data is sort of obvious but there are a couple of new directions that we have been taking with VIS that make the columnar multi-index either increasingly troublesome or unnecessary.

It is more troublesome than before because if you want to filter one dataframe based on the values in another dataframe, you have to first remove the multi-indecies of both dataframes. We have only recently started to do this type of conditional filtering quite a lot. For example, say you have the noterest results of a piece saved to the variable nr, and the durations of those notes and rests saved to the variable dr. Both nr and dr are dataframes, and they are the same shape. But since their multi-indecies don't match (because the first level of nr is 'noterest.NoteRestIndexer' and the first level of dr is 'meter.DurationIndexer') you can't filter the notes for their durations until you remove or index through the first level of both multi-indecies. So, specifically if you were looking for all the instances of 'C4' that last at least a half note (2), you have to run this line of code:

nr['noterest.NoteRestIndexer'][nr['noterest.NoteRestIndexer'] == 'C4'][dr['meter.DurationIndexer'] >= 2]

But, if we didn't have multi-indexed columns, that same line of code would look like this:

nr[nr == 'C4'][dr >= 2]

@mborsodi , you were using conditional filtering quite a bit in your final projects, would you care to weigh in on this issue?

Ok, so that was the ease-of-use consideration. In VIS 3, the multi-index will not be needed as much as it was before because most results will be stored as values in a dictionary of analyses stored as an attribute of an indexed_piece. Previously this caching was only done for the noterest indexer results. Since the key that you use to access a dataframe in this analysis dictionary via indexed_piece.get_data() is the name of the indexer in question, the label is still there, just not in the same place. It's not as if we have a bunch of dataframes just floating around without knowing what they correspond to. In fact, the dictionary keys are more precise because in addition to showing what indexer they were run with, the dataframes can be linked to the piece they were run on since they are associated with a specific indexed piece.

alexandermorgan commented 7 years ago

This is related to #394 .

crantila commented 7 years ago

In fact, the MultiIndex columns are the simplest solution. What nobody remembers is the problems they solve. Multi-indexed columns allow the results of every Indexer used in an analysis to be stored together, which is better for remotely caching results. Multi-indexed columns allow every single Indexer to have the same data structure for input and output: a single DataFrame. Finally, multi-indexed columns avoid ever having to ask "but where did these data come from?" Just look at the DataFrame and you can see every step along the way.

VIS isn't optimized to be easy to learn or use. If something you found on StackOverflow doesn't work with VIS, too bad, it's your responsibility to understand what you're doing and how to do it properly. If you don't like typing the long, namespaced indexer names, too bad, they aren't there for your sake to begin with.

It's interesting for me to watch how VIS evolves over time. Before it was a Framework, VIS was a set of music21 scripts. As the GUI grew, the scripts evolved into an MVC-like structure where the data manipulation was still poorly-structured music21 scripts (VIS 1 through 9). As we dreamed of the web and analyses beyond counterpoint, we needed not specific to one UI, so we adopted pandas (VIS Framework 1.0). As we gained engineering experience and realized that nobody was writing and running VIS scripts by hand, we moved toward a more robust, highly predictable, but less flexible structure (VIS Framework 2.0).

Times have changed. RODAN learned how to handle things other than OMR, and now the robustness and predictability happens there. Unlikely though it was, more people started writing VIS scripts. Caching on remote servers has failed to materialize, but there's a new caching plan. Even though we've learned how to handle NaN values, now it's just a pain in the ass.

Multi-indexed columns were the simple solution. The situation has changed and those problems are solved in different (significantly more complicated) ways now, so the column names end up looking bizarre and superfluous.

Let's shed a single tear: multi-indexed columns should go.

alexandermorgan commented 7 years ago

@crantila , your post is both informative and hilarious. Bravo!

ahankinson commented 7 years ago

The situation has changed and those problems are solved in different (significantly more complicated) ways now...

Could you explain this a bit more?

crantila commented 7 years ago

@ahankinson stop teasing, surely people have other things to do than write extremely verbose github comments, like writing documentation... :stuck_out_tongue:

That's a strange way to say "please help us by working on the API!"

@ahankinson VIS Framework was designed to run on a server as the primary thing in charge. That was the biggest influence on architectural decisions like the multi-indexed columns. Since I left, there's been a trend to decentralize complexity: server-side data flow is handled by RODAN; data visualization by VIS-Ualizer; nobody even remembered WorkflowManager or AggregatedPieces; for some reason there are extensions for both Max/MSP and PureData; and incomprehensibly the melodic search is being implemented in an extension instead of as Indexers.

For the most part this means VIS itself can be simplified, but not always. Some of these changes are obviously better design decisions, some of them seem motivated by other concerns, but each one of them contributes to significantly more complicated overall solutions.

crantila commented 7 years ago

Cleverly edited response. Snarky comments in produce snarky comments out. I'm sorry for criticizing decisions I don't understand. The point remains: the benefits you're seeing come at the cost of higher overall system complexity.

If you're looking for substantial effort, I am available for contract work. Otherwise I would be happy to help with specific tasks.

alexandermorgan commented 7 years ago

@crantila thanks for your comments. I think you make a good point about how the current solution of keeping the indexer and class information in the columnar multi-index is the simplest solution as far as the program architecture is concerned. So the ease of use we're looking for with the removal of this multi-index does come at the cost of somewhat more complexity in the backend. I hadn't thought of it that way but I agree with it. More generally, @crantila your feedback and/or critiques on this and any other topics are always welcome and appreciated. With respect to the dictionary of analyses I mentioned in my comments above, it already appears on alex_devel but I won't push it to develop until I'm done fixing tests, adding new ones, and documenting the changes. Only once those changes are resolutely squared away would it be appropriate (in my opinion) to remove the first level of the columnar multi-index.

ahankinson commented 7 years ago

Thanks, Christopher. I was genuinely interested in your response to my question, and not trying to provoke anything. I think the intention for the melodic search is that it will be rolled into an indexer, but it's just sitting outside VIS while we figure it out.

ELVIS-Project / vis-framework

Multiindex necessary? #410