WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Natsorting pandas dataframes #523

Open scottkleinman opened 7 years ago

scottkleinman commented 7 years ago

Most of the time we don't need to do this, but occasionally it's necessary. For instance, if we use the head() function to grab the top N rows of a dataframe in which terms are sorted by counts or frequencies, the rows excluded at the cut-off point may not be the ones the user expects if the user is thinking alphabetically. This will be a very rare occurrence, but it could happen.

It would be nice to figure out a way to implement natsorting of pandas dataframes, but it is not straightforward. The place to start is the note in the natsort documentation.
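For reference, here is a rough sketch of the kind of thing the natsort note describes (the data and column names are made up, and the key= approach assumes pandas >= 1.1 with natsort >= 7.1):

```python
import pandas as pd
from natsort import natsorted, natsort_keygen

# Made-up term counts; plain string sorting puts "term10" before "term2".
df = pd.DataFrame({"term": ["term10", "term2", "term1"], "count": [5, 3, 8]})

# Option 1: reindex on a natsorted version of the terms (works on older pandas).
df_nat = df.set_index("term").reindex(index=natsorted(df["term"]))

# Option 2: newer pandas accepts a key function, and natsort provides one.
df_nat_key = df.sort_values("term", key=natsort_keygen())
```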

mleblanc321 commented 7 years ago

i've lost the sequence of data structures as we proceed thru workflow; cheng, is this in any of your docs? should we start again?

czhang03 commented 7 years ago

I don't think this is in my docs. I guess we can do some investigation on that this summer.

scottkleinman commented 7 years ago

Is it here?

czhang03 commented 7 years ago

in fact, I am not even very sure why and where we used natsort...

scottkleinman commented 7 years ago

Wow! I think I commented on @mleblanc321's query before I had coffee and thought he was asking about the order of scrubbing. Totally off base.

Here's a better answer. In Tokenizer, the DTM numpy array is pulled into a pandas dataframe, which allows for various manipulations (slicing, calculating totals and averages) before we generate the data to be displayed in the table. I think we might be passing the DTM through pandas elsewhere now, but the issue is really only one of what data gets displayed. If you want to take the top N most frequent terms in the dataframe, you can sort it and then use the head() function, but in rare instances items might be cut off in an "unnatural" order, as described in the initial post for this issue.
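Something like this toy example shows the problem (the data and column names are invented):

```python
import pandas as pd

# Invented term counts with a tie in frequency right at the cut-off point.
df = pd.DataFrame(
    {"term": ["cat", "term2", "term9", "term10", "zebra"],
     "count": [9, 4, 4, 4, 1]}
)

# Sort by frequency, with ties broken by plain string order, then take the top N.
top = df.sort_values(["count", "term"], ascending=[False, True]).head(3)

# The alphabetical tie-break keeps "cat", "term10", "term2" and drops "term9";
# a user thinking in natural order would expect "term2" and "term9" before "term10".
```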

The use of pandas is most easily understood in the algorithm for Word Cloud. Here are the basic steps:

There is no natsorting of the dataframe. Word Cloud does produce a term counts table, but that is generated from the JSON object, which has the Python natsort() function applied to it. I think MultiCloud and BubbleViz work along similar lines to Word Cloud.
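If I'm reading it right, the relevant bit works roughly like this (a sketch with made-up data; the actual names in lexos.py may differ):

```python
from natsort import natsorted

# Made-up term-count mapping of the kind handed to the Word Cloud template.
counts = {"term10": 4, "term2": 7, "term1": 9}

# natsorted() yields "term1", "term2", "term10"; plain sorted() would give
# "term1", "term10", "term2".
ordered_counts = {term: counts[term] for term in natsorted(counts)}
```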

Tokenizer is also pretty much the same, except it has lots of complications added to handle rotating the dataframe and to calculate extra statistics. I believe the data from the dataframe is extracted into a separate matrix dict to which natsort() is applied prior to output. It's possible that the slicing could be done on the natsorted dict. I'm not sure how that would affect performance.
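If the slicing were moved onto the natsorted dict, it might look something like this (just a sketch; the structure of the matrix dict is an assumption here):

```python
from itertools import islice
from natsort import natsorted

# Hypothetical matrix dict keyed by term, with per-document counts.
matrix = {"term10": [1, 0, 2], "term2": [3, 1, 0], "term1": [0, 4, 1]}

# Natsort the keys first, then slice off the first N entries, so the cut-off
# follows natural order rather than plain string order.
top_n = 2
sliced = {term: matrix[term] for term in islice(natsorted(matrix), top_n)}
```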

mleblanc321 commented 7 years ago

gee, this is more than i thought; grrr, i'm not understanding all this right now, but the handling of the DTM/dataframe is important; very helpful post, thx

scottkleinman commented 7 years ago

When you look at the code for Word Cloud in lexos.py, it's pretty simple. Tokenizer just has more operations performed on the dataframe. I'll see if I can add some extra commenting. But those are the two functions to look at.