scottkleinman opened this issue 7 years ago
i've lost the sequence of data structures as we proceed thru workflow; cheng, is this in any of your docs? should we start again?
I don't think this is in my docs. I guess we can do some investigation on that this summer.
Is it here?
in fact, I am not even very sure why or where we used natsort...
Wow! I think I commented on @mleblanc321's query before I had coffee and thought he was asking about the order of scrubbing. Totally off base.
Here's a better answer. In Tokenizer, the DTM numpy array is pulled into a pandas dataframe, which allows for various manipulations (slicing, calculating totals and averages) before we generate the data to be displayed in the table. I think we might be passing the DTM through pandas elsewhere now, but the issue is really only one of which data gets displayed. If you want the top N most frequent terms in the dataframe, you can sort it and then use the head() function, but in rare instances items may be cut off in an "unnatural" order, as described in the initial post for this issue.
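A quick sketch of that cut-off behavior, using toy data rather than Lexos's actual DTM: after sorting by counts, head(N) slices at an arbitrary point among equal-count terms, so a term like "term10" can survive the cut while "term2" is dropped, even though a user thinking alphabetically would expect the reverse.

```python
import pandas as pd

# Toy term-count column from a DTM (hypothetical data, not Lexos's actual output).
counts = pd.Series({"apple": 9, "term2": 4, "term10": 4, "zebra": 4, "banana": 7})
df = counts.to_frame("count")

# Sort by frequency (mergesort is stable, so ties keep their prior lexicographic
# order), then take the top N rows.
top3 = df.sort_index().sort_values("count", ascending=False, kind="mergesort").head(3)

# "term10" makes the cut but "term2" does not, because "term10" sorts before
# "term2" lexicographically -- the "unnatural" order described above.
print(list(top3.index))
```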
The use of pandas is easiest understood in the algorithm for Word Cloud. Here are the basic steps:

1. utility.simpleVectorizer
2. d3.js

There is no natsorting. Word Cloud does produce a term counts table, but that is produced from the JSON object, which has the Python natsort() applied to it. I think MultiCloud and BubbleViz work along similar lines to Word Cloud.
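A rough sketch of that pipeline, with placeholder data and names (Lexos's actual vectorization happens in utility.simpleVectorizer, and the stdlib natural_key below only approximates what the natsort library does):

```python
import json
import re
from collections import Counter

def natural_key(s):
    # Stdlib stand-in for natsort's key: split out digit runs so they compare numerically.
    return [int(p) if p.isdigit() else p.lower() for p in re.split(r"(\d+)", s)]

# Stand-in for the vectorization step: count terms in a toy text.
text = "ch1 ch1 ch10 ch2 ch2 ch2"
counts = Counter(text.split())

# JSON object handed to d3.js for rendering the cloud (no sorting needed here).
cloud_json = json.dumps([{"text": t, "size": n} for t, n in counts.items()])

# Term counts table built from the same data, with natural sorting applied,
# so "ch2" comes before "ch10".
table = sorted(counts.items(), key=lambda item: natural_key(item[0]))
print(table)
```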
Tokenizer is also pretty much the same, except it has lots of complications added to handle rotating the dataframe and to calculate extra statistics. I believe the data from the dataframe is extracted into a separate matrix dict to which natsort() is applied prior to output. It's possible that the slicing could be done on the natsorted dict. I'm not sure how that would affect performance.
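A minimal sketch of that extract-then-natsort step, assuming a Tokenizer-like dataframe with terms in the index; natural_key is a stdlib approximation of natsort's key, and the column names and data are invented:

```python
import re
import pandas as pd

def natural_key(s):
    # Stdlib stand-in for natsort's key: split out digit runs so they compare numerically.
    return [int(p) if p.isdigit() else p.lower() for p in re.split(r"(\d+)", s)]

# Hypothetical Tokenizer-style dataframe: terms in the index, stats in the columns.
df = pd.DataFrame(
    {"count": [3, 1, 2], "frequency": [0.5, 0.17, 0.33]},
    index=["seg1", "seg10", "seg2"],
)

# Extract the dataframe into a plain dict, then order the rows naturally for output.
matrix = df.to_dict("index")
ordered = [(term, matrix[term]) for term in sorted(matrix, key=natural_key)]
print([term for term, _ in ordered])
```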
gee, this is more than i thought; grrr, i'm not understanding all this right now, but the handling of the DTM/dataframe is important; very helpful post, thx
When you look at the code for Word Cloud in lexos.py, it's pretty simple. Tokenizer just has more operations performed on the dataframe. I'll see if I can add some extra commenting. But those are the two functions to look at.
Most of the time, we don't need to do this, but occasionally it's necessary. For instance, if we use the head() function to grab the top N rows of the dataframe with terms sorted by counts or frequencies, the rows excluded at the cut-off point may not be the ones a user thinking alphabetically would expect. This will be a very rare occurrence, but it could happen. It would be nice to figure out a way to implement natsorting of pandas dataframes, but it is not straightforward. The place to start is the note in the natsort documentation.
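For what it's worth, one possible approach: pandas (1.1 and later) accepts a key callable in sort_index()/sort_values(), and the natsort documentation suggests natsort_keygen() for exactly this. The sketch below avoids the natsort dependency by zero-padding digit runs instead, which approximates the same ordering:

```python
import pandas as pd

# Toy dataframe with a term index that sorts "unnaturally" by default.
df = pd.DataFrame({"count": [5, 3, 4]}, index=["term1", "term10", "term2"])

# Vectorized key for pandas >= 1.1: zero-pad digit runs so plain lexicographic
# order matches numeric order. (natsort's natsort_keygen() could be used instead.)
def nat_key(index):
    return index.str.replace(r"\d+", lambda m: m.group().zfill(10), regex=True)

# Natural order: term1, term2, term10.
print(df.sort_index(key=nat_key))
```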