ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

Support for text features and CountVectorizer matrices #175

Closed jeremymanning closed 6 years ago

jeremymanning commented 6 years ago

When the user passes text to hypertools, we could turn the text into a CountVectorizer matrix and plot it (or analyze it) using the existing hypertools functions.

Similarly, we could directly support CountVectorizer matrices.

Sample code: https://github.com/ContextLab/storytelling-with-data/blob/master/data-stories/twitter-finance/twitter-finance.ipynb

This would be especially useful in conjunction with using LDA or NMF to cluster or reduce the data (see this issue). For example, the user could pass in a list of lists of strings (one list per theme -- e.g. a collection of tweets from one user) and get back a list of topic vector matrices, all fit using a common model.
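A minimal sketch of the "common model" idea, using plain scikit-learn rather than any hypertools API (the variable names and the three-topic choice are illustrative): fit one vocabulary and one topic model across all collections, then transform each collection separately so every topic vector lives in the same space.

```python
# Sketch (not hypertools API): fit one shared topic model across several
# document collections, then transform each collection separately so all
# topic vectors live in a common topic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

collections = [
    ["the market fell sharply", "stocks rallied after the news"],
    ["a recipe for bread", "bake at high heat", "knead the dough well"],
]

# flatten so the vocabulary and model see every document
flat = [doc for coll in collections for doc in coll]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(flat)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)  # one common model for all collections

# one topic-vector matrix per collection, all in the same topic space
matrices = [lda.transform(vec.transform(coll)) for coll in collections]
```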

andrewheusser commented 6 years ago

I've started working on this in the text-features branch. I wrote a text2mat function, which takes a list (or list of lists) of text samples as input and converts them to matrices using a vectorizer (count or tfidf, or custom) followed by a text model (LDA or NMF or custom).

To wire this into the plot function so that users can pass text directly to hyp.plot, the format_data function can be extended to detect text data (or count/tfidf matrices) and convert them into arrays.
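A hypothetical sketch of the text2mat pipeline described above: vectorize, then apply a text model. The signature and option names here are assumptions for illustration; the actual text-features branch may differ.

```python
# Illustrative text2mat-style helper (names are assumptions, not the
# actual branch code): vectorizer step followed by a text-model step.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

VECTORIZERS = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
MODELS = {"LDA": LatentDirichletAllocation, "NMF": NMF}

def text2mat(data, vectorizer="count", text="LDA", n_components=5):
    """Convert a list (or list of lists) of strings to topic matrices."""
    if isinstance(data[0], str):  # a single collection: wrap it
        data = [data]
    vec = VECTORIZERS[vectorizer]()
    model = MODELS[text](n_components=n_components)
    # fit one vocabulary and one model over all documents
    flat = [doc for coll in data for doc in coll]
    model.fit(vec.fit_transform(flat))
    # transform each collection with the common model
    return [model.transform(vec.transform(coll)) for coll in data]
```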

A few questions that have come up so far:

cc: @jeremymanning

jeremymanning commented 6 years ago

There are several possible cases that I think we should support:

Text passed as strings and lists:

  1. Text passed as a single string -- e.g. hyp.plot('this is some text'). Treat the string as a single document.
  2. Text passed as a list of strings -- e.g. hyp.plot(['this is', 'some text']). Treat each string as a document, and the list as a collection (so there will be 2 observations to plot in this case).
  3. Text passed as a list of lists of strings -- e.g. hyp.plot([['this is', 'some text'], ['and here is', 'some other text', 'to plot']]). Treat each list as a collection and each string as a document (so there would be 2 observations for the first collection and 3 for the second).
  4. Mixed lists of lists and strings -- e.g. hyp.plot([['this is', 'some text'], 'and here is some other text']). Treat each string as a document and each list as a collection. So this is equivalent to the user having called hyp.plot([['this is', 'some text'], ['and here is some other text']]).

In each case, each document (or collection of documents) should get processed into a CountVectorizer object.
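The four input cases above could be normalized into one canonical form (a list of collections, each a list of document strings) before any vectorizing happens. A sketch, with a hypothetical helper name:

```python
# Hypothetical helper (not hypertools code) normalizing the four text
# input cases into a list of collections of document strings.
def normalize_text(data):
    # Case 1: a single string is one document in one collection
    if isinstance(data, str):
        return [[data]]
    # Case 2: a flat list of strings is one collection of documents
    if all(isinstance(x, str) for x in data):
        return [list(data)]
    # Cases 3-4: each inner list is a collection; each bare string
    # becomes its own single-document collection
    return [[x] if isinstance(x, str) else list(x) for x in data]
```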

Text passed as CountVectorizer objects:

  1. Each CountVectorizer object is equivalent to a list of strings (i.e. a collection of documents). So if x is a CountVectorizer object created from the list of strings ['this is some', 'text to plot', 'organized as a single collection'], then hyp.plot([x, ['another collection', 'for us to deal with']]) should be equivalent to calling hyp.plot([['this is some', 'text to plot', 'organized as a single collection'], ['another collection', 'for us to deal with']]). In other words, CountVectorizer objects should be treated just like lists of strings that have already been processed.
  2. If the user passes multiple CountVectorizer objects, we need to verify that the vocabularies match. If not, a new vocabulary should be constructed (the union of all vocabularies of all CountVectorizer objects, plus all words/documents passed via strings, excluding stop words) and CountVectorizer objects should be rebuilt.
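The vocabulary-matching check in (2) could look roughly like this sketch (the helper name is illustrative). Note that, as a later comment in this thread observes, a fitted CountVectorizer does not retain its training documents, so recomputing counts under the merged vocabulary still requires the original text.

```python
# Sketch of the vocabulary-matching step: if fitted CountVectorizers
# disagree, rebuild them over the union vocabulary so their transforms
# produce columns in a common order.
from sklearn.feature_extraction.text import CountVectorizer

def merge_vectorizers(vectorizers):
    vocabs = [set(v.vocabulary_) for v in vectorizers]
    if all(v == vocabs[0] for v in vocabs):
        return vectorizers                    # already compatible
    union = sorted(set().union(*vocabs))      # shared vocabulary
    # a CountVectorizer built with an explicit vocabulary needs no fit
    return [CountVectorizer(vocabulary=union) for _ in vectorizers]
```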

All text and text-related objects (i.e. strings, lists of strings, lists of lists of strings, and CountVectorizer objects) should be processed using the default or user-specified text model (e.g. LDA, NMF, etc.). The model should be fit as follows:

Mixed text and non-text data: If the user passes text and non-text data, then this should force align=True (unless the user has already specified that the data should be aligned). The full sequence I'm imagining is:

  1. Within format_data, detect that we're in the "combined text and non-text" scenario
  2. Separate out the text-related data, turn it into CountVectorizer objects, fit a text model, and replace those datasets with topic vector matrices (or word feature matrices)
  3. Now deal with the non-text data, just like we normally do within format_data (e.g. convert everything to a numpy array)
  4. If (and only if) we're in this combined text/non-text case and the non-text data all have the same number of columns and all of the datasets have the same number of observations, then we can align everything from within format_data and return an aligned dataset where everything is in a common space.
    • if we're in this combined text/non-text case and either the non-text data don't have the same number of columns, or if the datasets have different numbers of observations, then format_data should skip the align step and simply return a mismatched list of numpy arrays. This should cause an error from within plot (or wherever format_data is being called, if from within hypertools).
andrewheusser commented 6 years ago

Text passed as strings and lists: Text passed as a single string -- e.g. hyp.plot('this is some text'). Treat the string as a single document. Text passed as a list of strings -- e.g. hyp.plot(['this is', 'some text']). Treat each string as a document, and the list as a collection (so there will be 2 observations to plot in this case). Text passed as a list of lists of strings -- e.g. hyp.plot([['this is', 'some text'], ['and here is', 'some other text', 'to plot']]). Treat each list as a collection and each string as a document (so there would be 2 observations for the first collection and 3 for the second). Mixed lists of lists and strings -- e.g. hyp.plot([['this is', 'some text'], 'and here is some other text']). Treat each string as a document and each list as a collection. So this is equivalent to the user having called hyp.plot([['this is', 'some text'], ['and here is some other text']])

I've got this working for strings, lists of strings, lists of lists of strings, and mixed lists.

Text passed as CountVectorizer objects: Each CountVectorizer object is equivalent to a list of strings (i.e. a collection of documents). So if x is a CountVectorizer object created from the list of strings ['this is some', 'text to plot', 'organized as a single collection'], then hyp.plot([x, ['another collection', 'for us to deal with']]) should be equivalent to calling hyp.plot([['this is some', 'text to plot', 'organized as a single collection'], ['another collection', 'for us to deal with']]). In other words, CountVectorizer objects should be treated just like lists of strings that have already been processed. If the user passes multiple CountVectorizer objects, we need to verify that the vocabularies match. If not, a new vocabulary should be constructed (the union of all vocabularies of all CountVectorizer objects, plus all words/documents passed via strings, excluding stop words) and the CountVectorizer objects should be rebuilt.

There is an issue with this that I'm only realizing after chugging away on this for a bit... CountVectorizer objects are models of text data, but don't hold onto the 'training' data that is passed to them. Data that has been transformed by a CountVectorizer is stored as a sparse matrix, which doesn't contain the original vocab words as far as I can tell.

If (and only if) we're in this combined text/non-text case and the non-text data all have the same number of columns and all of the datasets have the same number of observations, then we can align everything from within format_data and return an aligned dataset where everything is in a common space.

I'm missing why the non-text data need to have the same number of features in this case. I get that you need the same number of observations for hyperalignment, but it seems to me that hyperalignment would work fine if you had numerical data with a different number of columns plus text data.

jeremymanning commented 6 years ago

There is an issue with this that I'm only realizing after chugging away on this for a bit...CountVectorizer objects are models of text data, but don't hold onto the 'training' data that is passed to them. Data that has been transformed by a CountVectorizer is stored as a sparse matrix, which doesn't contain the original vocab words as far as I can tell.

Topic models don't care about the text order -- the steps to get topic vectors from text are:

  1. Convert text to CountVectorizer object
  2. Fit topic model using CountVectorizer object as the input

So all we have to do with CountVectorizer objects is skip the first step.
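As the follow-up comment below points out, what the topic model actually consumes in step 2 is the count matrix a CountVectorizer produces, not the CountVectorizer object itself. A minimal illustration (the toy counts are made up):

```python
# LDA accepts any nonnegative document-term count matrix directly, so
# precomputed counts (e.g. from an earlier CountVectorizer.transform)
# can skip the vectorizing step entirely.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

counts = np.array([[2, 0, 1],    # made-up word counts for 2 documents
                   [0, 3, 1]])   # over a 3-word vocabulary
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vecs = lda.fit_transform(counts)
```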

If (and only if) we're in this combined text/non-text case and the non-text data all have the same number of columns and all of the datasets have the same number of observations, then we can align everything from within format_data and return an aligned dataset where everything is in a common space.

I'm missing why the non-text data need to have the same number of features in this case. I get that you need the same number of observations for hyperalignment, but it seems to me that hyperalignment would work fine if you had numerical data with a different number of columns plus text data.

It's true that hyperalignment will run if the numbers of features are mismatched. But the way I'm thinking about this is that we want to preserve/match the behavior with non-text data to the extent possible. If the numbers of dimensions don't match and no text data gets passed, we currently throw an error. What I'm proposing is that we add an additional exception (text data don't have to have the same number of features, since we're creating those features inside of format_data, after the user has already passed the data to hypertools). But anything that threw an error without the added text data should still throw an error even if text data gets added to the data list.

andrewheusser commented 6 years ago

  1. Convert text to CountVectorizer object
  2. Fit topic model using CountVectorizer object as the input

As I understand it, the input to the LDA model is text data that has been transformed by a CountVectorizer object -- just a samples-by-features matrix of word counts (not a CountVectorizer class instance). The way I've got it set up now is that you can pass a 'custom' CountVectorizer object (fit or unfit) to text2mat using the vectorizer kwarg. If it is already fit, it will skip the fitting step and just transform each of the text elements with that model. In the same way, the user can pass a 'custom' fit (or unfit) text model (an LDA or NMF class or class instance) using the text kwarg.
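The fit-or-transform branch described here could be sketched like this (the function name is illustrative; a fitted CountVectorizer carries a learned vocabulary_ attribute, which is a reasonable way to detect that it has already been fit):

```python
# Sketch of handling a user-supplied vectorizer: reuse it as-is when it
# is already fitted, otherwise fit it on the incoming documents first.
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(docs, vectorizer=None):
    vec = vectorizer or CountVectorizer()
    if hasattr(vec, 'vocabulary_'):       # already fitted: just transform
        return vec.transform(docs)
    return vec.fit_transform(docs)        # unfitted: fit, then transform
```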

It's true that hyperalignment will run if the numbers of features are mismatched. But the way I'm thinking about this is that we want to preserve/match the behavior with non-text data to the extent possible. If the numbers of dimensions don't match and no text data gets passed, we currently throw an error. What I'm proposing is that we add an additional exception (text data don't have to have the same number of features, since we're creating those features inside of format_data, after the user has already passed the data to hypertools). But anything that threw an error without the added text data should still throw an error even if text data gets added to the data list.

Great! Thanks for clarifying!

jeremymanning commented 6 years ago

I was imagining that we'd just support fitted CountVectorizer objects, as an alternative to passing in the text directly and fitting a CountVectorizer from that. But in digging into this more, I'm realizing the setup I was imagining won't work-- I had thought we could pass CountVectorizer objects directly to LDA, but (as you pointed out) that's not actually what LDA supports.

So given this "new" information/realization, I'm now leaning towards nixing support for CountVectorizer objects in the way I had initially described. What you've described re: specifying a vectorizer seems like a good approach to me.

andrewheusser commented 6 years ago

Alright that sounds good to me. Just a few more questions before I think it's ready to merge:

1) As a shortcut to specifying a dictionary, the text2mat function has an n_components kwarg to specify the number of text dimensions. Do we want to remove that and just support the dictionary input format (text2mat(text_samples, text={'model' : 'LatentDirichletAllocation', 'params' : {'n_components' : 50}}))? I'm leaning toward keeping it (or some other keyword), because it's a lot to write out the full dictionary if you just want to change the dimensionality, which seems like a common parameter users would want to tweak. The other functions support this behavior as well (reduce=ndims, cluster=n_clusters), but we talked about deprecating them.

2) Since format_data wraps text2mat, and I exposed format_data in this latest code, do we want to expose text2mat? It's essentially a subfunction of format_data that specifically handles the text data.

3) Do we want to add the text model to the geo class?
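For context on question (1), parsing the dictionary spec plus the n_components shortcut could look like this sketch (function and dict keys follow the snippet in (1); everything else is an assumption, and note the correct sklearn class name is LatentDirichletAllocation):

```python
# Illustrative parser for the {'model': ..., 'params': ...} text spec,
# with n_components as the proposed shortcut kwarg.
from sklearn.decomposition import NMF, LatentDirichletAllocation

MODELS = {'LatentDirichletAllocation': LatentDirichletAllocation,
          'NMF': NMF}

def build_text_model(text='LatentDirichletAllocation', n_components=None):
    if isinstance(text, dict):
        cls = MODELS[text['model']]
        params = dict(text.get('params', {}))
    else:
        cls, params = MODELS[text], {}
    if n_components is not None:   # shortcut sets/overrides dimensionality
        params['n_components'] = n_components
    return cls(**params)
```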

jeremymanning commented 6 years ago
andrewheusser commented 6 years ago

I worry that n_components is going to be confused with ndims, so I don't think we should expose that flag to the user unless it's obvious that they are referring to a text model. If they want a quick fit, they can just trust the default parameters -- and if they want to customize, then we offer a way to do that in one line (analogous to how they can tweak the parameters of the reduce and cluster models).

Roger that!

I think format_data is sufficient (without exposing text2mat directly to the user) -- if the user passes text data to format_data, isn't the behavior the same as if they had called text2mat directly? What would separating out text2mat buy the user in terms of convenience or functionality?

It would be the same behavior, but more limited (to text), so I'll leave it private.

Yeah, let's add the text model to the geo class so that we can fit new text with the already-fit model (so that everything stays compatible).

Good idea. Sounds good!
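The agreed-on idea can be sketched as follows; TextGeo is a hypothetical stand-in for the geo class, not hypertools code. Storing the fitted vectorizer and text model on the object lets new text be projected into the same topic space later.

```python
# Hypothetical sketch: keep the fitted vectorizer and text model on a
# geo-like object so new text transforms into the same topic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

class TextGeo:
    def __init__(self, docs, n_components=2):
        self.vectorizer = CountVectorizer()
        counts = self.vectorizer.fit_transform(docs)
        self.text_model = LatentDirichletAllocation(
            n_components=n_components, random_state=0).fit(counts)

    def transform_text(self, new_docs):
        # reuse the already-fit models so results stay compatible
        return self.text_model.transform(
            self.vectorizer.transform(new_docs))
```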