I've started working on this in the text-features branch. I wrote a text2mat function, which takes a list (or list of lists) of text samples as input and converts them to matrices using a vectorizer (count or tfidf, or custom) followed by a text model (LDA or NMF, or custom).
To implement this in the plot function, so that users can pass text directly to hyp.plot, I extended the format_data function to detect text data (as well as count or tfidf matrices) and convert them into arrays.
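For reference, a minimal sketch of the vectorizer + text-model flow that text2mat wraps, written directly against scikit-learn (illustrative only, not the branch's exact API):

```python
# Sketch of the vectorizer -> text-model pipeline described above,
# using scikit-learn directly.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['this is some text', 'and here is some other text']

vectorizer = CountVectorizer()                       # or TfidfVectorizer, or custom
counts = vectorizer.fit_transform(docs)              # sparse docs-by-vocab count matrix

model = LatentDirichletAllocation(n_components=20)   # or NMF, or custom
topic_vectors = model.fit_transform(counts)          # dense docs-by-topics array
```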
A few questions that have come up so far:
In the typical case, if a user passes multiple lists of text, a single model is used to fit and transform the data. If the user passes a mixed list of numerical and text data, should we assume they also want to use a single model to fit the text data matrices?
We discussed automatically aligning in format_data if text and numerical types are both present. Another option would be to simply pad any matrix (including transformed text data) to the size of the largest matrix. Then the user could select alignment if they like, but it wouldn't be mandatory. Which of these solutions makes more sense? Are there others to consider?
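For the padding option, a minimal sketch of zero-padding every matrix to the widest one (pad_to_width is just an illustrative name):

```python
import numpy as np

def pad_to_width(matrices):
    """Zero-pad each matrix on the right so they all share the widest column count."""
    max_cols = max(m.shape[1] for m in matrices)
    return [np.pad(m, ((0, 0), (0, max_cols - m.shape[1])), mode='constant')
            for m in matrices]
```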
We need to distinguish between a parameter defining the number of dimensions for the text model and one defining the number of dimensions for the plot. Use case: we want to fit a model with 20 topics, and then plot in 2D. Thoughts on the choice of parameter name? n_topics would be good, but NMF doesn't return 'topics', right? Maybe text_dims?
cc: @jeremymanning
There are several possible cases that I think we should support:
Text passed as strings and lists:

- Text passed as a single string -- e.g. hyp.plot('this is some text'). Treat the string as a single document.
- Text passed as a list of strings -- e.g. hyp.plot(['this is', 'some text']). Treat each string as a document, and the list as a collection (so there will be 2 observations to plot in this case).
- Text passed as a list of lists of strings -- e.g. hyp.plot([['this is', 'some text'], ['and here is', 'some other text', 'to plot']]). Treat each list as a collection and each string as a document (so there would be 2 observations for the first collection and 3 for the second).
- Mixed lists of lists and strings -- e.g. hyp.plot([['this is', 'some text'], 'and here is some other text']). Treat each string as a document and each list as a collection. So this is equivalent to if the user had called hyp.plot([['this is', 'some text'], ['and here is some other text']]).

In each case, each document (or collection of documents) should get processed into a CountVectorizer object.
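For concreteness, here is one way those four cases could be normalized before vectorizing; as_collections is a hypothetical helper, not code from the branch:

```python
def as_collections(data):
    """Normalize text input into a list of collections, each a list of document strings."""
    if isinstance(data, str):                          # 'this is some text'
        return [[data]]
    if all(isinstance(d, str) for d in data):          # ['this is', 'some text']
        return [list(data)]
    # mixed lists and strings: each bare string becomes its own one-document collection
    return [[d] if isinstance(d, str) else list(d) for d in data]
```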
Text passed as CountVectorizer objects:

- Each CountVectorizer object is equivalent to a list of strings (i.e. a collection of documents). So if x is a CountVectorizer object created from the list of strings ['this is some', 'text to plot', 'organized as a single collection'], then hyp.plot([x, ['another collection', 'for us to deal with']]) should be equivalent to if we had instead called hyp.plot([['this is some', 'text to plot', 'organized as a single collection'], ['another collection', 'for us to deal with']]). In other words, CountVectorizer objects should be treated just like lists of strings that have already been processed.
- If the user passes multiple CountVectorizer objects, we need to verify that the vocabularies match. If not, a new vocabulary should be constructed (the union of all vocabularies of all CountVectorizer objects, plus all words/documents passed via strings, excluding stop words) and the CountVectorizer objects should be rebuilt.

All text and text-related objects (i.e. strings, lists of strings, lists of lists of strings, and CountVectorizer objects) should be processed using the default or user-specified text model (e.g. LDA, NMF, etc.), fit to all of the text data as a single common model.
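A sketch of how the vocabulary-matching/rebuilding step could work, assuming scikit-learn CountVectorizer objects (the helper name and exact stop-word handling here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def shared_vectorizer(vectorizers, extra_docs=()):
    """Union the vocabularies of already-fit CountVectorizers (plus any raw
    documents), drop English stop words, and return a single rebuilt vectorizer."""
    vocab = set()
    for v in vectorizers:
        vocab |= set(v.vocabulary_)                  # term -> column-index mapping
    if extra_docs:                                   # pick up words passed as raw strings
        vocab |= set(CountVectorizer().fit(list(extra_docs)).vocabulary_)
    vocab -= set(ENGLISH_STOP_WORDS)
    return CountVectorizer(vocabulary=sorted(vocab))
```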
Mixed text and non-text data:

If the user passes text and non-text data, then this should force align=True (unless the user has already specified that the data should be aligned). The full sequence I'm imagining is:

1. In format_data, detect that we're in the "combined text and non-text" scenario.
2. Process the data as usual within format_data (e.g. convert everything to a numpy array).
3. If (and only if) we're in this combined text/non-text case, the non-text data all have the same number of columns, and all of the datasets have the same number of observations, then align everything from within format_data and return an aligned dataset where everything is in a common space.
4. Otherwise, format_data should skip the align step and simply return a mismatched list of numpy arrays. This should cause an error from within plot (or wherever format_data is being called, if from within hypertools).
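A rough sketch of that branching, with text2mat and align standing in for the real hypertools internals (the exact checks may differ):

```python
import numpy as np

def is_text(d):
    """Strings, lists of strings, and lists of lists of strings count as text."""
    if isinstance(d, str):
        return True
    return isinstance(d, list) and len(d) > 0 and all(
        isinstance(x, (str, list)) for x in d)

def format_mixed(data, text2mat, align):
    """Sketch of the combined text/non-text path described above."""
    text_flags = [is_text(d) for d in data]
    arrays = [text2mat(d) if t else np.asarray(d, dtype=float)
              for d, t in zip(data, text_flags)]

    # non-text matrices must still agree on column count, as they do today
    numeric_cols = {a.shape[1] for a, t in zip(arrays, text_flags) if not t}
    same_rows = len({a.shape[0] for a in arrays}) == 1

    if len(numeric_cols) <= 1 and same_rows:
        return align(arrays)      # everything ends up in a common space
    return arrays                 # mismatched list; plot() will raise downstream
```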
> Text passed as strings and lists: [...]

I've got this working for strings, lists of strings, lists of lists of strings, and mixed lists.
> Text passed as CountVectorizer objects: [...]
There is an issue with this that I'm only realizing after chugging away on this for a bit... CountVectorizer objects are models of text data, but don't hold onto the 'training' data that is passed to them. Data that has been transformed by a CountVectorizer is stored as a sparse matrix, which doesn't contain the original vocab words as far as I can tell.
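A quick illustration of that behavior with scikit-learn's CountVectorizer: the transform output is just a sparse count matrix, and the term-to-column mapping lives on the fitted vectorizer rather than in the data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this is some text', 'text to plot']
vec = CountVectorizer()
counts = vec.fit_transform(docs)    # scipy.sparse matrix, shape (2, n_vocab)

print(type(counts))                 # no document strings in here, just counts
print(vec.vocabulary_)              # the term -> column mapping lives on the model
```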
> CountVectorizer objects are models of text data, but don't hold onto the 'training' data that is passed to them. [...]
Topic models don't care about the text order -- the steps to get topic vectors from text are:

1. Convert text to CountVectorizer object
2. Fit topic model using CountVectorizer object as the input

So all we have to do with CountVectorizer objects is skip the first step.
> If (and only if) we're in this combined text/non-text case and the non-text data all have the same number of columns and all of the datasets have the same number of observations, then we can align everything from within format_data and return an aligned dataset where everything is in a common space.
I'm missing why the non-text data need to have the same number of features in this case. I get that you need the same number of observations for hyperalignment, but it seems to me that hyperalignment would work fine if you had numerical data with diff number of columns + text data.
It's true that hyperalignment will run if the number of features are mismatched. But the way I'm thinking about this is that we want to preserve/match the behavior with non-text data to the extent possible. If the number of dimensions don't align and no text data gets passed, we currently throw an error. What I'm proposing is that we add an additional exception (that text data don't have to have the same number of features-- since we're creating those features inside of format_data, after the user has already passed the data to hypertools). But anything that threw an error without that additional text data should still throw an error even if text data gets added to the data list.
> 1. Convert text to CountVectorizer object
> 2. Fit topic model using CountVectorizer object as the input
As I understand it, the input to the LDA model is text data that has been transformed by a CountVectorizer object, which is just a samples-by-features matrix of word counts (not a class instance of a CountVectorizer object). The way I've got it set up now is that you can pass a 'custom' CountVectorizer object (fit or unfit) to text2mat using the vectorizer kwarg. If it is already fit, it will skip the fitting step and just transform each of the text elements with that model. In the same way, the user can pass a 'custom' fit (or unfit) text model (LDA or NMF class or class instance) using the text kwarg.
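So usage might hypothetically look like this (text2mat lives on the text-features branch and isn't importable from released hypertools; the kwarg names are as described above and could still change):

```python
# Hypothetical usage of text2mat's vectorizer/text kwargs, per the description above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

text_samples = [['some text', 'for one collection'],
                ['and text', 'for another one']]

# custom (unfit) vectorizer and custom (unfit) text model
mat = text2mat(text_samples,
               vectorizer=TfidfVectorizer(stop_words='english'),
               text=NMF(n_components=5))
```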
> It's true that hyperalignment will run if the number of features are mismatched. [...]
Great! thanks for clarifying!
I was imagining that we'd just support fitted CountVectorizer objects, as an alternative to passing in the text directly and fitting a CountVectorizer from that. But in digging into this more, I'm realizing the setup I was imagining won't work-- I had thought we could pass CountVectorizer objects directly to LDA, but (as you pointed out) that's not actually what LDA supports.
So given this "new" information/realization, I'm now leaning towards nixing support for CountVectorizer objects in the way I had initially described. What you've described re: specifying a vectorizer seems like a good approach to me.
Alright that sounds good to me. Just a few more questions before I think it's ready to merge:
1) As a shortcut to specifying a dictionary, the text2mat function has an n_components kwarg to specify the number of text dimensions. Do we want to remove that, and just support the dictionary input format (text2mat(text_samples, text={'model': 'LatentDirichletAllocation', 'params': {'n_components': 50}}))? I'm leaning toward keeping it, or some other keyword, because it's a lot to write out the full dictionary if you just want to change the dimensionality, which seems like a common parameter users would want to tweak. The other functions support this behavior as well (reduce=ndims, cluster=n_clusters), but we talked about deprecating them.
2) Since format_data wraps text2mat, and I exposed format_data in this latest code, do we want to expose text2mat? It's essentially a subfunction of format_data, but specifically handles the text data.
3) Do we want to add the text model to the geo class?
I worry that n_components is going to be confused with ndims, so I don't think we should expose that flag to the user unless it's obvious that they are referring to a text model. If they want a quick fit, they can just trust the default parameters-- and if they want to customize, then we offer a way to do that in one line (analogous to how they can tweak the parameters of the reduce and cluster models).

I think format_data is sufficient (without exposing text2mat directly to the user)-- if the user passes text data to format_data, isn't the behavior the same as if they had called text2mat directly? What would separating out text2mat buy the user in terms of convenience or functionality?

> I worry that n_components is going to be confused with ndims [...]

Roger that!

> I think format_data is sufficient (without exposing text2mat directly to the user) [...]

It would be the same behavior, but more limited (to text), so I'll leave it private.
Yeah, let's add the text model to the geo class so that we can fit new text with the already-fit model (so that everything stays compatible).

Good idea. Sounds good.
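A sketch of what storing the text pipeline on the geo could enable (the class and attribute names here are hypothetical, not the actual geo API):

```python
# Hypothetical sketch: if the geo keeps the fitted text pipeline,
# new text can be projected into the same space later without refitting.
class TextGeo:
    def __init__(self, vectorizer, text_model, reduce_model):
        self.vectorizer = vectorizer      # fitted CountVectorizer / TfidfVectorizer
        self.text_model = text_model      # fitted LDA / NMF
        self.reduce_model = reduce_model  # fitted reducer used when plotting

    def transform_text(self, docs):
        """Project new documents with the already-fit models."""
        counts = self.vectorizer.transform(docs)
        topics = self.text_model.transform(counts)
        return self.reduce_model.transform(topics)
```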
When the user passes text to hypertools, we could turn the text into a CountVectorizer matrix and plot it (or analyze it) using the existing hypertools functions.
Similarly, we could directly support CountVectorizer matrices.
Sample code: https://github.com/ContextLab/storytelling-with-data/blob/master/data-stories/twitter-finance/twitter-finance.ipynb
This would be especially useful in conjunction with using LDA or NMF to cluster or reduce the data (see this issue). For example, the user could pass in a list of lists of strings (one list per theme-- e.g. a collection of tweets from one user) and get back a list of topic vector matrices, all fit using a common model.
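For example, a sketch of that "common model" behavior using scikit-learn directly (not hypertools code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

collections = [['markets rallied after the earnings report',
                'tech stocks led the gains'],
               ['the central bank held rates steady',
                'inflation data came in below expectations',
                'bond yields fell on the news']]

# fit one vectorizer and one topic model on all documents pooled together
all_docs = [doc for coll in collections for doc in coll]
vec = CountVectorizer(stop_words='english')
lda = LatentDirichletAllocation(n_components=5).fit(vec.fit_transform(all_docs))

# then transform each collection with the shared model:
# a list of (n_docs_in_collection x n_topics) matrices, one per collection
topic_mats = [lda.transform(vec.transform(coll)) for coll in collections]
```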