This looks awesome!
Is there (or should there be) a default text model (e.g. the Wikipedia model we've been kicking around)? Perhaps we could do something like:

- `text_model=None` (default) -- compute model from aggregate data (across all lists containing strings)
- `text_model='wiki'` -- use Wikipedia topics/vocab, ignore all words not included in that vocab
- `text_model=<model object>` -- use user-specified model

Should the fitted model also be stored in the `DataGeometry` object?

Another thought: should we support word2vec?
Supporting a wiki model would be a great idea, esp for short texts like tweets. Currently, `text_model` refers to the type of model used for the text data (LDA or NMF), and in both cases, the model is derived from the input data. `text_model=None` simply skips the modeling step and returns the vectorized text, which may be desired in some situations. So, maybe just adding a `text_model='wiki'` option and a `text_model=<model object>` option would be sufficient. I also think supporting word2vec would be awesome, but I need to do a little research to see what's available.
For word2vec, did you see this library I linked to? https://github.com/danielfrg/word2vec
For `text_model`, it seems worth defining those arguments similarly to how `reduce_model`, etc. are defined-- e.g.:

- if the user doesn't specify anything (or defines as `None`), we default to a pre-selected and pre-trained model (e.g. LDA with wikipedia-derived topics)
- if the user specifies a string (for a supported model), we use that model with pre-selected parameters (e.g. number of features, etc.)
- if the user specifies a dictionary (whose keys are arguments), we use that to fill in any defined features, reverting to defaults for whatever the user doesn't specify

I also haven't fully thought through what happens if some of the data gets specified as matrices, and other data gets specified as text. E.g.:

- after fitting the text model, this means that different datasets might have different numbers of dimensions... so how do we deal with that?
- do we support count matrices as an alternative way of inputting "text" data?
- how do we display the results to the user? e.g. can they view the vocabulary and/or the topics? perhaps these (via a model object) should go in the `DataGeometry` object somehow? and then when new text data are passed to `geo.plot`, we could use the fitted model to compute topics for the new text data.

Ah, I didn't see your word2vec link, but I have used that particular library before. In the same way that we will have a predefined topic model fit to wiki data, we could have a predefined word2vec model fit to the same wiki dataset. If a sentence is passed as the input, it's clear to me what the output of a topic model would be. However, word2vec outputs a vector for each word. Would we (a) average the vectors together, (b) plot a separate point for each word, or (c) some other behavior?
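For reference, option (a) could be as simple as this sketch (the `vectors` lookup here is a stand-in for whatever pretrained word2vec model we'd end up shipping):

```python
import numpy as np

def sentence_to_vec(sentence, vectors, dim=300):
    # `vectors` is a hypothetical dict-like mapping of words to 1-D numpy
    # arrays, standing in for a pretrained word2vec model
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)  # no in-vocabulary words: fall back to zeros
    return np.mean([vectors[w] for w in words], axis=0)
```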
> if the user doesn't specify anything (or defines as `None`), we default to a pre-selected and pre-trained model (e.g. LDA with wikipedia-derived topics)

If the user sets `reduce=None`, the `analyze` function simply returns the data with no dimensionality reduction. However, the default behavior is to reduce with `IncrementalPCA`. We could default to either using a pretrained model, or deriving the model from the input data. I'm happy with either option, but i guess i'd lean toward a predefined model because it will typically be more stable unless the input data is large.
> if the user specifies a string (for a supported model), we use that model with pre-selected parameters (e.g. number of features, etc.)

✔️
> if the user specifies a dictionary (whose keys are arguments), we use that to fill in any defined features, reverting to defaults for whatever the user doesn't specify

✔️
> I also haven't fully thought through what happens if some of the data gets specified as matrices, and other data gets specified as text.

In my current implementation, this behavior is not supported, but if we can make it work, it may be useful.
> after fitting the text model, this means that different datasets might have different numbers of dimensions... so how do we deal with that?

One idea is to force the text model to have the same number of dims as the other numerical matrices. However, this wouldn't work with a predefined model. 🤔
> do we support count matrices as an alternative way of inputting "text" data?

I think so? if you input a samples-by-words count matrix to `hyp.plot`, it will create a plot.
> how do we display the results to the user? e.g. can they view the vocabulary and/or the topics? perhaps these (via a model object) should go in the DataGeometry object somehow? and then when new text data are passed to geo.plot, we could use the fitted model to compute topics for the new text data.

In the current implementation, the user would have to pass the vocab/text samples as labels (it's not done automatically). They don't have access to the vocab/topics. I don't see an intuitive place to store this info in the `DataGeometry` objects, so we may have to think about adding a field if we want to support this behavior. One thing to clarify is whether we want to treat the `text_model` as the `reduce` model, or keep them separate. A reason to keep them separate is that you might want to fit the text data to a topic model with, say, 20 topics, but then reduce the data down to 3 dims with PCA or another reduction alg-- e.g. as in the sketch below.
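For example, keeping them separate would allow something like this plain-scikit-learn sketch (illustrative only, not hypertools internals):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA

docs = ['i like cats alot', 'cats r pretty cool', 'cats are better than dogs',
        'dogs rule the haus', 'dogs are my jam', 'dogs are a mans best friend']

# text_model step: vectorize the documents, then fit a 20-topic model
counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=20).fit_transform(counts)

# reduce step: separately project the 20-d topic vectors down to 3 dims
coords = PCA(n_components=3).fit_transform(topics)
print(coords.shape)  # (6, 3)
```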
I'm working on the wikipedia-derived topic model. Right now, I am using the wikipedia python package to retrieve the text of pages that were specified in the Matlab wiki model. Then, I am planning to use the sklearn `CountVectorizer` and `LatentDirichletAllocation` to fit a topic model where `n_topics=100`. @jeremymanning what should I use for the alpha parameter? It defaults to `1/n_topics`, but the matlab version seems to have used `25/n_topics`.
Any other parameters I should change from default?
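For reference, alpha is sklearn's `doc_topic_prior` argument (its default of `None` means `1 / n_components`), so matching the matlab convention would look roughly like this (sketch uses the current `n_components` name for the topic count):

```python
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 100
# doc_topic_prior is the alpha parameter; set it to 25 / n_topics to
# match the matlab model instead of sklearn's 1 / n_components default
lda = LatentDirichletAllocation(n_components=n_topics,
                                doc_topic_prior=25 / n_topics,
                                learning_method='batch')
```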
I created a model with 100 topics and the rest default params, except for the `learning_method`, which I changed from 'online' to 'batch' because it threw a warning that the default will change in the next release. It's pretty large (500 MB). How do we want to handle the data? There are a few options I can think of, with varying times to implement:

1. Create a local folder for data, and if the model isn't there, download it
2. Load it on the fly from google drive
3. Manually load it (e.g. `wiki = hyp.load('wiki'); hyp.plot(text, text_model=wiki)`)

Option 1 seems like the most elegant to me, but so far we've avoided having a local data folder.
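Option 1 could look roughly like this (a sketch; the URL and function name are placeholders, though `~/hypertools_data` matches the cache location mentioned later in this thread):

```python
import os
import pickle
import requests

DATA_DIR = os.path.join(os.path.expanduser('~'), 'hypertools_data')
WIKI_MODEL_URL = 'https://example.com/wiki_model.pkl'  # placeholder URL

def load_wiki_model():
    """Load the wiki model from the local data folder, downloading it
    first if it isn't there yet."""
    os.makedirs(DATA_DIR, exist_ok=True)
    path = os.path.join(DATA_DIR, 'wiki_model.pkl')
    if not os.path.exists(path):
        resp = requests.get(WIKI_MODEL_URL)
        resp.raise_for_status()
        with open(path, 'wb') as f:
            f.write(resp.content)
    with open(path, 'rb') as f:
        return pickle.load(f)
```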
The latest: I wrote a `text2mat` function, which takes a list (or list of lists) of text samples as input and converts them to matrices using a vectorizer (by setting the `vectorizer` parameter to count or tfidf, or custom) followed by a text model (by setting the `text_model` parameter to LDA or NMF, or custom). The custom model can be a prefit model instance (like the wiki model), or a class. Custom models must follow the scikit-learn transformer API (`fit`, `transform`, `fit_transform` methods).
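In spirit, the pipeline is just vectorizer → model, with either stage swappable for anything implementing the transformer API. A stripped-down sketch (not the actual `text2mat` code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def text_to_matrices(doc_lists, vectorizer=CountVectorizer,
                     text_model=LatentDirichletAllocation):
    """Convert a list of lists of text samples into a list of matrices."""
    # pool all text so the vectorizer/model share one feature space
    flat = [doc for docs in doc_lists for doc in docs]
    counts = vectorizer().fit_transform(flat)
    if isinstance(text_model, type):
        mat = text_model().fit_transform(counts)  # class: fit to the input data
    else:
        # prefit instance (e.g. the wiki model): just transform; assumes the
        # vectorizer matches the one the model was originally fit with
        mat = text_model.transform(counts)
    # split the stacked result back into one matrix per input list
    out, start = [], 0
    for docs in doc_lists:
        out.append(mat[start:start + len(docs)])
        start += len(docs)
    return out
```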
To implement this in the plot function so that users can pass text directly to `hyp.plot`, i will:

- update the `format_data` function to detect text data and convert it into a samples-by-features numpy array, where `n_topics` (default: 20) defines the number of features. (Since NMF does not return 'topics' per se, should we have a more general kwarg name, like `n_text_features`?)
- figure out whether the fitted model should be stored in the `DataGeometry` object

hey @andrewheusser -- where are we with this?
almost done, but i still want to add the option to use the fitted `wiki` model. i've created the model, but i haven't set up a way to access it (via download, or load from disk if it exists).
:+1: got it-- thanks!
@jeremymanning i think this is finally ready for your review. here is a list of the changes made on this PR: https://github.com/ContextLab/hypertools/releases/tag/untagged-86d0fbc6541a2e29d6bb. Let me know if you have questions!
I tried the demo that you describe above:

```python
data = [['i like cats alot', 'cats r pretty cool', 'cats are better than dogs'],
        ['dogs rule the haus', 'dogs are my jam', 'dogs are a mans best friend']]
hyp.plot(data, 'o')
```
I'm getting this error:
```
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-3-0b520719b960> in <module>()
      1 data = [['i like cats alot', 'cats r pretty cool', 'cats are better than dogs'],
      2         ['dogs rule the haus', 'dogs are my jam', 'dogs are a mans best friend']]
----> 3 hyp.plot(data,'o')

/usr/local/lib/python3.6/site-packages/hypertools/plot/plot.py in plot(x, fmt, marker, markers, linestyle, linestyles, color, colors, palette, group, hue, labels, legend, title, size, elev, azim, ndims, model, model_params, reduce, cluster, align, normalize, n_clusters, save_path, animate, duration, tail_duration, rotations, zoom, chemtrails, precog, bullettime, frame_rate, explore, show, transform, vectorizer, semantic, corpus, ax)
    246     # analyze the data
    247     if transform is None:
--> 248         raw = format_data(x, **text_args)
    249         xform = analyze(raw, ndims=ndims, normalize=normalize, reduce=reduce,
    250                         align=align, internal=True)

/usr/local/lib/python3.6/site-packages/hypertools/tools/format_data.py in format_data(x, vectorizer, semantic, corpus, ppca, text_align)
    120             text_data.append(np.array(i).reshape(-1, 1))
    121     # convert text to numerical matrices
--> 122     text_data = text2mat(text_data, **text_args)
    123
    124     # replace the text data with transformed data

/usr/local/lib/python3.6/site-packages/hypertools/_shared/helpers.py in memoizer(*args, **kwargs)
    169         key = str(args) + str(kwargs)
    170         if key not in cache:
--> 171             cache[key] = obj(*args, **kwargs)
    172         return cache[key]
    173     return memoizer

/usr/local/lib/python3.6/site-packages/hypertools/tools/text2mat.py in text2mat(data, vectorizer, semantic, corpus)
     80         semantic = 'LatentDirichletAllocation'
     81     elif semantic in ('wiki', 'nips', 'sotus',):
---> 82         semantic = load(semantic + '_model')
     83         vectorizer = None
     84         model_is_fit = True

/usr/local/lib/python3.6/site-packages/hypertools/tools/load.py in load(dataset, reduce, ndims, align, normalize, download)
    108         data = DataGeometry(**geo)
    109     elif dataset in datadict.keys():
--> 110         data = _load_data(dataset, datadict[dataset])
    111     else:
    112         raise RuntimeError('No data loaded. Please specify a .geo file or '

/usr/local/lib/python3.6/site-packages/hypertools/tools/load.py in _load_data(dataset, fileid)
    146         data = _load_from_disk(dataset)
    147     else:
--> 148         data = _load_from_disk(dataset)
    149     return data
    150

/usr/local/lib/python3.6/site-packages/hypertools/tools/load.py in _load_from_disk(dataset)
    174     try:
    175         with open(fullpath, 'rb') as f:
--> 176             return pickle.load(f)
    177     except ValueError as e:
    178         print(e)

EOFError: Ran out of input
```
@jeremymanning - i modified the load function with a try statement: it attempts to load in an example dataset and, if that fails, redownloads the dataset and loads it in. I think this should fix the issue you were having above.
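Roughly this pattern (a sketch of the idea, not the exact diff; the `_download` helper is hypothetical):

```python
import pickle

def load_or_redownload(dataset, fullpath):
    # if the cached pickle is corrupt or truncated, re-download and retry
    try:
        with open(fullpath, 'rb') as f:
            return pickle.load(f)
    except (EOFError, ValueError):
        _download(dataset, fullpath)  # hypothetical re-download helper
        with open(fullpath, 'rb') as f:
            return pickle.load(f)
```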
i think this is ready to merge now!
This looks great! However, I found a couple of bugs (I think):

```python
sotus = hyp.load('sotus')
hyp.plot(sotus, '.')  # why are the dots different colors? how is coloring determined?
hyp.plot(sotus)       # nothing shows up-- but I think this should result in a line plot
```
^ typos corrected above
hmm, i think this is the expected behavior. `sotus` is a `geo`, so you can just do:

```python
geo = hyp.load('sotus')
geo.plot()
```

although, the way you did it isn't wrong, because `hyp.plot` can handle geos. The colors are different for different groups of dots because the data is parsed up into a list of numpy arrays, where each array contains a different president's sotus, e.g. `[bush1, bush2, clinton...]`. The labels do not show up because when you pass a `geo` to `hyp.plot`, the default arguments are applied (and the default is `labels=None`). We could change this such that `hyp.plot(geo)` just calls `geo.plot()` internally, but then any arguments that are input would have to be ignored, i think.
ah, i didn't see the second one. that's definitely a bug haha
@andrewheusser is the bug now squashed or should i hold off on further review?
Not squashed! hold off and I'll tackle it after CNS
now that i'm thinking about this... i'm wondering about our design decision to support handling of geos in `hyp.plot`. It doesn't seem necessary, considering that all geos have the plot method already attached to them. for example:

```python
sotus = hyp.load('sotus')
sotus.plot()     # this is the intended API and works
hyp.plot(sotus)  # the line plots don't show up
```

before i dig into why, i wanted to see if there was a good reason to support geo as an input format for `hyp.plot`. One reason i can think of is that it resets the default arguments, allowing the user to create a new geo with all of the default arguments, but that's really the only difference, i think..
ah - i figured out why `hyp.plot(sotus)` doesn't work. after processing, the data is a list of 1x3 matrices, and each matrix has only 1 coordinate (you need 2 coordinates to draw a line), so matplotlib doesn't draw anything. for this to be drawn as a line, the text data would need to be input as a list of lists of strings, instead of a list of strings.
Shouldn't we support lists of strings in addition to lists of lists of strings? A list of strings is the analog of a single array or dataframe, and a list of lists of strings is like a list of arrays/dataframes
we do support a list of strings (in addition to lists of lists of strings). it's just that for lists of strings, each string is treated as a document and transformed to a single point. this works fine with point plots, but line plots won't work.
this is actually not specific to text. if you do something like:

```python
hyp.plot([np.random.rand(1, 10) for i in range(10)])
```

it will create an empty plot because of the way that matplotlib handles drawing lines from single points (it doesn't draw anything).

possible solution: if `arr.shape[0]==1`, format as a dot instead of a line.
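for example, this minimal matplotlib sketch forces a dot for single-row arrays (illustrative only, not the hypertools internals):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for arr in [np.random.rand(1, 2) for _ in range(10)]:
    # single observation: force a dot, since a one-point line is invisible
    fmt = '.' if arr.shape[0] == 1 else '-'
    ax.plot(arr[:, 0], arr[:, 1], fmt)
plt.show()
```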
This makes sense... But we should reformat the sotu data so that it works as intended (if that's not already done). I like the "force plotting a dot if only one observation" solution.
Python 3 notes (same behavior as 2.7 unless noted below):

> `hyp.describe(sotus)`: empty plot

> Plotting a single string produces no plot-- e.g. `hyp.plot('this is a test')`. Same with plotting a single list of length 1 with a string (`hyp.plot(['this is a test'])`). Both of these should either plot a single point, or output a warning that the use case is not supported.

The issue was more a general bug with handling datasets where nrows < ndims. Now, if dimensionality reduction cannot be performed (nrows==1 across all datasets), a warning is thrown and zeros of shape (1, ndims) are returned; elif the number of rows < ndims, a warning is thrown that says the data will be reduced to the number of rows (see the sketch after this list).

> Plotting a list of length 2 (both with strings) also produces no output-- e.g. `hyp.plot(['this is a test', 'is it not?'])`. I'm not sure what the expected behavior is, but I was thinking I'd get a line...? In general, I was thinking each string represents one document (a point), and each list represents a collection of documents (a trajectory). If the user passes a list of lists of strings, that reflects multiple collections of documents.

This is now fixed such that if a list of strings is passed, a line will be plotted. if nrows < ndims (as is the case in this example), a warning will be thrown that the data dimensionality will be reduced to nrows.

> By the same logic, this should produce a line (for the first document collection) and a single point (for the second document collection): `hyp.plot([['this is a test', 'is it not?'], ['yes, i think it is a test']])`. This (correctly) plots two trajectories as expected: `hyp.plot([['this is a test', 'is it not?'], ['yes, i think it is a test', "but i don't like tests!"]])` -- so something seems off about the above examples (possibly the same issue related to plotting a single point for a single document, even if the user specifies a line).

If the shape of an array is 1 x something, we now plot a point, even when a line is specified as the format string (which is the default).

> I think this should result in each of wiki, nips, and weights being plotted in different colors (or possibly each element of those data structures being plotted in different colors): `hyp.plot([wiki, nips, weights], '.')`

When regenerating the text geos, i accidentally saved the nips data in the wiki geo... it's corrected now. i.e. wiki is all plotted in one color, nips in another, and weights plotted in many colors (because it's a list of matrices). (you'll have to clear your cache to make it work if you want to test it out again.)

> Future feature request: make geo objects iterable and indexable, and possibly have them be extensions of numpy arrays or dataframes

Adding an issue

> `hyp.load` "no data loaded" error should be updated to include new (text) datasets

Done

> describe function seems to not be working-- e.g. `hyp.describe(sotus)`

Works now!

> should we get rid of describe_pca?

Yes, we had a warning in the last version that it is deprecated. Done.
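The nrows < ndims guard described above, sketched in code (the function name and use of PCA are illustrative, not the actual hypertools reduction machinery):

```python
import warnings
import numpy as np
from sklearn.decomposition import PCA

def reduce_with_guard(arrays, ndims=3):
    """Sketch: handle datasets with fewer rows than requested dims.
    Assumes all arrays share the same feature space."""
    max_rows = max(arr.shape[0] for arr in arrays)
    if max_rows == 1:
        # reduction is impossible with a single observation per dataset
        warnings.warn('nrows==1 across all datasets; returning zeros.')
        return [np.zeros((1, ndims)) for _ in arrays]
    if max_rows < ndims:
        warnings.warn('nrows < ndims; reducing to %d dims instead.' % max_rows)
        ndims = max_rows
    # reduce the stacked data, then split back into the original datasets
    reduced = PCA(n_components=ndims).fit_transform(np.vstack(arrays))
    out, start = [], 0
    for arr in arrays:
        out.append(reduced[start:start + arr.shape[0]])
        start += arr.shape[0]
    return out
```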
almost everything looks good-- except `hyp.describe(sotus)` still isn't working for me:

```
/Users/jmanning/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/hypertools/tools/describe.py:62: UserWarning: When input data is large, this computation can take a long time.
  warnings.warn('When input data is large, this computation can take a long time.')
/Users/jmanning/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/seaborn/timeseries.py:183: UserWarning: The tsplot function is deprecated and will be removed or replaced (in a substantially altered version) in a future release.
  warnings.warn(msg, UserWarning)
```

Seems like a seaborn thing? Maybe as simple as updating the requirements list...
hmmm, it works fine for me! the seaborn thing is just a warning that they will be deprecating tsplot (which the describe function uses). are you getting an error somewhere else?
Ah. Here is the actual error (previous test was w/ python 2; here is the python 3 error):
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-003ba24b692e> in <module>()
----> 1 hyp.describe(sotus)

/usr/local/lib/python3.6/site-packages/hypertools/tools/describe.py in describe(x, reduce, max_dims, show, format_data)
    100     if show:
    101         fig, ax = plt.subplots()
--> 102         ax = sns.tsplot(data=result['individual'], time=[i for i in range(2, max_dims+2)], err_style="unit_traces")
    103         ax.set_title('Correlation with raw data by number of components')
    104         ax.set_ylabel('Correlation')

/usr/local/lib/python3.6/site-packages/seaborn/timeseries.py in tsplot(data, time, unit, condition, value, err_style, ci, interpolate, color, estimator, n_boot, err_palette, err_kws, legend, ax, **kwargs)
    264                                          time=times,
    265                                          unit=units,
--> 266                                          cond=conds))
    267
    268     # Set up the err_style and ci arguments for the loop below

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   6209     lengths = list(set(raw_lengths))
   6210     if len(lengths) > 1:
-> 6211         raise ValueError('arrays must all be same length')
   6212
   6213     if have_dicts:

ValueError: arrays must all be same length
```
ah, can you try clearing your cache? you probably have an old version of the sotus dataset. the cache is in /Users/yourname/hypertools_data/
that worked! let's either clear the cache on installation of this version or add something to the documentation warning users of this issue. merging...
This PR adds the ability to plot text data. For example:
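(presumably the cats/dogs demo quoted earlier in the thread:)

```python
data = [['i like cats alot', 'cats r pretty cool', 'cats are better than dogs'],
        ['dogs rule the haus', 'dogs are my jam', 'dogs are a mans best friend']]
hyp.plot(data, 'o')
```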
yields a plot where each dot represents a sentence that was vectorized using sklearn's `CountVectorizer` and then modeled using `LatentDirichletAllocation`.

To plot just the vectorized text, simply set `text_model=None`:

```python
hyp.plot(data, text_model=None)
```
I exposed the `hyp.tools.text2mat` function to the user, and that's what does the heavy lifting. It can vectorize the data using `CountVectorizer` or `TfidfVectorizer` and model the data using `LDA` or `NMF`.
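For reference, a minimal usage sketch relying on the defaults (presumably `CountVectorizer` + LDA, per the example above; note the merged version exposes the model choice via the `semantic` kwarg visible in the tracebacks earlier in this thread):

```python
import hypertools as hyp

data = [['i like cats alot', 'cats r pretty cool', 'cats are better than dogs'],
        ['dogs rule the haus', 'dogs are my jam', 'dogs are a mans best friend']]

# convert the text to matrices directly...
mats = hyp.tools.text2mat(data)

# ...or let hyp.plot handle the conversion internally
hyp.plot(data, 'o')
```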