Closed David-Herman closed 7 years ago
Thanks for the report. Could you please output the categories as well?
Also, what is your "clean_text" function?
Ah. I was able to recreate the error when all the documents were labeled with the category "a". This should probably result in a more descriptive error, ideally raised within the TermDocMatrixFactory base class.
And this will happen in 0.0.2.13, which should be coming soon!
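For illustration, the kind of check suggested above might look roughly like the following. This is only a sketch: `validate_categories` is a hypothetical helper name, not part of scattertext, and the eventual fix in the library may look quite different.

```python
from collections import Counter


def validate_categories(categories):
    """Raise a descriptive error unless at least two distinct categories exist.

    Hypothetical helper for illustration; not a scattertext function.
    """
    counts = Counter(categories)
    if len(counts) < 2:
        raise ValueError(
            "Expected documents from at least two categories, but found %d: %r. "
            "A category/not-category comparison needs both sides to be non-empty."
            % (len(counts), list(counts))
        )
    return counts
```

Called with `['a', 'a', 'a']`, this raises a `ValueError` that names the single category found, rather than failing later with an unrelated message.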
Sorry for not getting back to you in time. Ok, I am glad you can replicate the error.
Another area where the error message could be improved is a bad clean_text, i.e. one that returns u'' for all documents. The resulting error message would be unclear for debugging.
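A rough sketch of such a guard, assuming clean_text is a plain callable that takes a string and returns a string. The name `check_cleaned_texts` is made up for this example and is not a scattertext API.

```python
def check_cleaned_texts(raw_texts, clean_text):
    """Apply clean_text to every document and fail loudly if nothing survives.

    Hypothetical helper for illustration; not a scattertext function.
    """
    cleaned = [clean_text(text) for text in raw_texts]
    if not cleaned:
        raise ValueError("No documents were supplied.")
    # A cleaner that blanks every document leaves nothing to build a matrix from.
    if all(not doc.strip() for doc in cleaned):
        raise ValueError(
            "clean_text returned an empty string for every document; "
            "check the clean_text function."
        )
    return cleaned
```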
Hello,
I am playing with the internals of scattertext in order to use Bokeh as the front-end visualization. I found that for larger corpora the JavaScript takes excessively long to load. With Bokeh I can serve up the text on the fly and dynamically re-parse the documents based on filtering and such. Right now I am using several of the internal functions to generate the term-document matrix that populates the graph data.
In my case I am playing with patent text documents. The results are looking very nice so far, but I have encountered the issue shown below with a set of 5 documents (I get the same error for a larger set of documents as well). I replicated the problem in a Jupyter notebook with a dump of the problematic document set (embedded in the issue below as well).
sc = ScatterChartBokeh(corpus)
chart_dict = sc.to_dict(category='a', category_name='a', not_category_name='b',)
corpus.get_texts().tolist()