JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

[Question] Why do Document-Based Scatterplots need category? #44

Open fredguth opened 5 years ago

fredguth commented 5 years ago

Sorry to ask via the issue tracker. I tried to find the answer in the referenced arXiv article and did not know of a better channel.

I am trying to figure out how the Document-Based Scatterplot works.

I get that it uses Tf-Idf on the unigrams of the text and takes the first two unigrams of the vector (the most different terms?) as the axes. But what function is applied to each document to find its x-y position? Its "nearness" to each term?
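To make my guess concrete, here is a rough sketch of what I imagine happens. This is only my assumption about the mechanism, not Scattertext's actual code: the helper names `tfidf_matrix` and `guess_positions` are made up for illustration, the two axis terms are picked by hand, and the Tf-Idf formula is the plain textbook variant.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][term] is the Tf-Idf weight of term in doc i.

    Hypothetical helper: whitespace-tokenized unigrams, tf = count / doc length,
    idf = ln(n_docs / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each unigram appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return sorted(df), rows


def guess_positions(docs, x_term, y_term):
    """My guess: a document's (x, y) is its Tf-Idf weight for the two axis terms."""
    _, rows = tfidf_matrix(docs)
    return [(row.get(x_term, 0.0), row.get(y_term, 0.0)) for row in rows]


docs = ["taxes jobs jobs", "health care taxes", "jobs health"]
positions = guess_positions(docs, "jobs", "health")
for doc, (x, y) in zip(docs, positions):
    print(f"{doc!r}: x={x:.3f}, y={y:.3f}")
```

If the real layout works differently (for example, if the axes are not individual terms at all), that is exactly the part I would like to understand.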

Besides, I don't understand why we need to provide a category in this case. I understand that it uses the category to color the points, but is it used for anything else? If that is all, it seems like a hard requirement on Document-Based Scatterplots for something one may not need. But I guess I am missing something.

fredguth commented 5 years ago

Related to the previous question, how can I find out which terms were used as the axes?