datalab-dev / quintessence_web_app

Repository for the Quintessence Web Project applying Topic Models and Word Embeddings to EEBO-TCP
http://quintessence.ds.lib.ucdavis.edu/

Add information on hover of topics in scatter plot #10

Closed avkoehl closed 3 years ago

avkoehl commented 5 years ago

For the topic model visualization, hovering over a topic should display the following information (we may add a setting to choose which of these items are actually shown):

- topic proportion as a percent of the corpus
- top keywords associated with the topic
- top authors associated with the topic
- top words for that topic
- topic cluster (if available)

avkoehl commented 4 years ago

closely related to #31

sampizelo commented 4 years ago

Foreach topic: for unique(meta_column): mean(topic_doc score), sort descending, paste(top 5, sep="</br>") -> cbind to LDA_vis table.

Column for the following:
1) Top 5 Authors/Topic
2) Top 5 Locations/Topic
3) Top 5 Keywords/Topic
4) Top 5 Publishers/Topic
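The procedure described above can be sketched roughly as follows. This is an illustrative Python version, not the project's own code: the function name `top_values_per_topic`, the list-of-lists layout for the document-topic scores, and the `<br>` separator are all assumptions made for the example.

```python
from collections import defaultdict


def top_values_per_topic(doc_topics, meta_values, n=5, sep="<br>"):
    """For each topic, rank metadata values by their mean topic score.

    doc_topics:  one row per document, one score per topic (assumed layout).
    meta_values: one metadata value per document (e.g. author or location).
    Returns one joined string per topic, ready to bind as a table column.
    """
    n_topics = len(doc_topics[0])

    # Group document row indices by metadata value.
    docs_by_value = defaultdict(list)
    for i, value in enumerate(meta_values):
        docs_by_value[value].append(i)

    columns = []
    for t in range(n_topics):
        # Mean score of topic t over all documents sharing a metadata value.
        means = {
            value: sum(doc_topics[i][t] for i in rows) / len(rows)
            for value, rows in docs_by_value.items()
        }
        # Sort descending and keep the top n values.
        ranked = sorted(means, key=means.get, reverse=True)[:n]
        columns.append(sep.join(ranked))
    return columns
```

Calling it once per metadata column (authors, locations, keywords, publishers) would yield the four columns listed above.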

sampizelo commented 4 years ago

One other thing I realized we'll need to do for this: Set a limit for how many characters to paste before adding an ellipsis. I think 25 characters seems like a reasonable limit.
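A minimal sketch of that character limit, in Python for illustration. The helper name `truncate` and the choice to strip trailing whitespace before appending the ellipsis are assumptions, not something settled in this thread.

```python
def truncate(text, limit=25, ellipsis="..."):
    """Cut text to at most `limit` characters, then append an ellipsis."""
    if len(text) <= limit:
        return text  # short enough, leave unchanged
    # Drop trailing whitespace so the ellipsis does not follow a space.
    return text[:limit].rstrip() + ellipsis
```

The limit is a parameter so that a different cutoff can be tried per column without changing the helper.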

cnagda commented 4 years ago

@sampizelo Should mean(topic_doc_score) be the sum of scores over the number of nonzero scores? Also wondering if doc_topics should be normalized or smoothed before doing this.

sampizelo commented 4 years ago

@cnagda that’s a good point, I forgot we’re working with a lot of zeroes this time. I like sum of scores over non-zero, because it allows for unrelated docs to be excluded, for instance with London there are so many that depending on how we score it, it could be first or last in every category. I’m not sure what normalization would involve, but I trust your instincts. If you think it will improve the info then let’s do it.

cnagda commented 4 years ago

@sampizelo I left them not normalized (each cell is the number of words of each topic in the document), and it does seem that summing over all, as opposed to summing over nonzero, means locations like London show up in every topic. Here's what it looks like for the first few topics.

Summing over all:

     top_locations
V1   ["london","oxford","edinburgh","amsterdam","sl"]
V2   ["london","oxford","cambridge","edinburgh","sl"]
V3   ["london","oxford","cambridge","edinburgh","amsterdam"]
V4   ["london","oxford","sl","edinburgh","savoy"]
V5   ["london","cambridge","edinburgh","oxford","london ie savoy"]
V6   ["london","oxford","cambridge","edinburgh","sl"]

Summing over nonzero:

     top_locations
V1   ["villa franca","belfast","reading","rhemes ie london","leiden"]
V2   ["dort","europe","doway ie england","paris ie saint-omer","doway ie lancashire"]
V3   ["geneva","reims","marlborow ie antwerp","aire","southwark"]
V4   ["np","antipides ie dordrecht","catuapoli ie douai","savoy ie london","london ie savoy"]
V5   ["collen ie paris","paris ie saint-omer","striveling ie london","london ie savoy","delft ie london"]
V6   ["reims","marlborow ie antwerp","geneva","southwark","aire"]
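The difference between the two scorings can be reproduced on toy data. The sketch below is Python for illustration only; the counts are made up, and it assumes "summing over nonzero" means the sum of scores divided by the number of nonzero scores, as proposed earlier in the thread.

```python
def score_sum_all(scores):
    """Total topic word count across every document for a location."""
    return sum(scores)


def score_mean_nonzero(scores):
    """Sum of scores divided by the number of documents with a nonzero score."""
    nonzero = [s for s in scores if s != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


# Made-up word counts for a single topic, grouped by publication location.
# "london" has many documents with small counts; the others have few documents.
topic_counts = {
    "london": [3, 0, 2, 0, 1, 4, 0, 2],
    "geneva": [9, 0],
    "reims": [7],
}

by_sum = sorted(
    topic_counts, key=lambda loc: score_sum_all(topic_counts[loc]), reverse=True
)
by_nonzero = sorted(
    topic_counts, key=lambda loc: score_mean_nonzero(topic_counts[loc]), reverse=True
)
```

With these numbers the plain sum ranks the location with the most documents first, while the nonzero mean favors locations whose few documents are concentrated in the topic.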

sampizelo commented 4 years ago

@cnagda Wow, that's awesome! You're definitely right - the second one seems way more useful. I'm really excited to see the final product.

cnagda commented 4 years ago

Topic model table for phase1+2 on datasci has columns for top authors, locations, keywords, publishers now. Left each in the form of a JSON array. Should be easy to add hover after database updates happen.
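Turning one of those JSON array cells into hover text could look something like this. It is a Python sketch under assumptions: the helper name `hover_html` is hypothetical, and joining with `<br>` and HTML-escaping each entry are choices made for the example rather than details from the app.

```python
import json
from html import escape


def hover_html(json_array, sep="<br>"):
    """Decode a JSON array string and join its entries for an HTML tooltip."""
    items = json.loads(json_array)
    # Escape each entry so names containing & or < cannot break the markup.
    return sep.join(escape(str(item)) for item in items)
```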

cnagda commented 4 years ago

@sampizelo @avkoehl How should this be formatted? Way too big at the moment. [image] [image]

sampizelo commented 4 years ago

What is this? A hover text for ants? It needs to be at least three times this size.

sampizelo commented 4 years ago

Ok so a couple of things:

1) This is so exciting to see! I can't wait for the final version. Great work on the implementation.

2) The keywords are actually delimited by "--", so you're actually displaying the concatenated keywords for each document, which is going to distort the final result by a lot (nearly every document will have a unique combination of 3 keyword categories). For my personal version, I created a new table that split each keyword string on "--" and then summed by unique keyword entry. We can have a conversation about this if I'm not explaining myself well.

3) The goal is to have buttons on the side that allow you to select one of these four options for hover text, and the text will change based on which option is selected. That should mostly resolve the massive text box issue.

4) There shouldn't be that many topics that have long lists of authors in the author category, so I'm inclined to leave it that way to keep full information available, but you could also limit to something like 80 characters and then add ellipses "..." if it's out of hand in a lot of topics. Same goes for publisher. Limit to 80 characters and see what it looks like.
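The keyword fix in point 2 might be sketched as below. This is illustrative Python, not the table-based version mentioned above; the function name `keyword_totals` and the parallel-list inputs are assumptions.

```python
from collections import Counter


def keyword_totals(doc_keywords, doc_scores, delimiter="--"):
    """Split each document's keyword string and sum scores per unique keyword.

    doc_keywords: one delimited keyword string per document.
    doc_scores:   that document's score for the topic being summarized.
    """
    totals = Counter()
    for keywords, score in zip(doc_keywords, doc_scores):
        for keyword in keywords.split(delimiter):
            keyword = keyword.strip()
            if keyword:  # skip empty pieces from stray delimiters
                totals[keyword] += score
    return totals
```

`totals.most_common(5)` then gives the top five individual keywords for the topic instead of the top five concatenated strings.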

cnagda commented 4 years ago

Thanks! That should fix everything. I definitely missed some code and didn't split the keywords but I think I found it and it should be an easy fix. The publisher list seems to get a lot longer so the 80 character limit should do the trick. I'll also put some buttons beneath the search boxes for now.

sampizelo commented 4 years ago

Awesome!

cnagda commented 4 years ago

Working with ugly radio buttons now. This is what it looks like: [image]

sampizelo commented 4 years ago

Wow, yeah, this looks great!