Closed avkoehl closed 3 years ago
closely related to #31
For each topic: for unique(meta_column): mean(topic_doc score), sort descending, paste(top 5, sep="<br/>") -> cbind to LDA_vis table.
Column for the following: 1) Top 5 Authors/Topic 2) Top 5 Locations/Topic 3) Top 5 Keywords/Topic 4) Top 5 Publishers/Topic
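A minimal sketch of the aggregation described above, written in plain Python rather than taken from the project's code. The names `top_values_for_topic`, `doc_topics`, and `meta` are hypothetical: `doc_topics` is assumed to hold one row of per-topic scores per document, and `meta` the matching metadata value (author, location, and so on) for each document. The toy data is made up.

```python
from collections import defaultdict

def top_values_for_topic(doc_topics, meta, topic, n=5, sep="<br/>"):
    """Rank metadata values by mean topic score and join the top n."""
    scores = defaultdict(list)
    for row, value in zip(doc_topics, meta):
        scores[value].append(row[topic])
    means = {value: sum(vals) / len(vals) for value, vals in scores.items()}
    # Sort by descending mean; ties are broken alphabetically for stable output.
    ranked = sorted(means, key=lambda v: (-means[v], v))
    return sep.join(ranked[:n])

# Toy data (illustrative only): 4 documents, 2 topics.
doc_topics = [[10, 0], [0, 4], [6, 0], [0, 8]]
locations = ["london", "oxford", "london", "leiden"]

# One hover string per topic, ready to attach as a new column.
top_locations = [top_values_for_topic(doc_topics, locations, t) for t in range(2)]
```

Running the same function once per metadata column (authors, locations, keywords, publishers) would give the four columns listed above.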
One other thing I realized we'll need to do for this: Set a limit for how many characters to paste before adding an ellipsis. I think 25 characters seems like a reasonable limit.
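The character limit could look something like this. It is a sketch only: the helper name `truncate` is hypothetical, and it assumes the limit applies to the text before the ellipsis is appended (whether the ellipsis should count toward the limit is not settled here).

```python
def truncate(text, limit=25, ellipsis="..."):
    """Return text unchanged if it fits, else cut it at `limit` characters and add an ellipsis."""
    if len(text) <= limit:
        return text
    # Drop trailing whitespace left by the cut so the ellipsis sits against a word.
    return text[:limit].rstrip() + ellipsis

short = truncate("london")
long_entry = truncate("a" * 30)
```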
@sampizelo Should mean(topic_doc_score) be the sum of scores over the number of nonzero scores? Also wondering if doc_topics should be normalized or smoothed before doing this.
@cnagda that’s a good point, I forgot we’re working with a lot of zeroes this time. I like sum of scores over non-zero, because it allows for unrelated docs to be excluded, for instance with London there are so many that depending on how we score it, it could be first or last in every category. I’m not sure what normalization would involve, but I trust your instincts. If you think it will improve the info then let’s do it.
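To make the two scoring options concrete, here is a small sketch with made-up numbers. The function names are hypothetical, and the scores are assumed to be raw per-document word counts for a single topic.

```python
def mean_all(scores):
    """Mean over every document, zeros included."""
    return sum(scores) / len(scores) if scores else 0.0

def mean_nonzero(scores):
    """Sum of scores divided by the number of nonzero scores."""
    nonzero = [s for s in scores if s != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

# Illustrative scores for one topic, grouped by location.
london = [12, 0, 0, 0, 0, 0, 0, 3]  # many documents, most unrelated to the topic
leiden = [9]                        # a single, strongly related document

london_total = sum(london)             # 15: a plain sum favours prolific locations
london_mean = mean_all(london)         # 1.875: zeros drag the mean down
london_nonzero = mean_nonzero(london)  # 7.5: unrelated documents are excluded
leiden_nonzero = mean_nonzero(leiden)  # 9.0
```

With these numbers London outranks Leiden on a plain sum, but Leiden comes first once only nonzero documents are averaged.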
@sampizelo I left them unnormalized (each cell is the number of words of each topic in the document), and it does seem that summing over all documents, rather than over the nonzero ones, means locations like London show up in every topic. Here's what it looks like for the first few topics.
Summing over all:
top_locations
V1 ["london","oxford","edinburgh","amsterdam","sl"]
V2 ["london","oxford","cambridge","edinburgh","sl"]
V3 ["london","oxford","cambridge","edinburgh","amsterdam"]
V4 ["london","oxford","sl","edinburgh","savoy"]
V5 ["london","cambridge","edinburgh","oxford","london ie savoy"]
V6 ["london","oxford","cambridge","edinburgh","sl"]

Summing over nonzero:
top_locations
V1 ["villa franca","belfast","reading","rhemes ie london","leiden"]
V2 ["dort","europe","doway ie england","paris ie saint-omer","doway ie lancashire"]
V3 ["geneva","reims","marlborow ie antwerp","aire","southwark"]
V4 ["np","antipides ie dordrecht","catuapoli ie douai","savoy ie london","london ie savoy"]
V5 ["collen ie paris","paris ie saint-omer","striveling ie london","london ie savoy","delft ie london"]
V6 ["reims","marlborow ie antwerp","geneva","southwark","aire"]
@cnagda Wow, that's awesome! You're definitely right - the second one seems way more useful. I'm really excited to see the final product.
The topic model table for phase1+2 on datasci now has columns for top authors, locations, keywords, and publishers. I left them in the form of JSON arrays. It should be easy to add the hover text after the database updates happen.
@sampizelo @avkoehl How should this be formatted? Way too big at the moment
What is this? A hover text for ants? It needs to be at least three times this size.
Ok so a couple of things:
1) This is so exciting to see! I can't wait for the final version. Great work on the implementation.
2) The keywords are actually delimited by "--", so you're currently displaying the concatenated keywords for each document, which is going to distort the final result by a lot (nearly every document will have a unique combination of 3 keyword categories). For my personal version, I created a new table that split each keyword string on "--" and then summed by unique keyword entry. We can have a conversation about this if I'm not explaining myself well.
3) The goal is to have buttons on the side that allow you to select one of these four options for hover text, and the text will change based on which option is selected. That should mostly resolve the massive text box issue.
4) There shouldn't be that many topics that have long lists of authors in the author category, so I'm inclined to leave it that way to keep full information available, but you could also limit to something like 80 characters and then add ellipses "..." if it's out of hand in a lot of topics. Same goes for publisher. Limit to 80 characters and see what it looks like.
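The keyword split in point 2 might be sketched as follows. This is an illustration, not the table-building code mentioned above: `keyword_scores` is a hypothetical helper, the keyword strings and scores are invented, and each document's topic score is assumed to be credited in full to every keyword it carries.

```python
from collections import defaultdict

def keyword_scores(doc_keywords, topic_scores, delimiter="--"):
    """Split each document's keyword string and sum topic scores per unique keyword."""
    totals = defaultdict(float)
    for keywords, score in zip(doc_keywords, topic_scores):
        for keyword in keywords.split(delimiter):
            keyword = keyword.strip()
            if keyword:  # skip empty pieces from stray delimiters
                totals[keyword] += score
    return dict(totals)

# Illustrative data: one keyword string and one topic score per document.
doc_keywords = ["religion--sermons--england", "religion--poetry", "poetry"]
topic_scores = [5, 2, 1]
totals = keyword_scores(doc_keywords, topic_scores)
```

Without the split, the three documents above would count as three distinct "keywords", which is the distortion described in point 2.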
Thanks! That should fix everything. I definitely missed some code and didn't split the keywords, but I think I found the spot and it should be an easy fix. The publisher list seems to get a lot longer, so the 80-character limit should do the trick. I'll also put some buttons beneath the search boxes for now.
Awesome!
Working with ugly radio buttons now. This is what it looks like:
Wow, yeah, this looks great!
For the topic model visualization, when hovering over a topic the following information should be displayed (we may have some setting to determine which of these pieces of information you actually want to see):
- topic proportion as percent of corpus
- top key words associated with the topic
- top authors associated with the topic
- top words for that topic
- topic cluster (if available)
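One way the selectable hover text could be assembled is sketched below. Everything here is an assumption for illustration: the `build_hover_text` helper, the field names, and the sample topic record are hypothetical and not taken from the actual visualization code.

```python
def build_hover_text(topic, fields, sep="<br/>"):
    """Build hover text from only the selected fields, skipping any that are missing."""
    lines = []
    for field in fields:
        value = topic.get(field)
        if value is None:
            continue  # e.g. no cluster has been assigned to this topic
        if field == "proportion":
            value = f"{value:.1%}"  # show the corpus share as a percentage
        elif isinstance(value, list):
            value = ", ".join(value)
        lines.append(f"{field}: {value}")
    return sep.join(lines)

# Illustrative topic record with invented values.
topic = {
    "proportion": 0.042,
    "top_authors": ["author a", "author b"],
    "top_words": ["king", "church"],
    "cluster": None,
}
hover = build_hover_text(topic, ["proportion", "top_authors", "cluster"])
```

The `fields` argument stands in for whatever setting or button selection ends up controlling which information is shown.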