datalab-dev / quintessence_analysis

All the scripts we use for analysis

Create ldavis for the model with 75 topics and 90 topics #4

Closed (avkoehl closed 4 years ago)

avkoehl commented 4 years ago

Motivation

We want to be able to look at the outputs of the two models we just ran.

Task

Use malletparse or RMallet to read the results of the topic model. Then create the LDAvis JSON (try not to rearrange topics by size) and add it to the directory hosted at: https://datalab.ucdavis.edu/text-reports/archive_text_reports/quintessence/lda/
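A minimal sketch of the "read the results" step, in Python rather than the R tools named above. It assumes the model's term weights were exported as tab-separated `topic<TAB>word<TAB>weight` rows (the layout MALLET writes for its topic-word-weights file); the function name and the sample rows are made up for illustration. Topics stay in MALLET's id order, which is the point of "not rearranging by size". When building the visualization itself, LDAvis's `createJSON` takes a `reorder.topics` argument and pyLDAvis's `prepare` takes `sort_topics`; setting either to false should keep this order.

```python
from collections import defaultdict


def read_topic_word_weights(lines):
    """Parse 'topic<TAB>word<TAB>weight' rows into per-topic term distributions."""
    weights = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        topic, word, weight = line.split("\t")
        weights[int(topic)][word] = float(weight)

    # Shared vocabulary, sorted so that column order is reproducible.
    vocab = sorted({w for row in weights.values() for w in row})

    phi = []
    for topic in sorted(weights):  # keep topic ids in order, not sorted by size
        row = [weights[topic].get(w, 0.0) for w in vocab]
        total = sum(row)
        phi.append([x / total for x in row])  # each row sums to 1
    return vocab, phi


# Hypothetical sample rows, not taken from the real model output.
sample = ["0\tking\t3.0\n", "0\tgod\t1.0\n", "1\tking\t1.0\n", "1\tgod\t1.0\n"]
vocab, phi = read_topic_word_weights(sample)
print(vocab)
print(phi)
```

With a real export, pass an open file handle instead of the sample list.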

avkoehl commented 4 years ago

Next step will be creating the topic model tables in sql when we have decided on a model.

cnagda commented 4 years ago

Don't have permissions to add to the directory

avkoehl commented 4 years ago

Should be good now

cnagda commented 4 years ago

Added ldavis for 75 and 90 topics

avkoehl commented 4 years ago

@sampizelo What do we think? 90 or 75 topics?

sampizelo commented 4 years ago

@avkoehl I think 75 for now. It wasn't quite as "optimal" as 90, but it was still a clear breakpoint. It's also going to be much less visually cluttered, and I think it will make a lot more sense to people who are newer to topic models.

sampizelo commented 4 years ago

@avkoehl And looking at the LDAvis now... is it possible to just not show topic 59 at all, and make it a 74-topic model? I'm not sure what difficulties that would cause, but it's all in Latin, so we don't really want to see it anyway, and it's hugely skewing our plot. (Also FWIW - they aren't just Latin words in general, but Latin stop words specifically - est, qui, quod, cum, hoc, etc. We would still be leaving some other topics with useful Latin words in them).
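One way the "just don't show topic 59" idea could work, sketched under assumptions: drop that topic's row from the topic-term matrix and its column from the document-topic matrix, then rescale each document's remaining proportions so they sum to 1 again. The function name and the toy matrices are hypothetical. Also note that the index is whatever position the topic has in your matrices; the number shown in LDAvis may be 1-based while MALLET's ids start at 0, so check before dropping.

```python
def drop_topic(phi, theta, topic_index):
    """Remove one topic and renormalize each document's topic proportions.

    phi:   list of topics, each a list of term probabilities
    theta: list of documents, each a list of topic proportions
    """
    new_phi = [row for k, row in enumerate(phi) if k != topic_index]

    new_theta = []
    for doc in theta:
        kept = [p for k, p in enumerate(doc) if k != topic_index]
        total = sum(kept)
        if total > 0:
            new_theta.append([p / total for p in kept])
        else:
            # Document consisted entirely of the dropped topic: fall back to uniform.
            new_theta.append([1.0 / len(kept)] * len(kept))
    return new_phi, new_theta


# Toy example: 3 topics, 2 terms, 2 documents (made-up numbers).
phi = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
theta = [[0.5, 0.25, 0.25], [0.0, 1.0, 0.0]]
new_phi, new_theta = drop_topic(phi, theta, topic_index=1)
print(new_phi)
print(new_theta)
```

This only hides the topic in the plot. The words assigned to it were still part of training, so filtering the Latin stop words and rerunning the model would be the cleaner fix.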

sampizelo commented 4 years ago

I can also rerun the clustering optimizer on topic terms without #59 and see if that changes anything.

sampizelo commented 4 years ago

For future reference (not sure where to put this): it looks like the stopwords R package supports Latin stopwords as well: https://www.rdocumentation.org/packages/stopwords/versions/1.0. We should consider incorporating Latin, French, Irish, Scottish, and German stopword filters into our workflow at some point in the future (not a priority).
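A rough sketch of what a multi-language stopword filter could look like in a preprocessing step. The Latin entries are the ones mentioned earlier in this thread; the French entries are a tiny placeholder. A real workflow would load full lists from a maintained source rather than hard-coding them, and the names `STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# Placeholder lists for illustration only; real lists would be much longer.
STOPWORDS = {
    "la": {"est", "qui", "quod", "cum", "hoc"},  # Latin words noted in this thread
    "fr": {"le", "la", "et", "de"},              # small French sample
}


def remove_stopwords(tokens, languages):
    """Drop tokens found in the stopword list of any requested language."""
    blocked = set().union(*(STOPWORDS[lang] for lang in languages))
    return [t for t in tokens if t.lower() not in blocked]


tokens = "Hoc est corpus quod cum gratia".split()
filtered = remove_stopwords(tokens, ["la", "fr"])
print(filtered)
```

Merging the lists into one set keeps the per-token check cheap, but it also means a word that is a stopword in one language gets removed even where it is meaningful in another, so the set of languages is worth choosing per corpus.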