datalab-dev / quintessence_analysis

All the scripts we use for analysis

Create table for topic proportions #6

Closed avkoehl closed 4 years ago

avkoehl commented 4 years ago

Motivation

We need to update the topic proportions for the new topic models graph. These are the topic proportions for the full corpus, computed the same way LDAvis computes them:

topic_freq = colSums(doc_topics * doc_lens)
topic_proportions = topic_freq / sum(topic_freq)
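
A small worked example may make the weighting clearer. This is only a sketch on made-up data: the toy `doc_topics` matrix (per-document topic proportions, one row per document) and the `doc_lens` values are assumptions for illustration, not numbers from our corpus.

```r
# Toy inputs (hypothetical): 3 documents, 2 topics.
# Each row of doc_topics is a document's topic proportions and sums to 1.
doc_topics <- matrix(c(0.5, 0.5,
                       0.2, 0.8,
                       1.0, 0.0),
                     nrow = 3, byrow = TRUE)
doc_lens <- c(10, 20, 30)  # tokens per document (made up)

# R recycles doc_lens down each column, so row i is scaled by doc_lens[i].
# The result is the estimated number of tokens per topic in each document.
topic_freq <- colSums(doc_topics * doc_lens)

# Share of all tokens in the corpus assigned to each topic.
topic_proportions <- topic_freq / sum(topic_freq)
print(topic_proportions)
```

Weighting by document length means long documents contribute more to a topic's share than short ones, which a plain column mean of `doc_topics` would not capture.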

Task

cnagda commented 4 years ago

The topic_proportion table is updated in datasci now. I added the code for it to mallet_tm_to_sql.R. It is the same as the code above, except that it doesn't multiply by doc_lens, because mallet's doc_topics already holds frequencies rather than proportions.
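
For reference, the count-based variant described in this comment could look roughly like the sketch below. It assumes `doc_topics` holds raw per-document token counts per topic; the toy matrix and the variable names are hypothetical, and this is not the actual code in mallet_tm_to_sql.R.

```r
# Toy inputs (hypothetical): 3 documents, 2 topics, raw token counts per topic.
doc_topic_counts <- matrix(c( 5,  5,
                              4, 16,
                             30,  0),
                           nrow = 3, byrow = TRUE)

# With counts, the document-length weighting is already baked in,
# so the column sums are the topic frequencies directly.
topic_freq <- colSums(doc_topic_counts)
topic_proportions <- topic_freq / sum(topic_freq)

# Cross-check against the proportion-based formula: convert the counts to
# row proportions, then weight by document length (here, the row totals).
doc_lens <- rowSums(doc_topic_counts)
doc_topic_props <- doc_topic_counts / doc_lens
check <- colSums(doc_topic_props * doc_lens)
check <- check / sum(check)
print(all.equal(topic_proportions, check))
```

The two routes agree only when each row of counts sums to the document length being used, which is worth verifying for the real matrices.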

cnagda commented 4 years ago

Doesn't match LDAvis:

topic topic_proportion ldavis
V1 0.0140469427705953 0.006429357
V2 0.0144728679318146 0.01230919
V3 0.0237056319470436 0.02251543
V4 0.014286761103971 0.007561433
V5 0.0208729339364131 0.04697035
V6 0.0201209374707088 0.02219469
V7 0.00881664336638489 0.01590796
V8 0.00932888244515094 0.007222459
V9 0.00640227383844603 0.003496074
V10 0.0141274766066537 0.02299951
V11 0.00770791364010918 0.01199015
V12 0.0108863706205394 0.01918798
V13 0.0138755790825756 0.009039159
V14 0.0264225022362821 0.01703177
V15 0.0074993038147429 0.007228903
V16 0.0254177255061635 0.01867842
V17 0.0301945578574636 0.02880095
V18 0.00698947703529193 0.005128058
V19 0.00682013047775079 0.007520889
V20 0.0133322651847597 0.01003781
V21 0.0144454675931395 0.03327913
V22 0.00863066826148683 0.005538213
V23 0.0188029255422406 0.01187994
V24 0.018775672654624 0.01475935
V25 0.0118872389146039 0.01043374
V26 0.0099303392443068 0.008802691
V27 0.0102712244651236 0.009430731
V28 0.00403399947737054 0.01143916
V29 0.0216767171798342 0.01707382
V30 0.00937696328626714 0.008478399
V31 0.0117284144645887 0.009556778
V32 0.00758063388651649 0.01035187
V33 0.0140774435135245 0.01853502
V34 0.0171806198476029 0.01809397
V35 0.0174895042567233 0.009598752
V36 0.0100248022903467 0.01595963
V37 0.00884071131114055 0.004460869
V38 0.00894147543240849 0.007666314
V39 0.0141675832945327 0.01131902
V40 0.0079482549332386 0.007224388
V41 0.00962150005351097 0.007210551
V42 0.0172221971140376 0.0190041
V43 0.0171639952322742 0.01888237
V44 0.0138349081486464 0.01049481
V45 0.00987995423465166 0.009441511
V46 0.00586454342070743 0.01068514
V47 0.0195939237958932 0.02035888
V48 0.019128562357607 0.008396824
V49 0.0169857760530469 0.009945937
V50 0.0280600976226666 0.03256612
V51 0.00994556209157384 0.005294785
V52 0.00575317855134911 0.004961739
V53 0.0124979870964517 0.01423823
V54 0.00780712657624297 0.005786062
V55 0.0141625994487595 0.00774025
V56 0.0178488896703314 0.01631112
V57 0.018510526161447 0.02342016
V58 0.00822112786187464 0.00364913
V59 0.0189233380085411 0.01907004
V60 0.0125378126443128 0.00815129
V61 0.0163592859962065 0.03461295
V62 0.0243410418099088 0.02541088
V63 0.0228920481232756 0.01417038
V64 0.00606018541699281 0.01422526
V65 0.00702214825780047 0.007624669
V66 0.00938867090030243 0.006364135
V67 0.0236325394743981 0.02141724
V68 0.008949614730831 0.01490856
V69 0.00154656106895553 0.001021729
V70 0.0041829840267566 0.003633223
V71 0.008558643334504 0.008490964
V72 0.00894664998155053 0.005604691
V73 0.00504183335943283 0.003452381
V74 0.0158015435208247 0.01422761
V75 0.012574807131853 0.02309395

cnagda commented 4 years ago

@avkoehl createJSON() expects doc topics and topic terms to be normalized so that all rows sum to 1. After normalizing both, the topic proportions we calculated match LDAvis.
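
A minimal sketch of that normalization step, assuming the doc-topic matrix starts out as unnormalized weights. The helper `normalize_rows`, the toy matrix, and the `doc_lens` values are hypothetical and only illustrate the idea; the same helper would apply to the topic-term matrix.

```r
# Hypothetical helper: divide each row by its total so every row sums to 1.
# A row that sums to 0 would yield NaN, so empty documents need handling first.
normalize_rows <- function(m) m / rowSums(m)

# Toy unnormalized doc-topic weights (made up): 3 documents, 2 topics.
doc_topic_weights <- matrix(c(2, 2,
                              1, 4,
                              3, 0),
                            nrow = 3, byrow = TRUE)
doc_lens <- c(10, 20, 30)  # tokens per document (made up)

theta <- normalize_rows(doc_topic_weights)

# Confirm every row now sums to 1, as createJSON() expects.
stopifnot(all(abs(rowSums(theta) - 1) < 1e-12))

# Same length-weighted formula as in the issue description.
topic_freq <- colSums(theta * doc_lens)
topic_proportions <- topic_freq / sum(topic_freq)
print(topic_proportions)
```

Running the row-sum check before writing to the table is a cheap way to catch a matrix that was passed in as counts instead of proportions.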

avkoehl commented 4 years ago

Okay, perfect, go ahead and write to the table if you haven't already. Once the table for topic proportions has been overwritten with these new values on datasci, go ahead and close this issue!