dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org
852 stars · 136 forks

Topic coherence #229

Closed · dselivanov closed this 6 years ago

dselivanov commented 6 years ago

Implement measures for topic coherence. For example see introduction here.

Proposals are welcome.

manuelbickel commented 6 years ago

Below are some starting points from the stm and textmineR packages (you probably already know all of them; if so, please skip or feel free to delete this post). Both seek to overcome the drawback of the measure of Mimno et al. (the intrinsic UMass measure) mentioned in the post above, namely that it overrates topics containing common words, because its approach is based on co-occurrence correlation rather than statistical dependence. I have not yet compared the results of the different approaches qualitatively (i.e., by manual examination), but I will share my results as soon as they are available (I am currently fitting models on 30,000 scientific abstracts, and my machine is not very fast...).

Tommy Jones proposed a measure in textmineR; see around lines 65-78 in CalcProbCoherence.R and his explanations in an issue I raised regarding his proposal for probabilistic topic coherence.

The stm package uses another measure in semanticCoherence.R to counter the drawbacks of the approach of Mimno et al. It is based on exclusivity.R and "frex" (see lines 93-103 in STMfunctions.R); for the theoretical details, see Bischof and Airoldi, "Summarizing topical content with word frequency and exclusivity".
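For reference, the UMass measure of Mimno et al. discussed above can be sketched in a few lines of R. This is a minimal illustration, not the textmineR or stm implementation; the input `doc_counts` (document co-occurrence counts with document frequencies on the diagonal) and the smoothing constant `eps` are assumptions made for the example.

```r
# Minimal sketch of the UMass coherence of Mimno et al. for one topic.
# `doc_counts` is assumed to be a symmetric matrix of document-level
# co-occurrence counts, with each word's document frequency on the
# diagonal; `top_words` is a character vector of the topic's top N
# terms (names must match the matrix dimnames).
umass_coherence <- function(top_words, doc_counts, eps = 1) {
  score <- 0
  n <- length(top_words)
  for (i in 2:n) {
    for (j in 1:(i - 1)) {
      wi <- top_words[i]
      wj <- top_words[j]
      # log of smoothed co-document frequency over document frequency
      score <- score + log((doc_counts[wi, wj] + eps) / doc_counts[wj, wj])
    }
  }
  score
}

# Toy example: 3 words, diagonal = document frequencies
m <- matrix(c(10, 4, 2,
              4,  8, 1,
              2,  1, 5), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
umass_coherence(c("a", "b", "c"), m)  # log(5/10) + log(3/10) + log(2/8)
```

Scores are sums of log ratios, so they are negative; topics whose top words frequently co-occur in documents score closer to zero.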

manuelbickel commented 6 years ago

Another summary of current approaches to coherence (from 2015), which also includes an approach based on normalized PMI: Röder, Both, et al., "Exploring the Space of Topic Coherence Measures", doi:10.1145/2684822.2685324. Is this accessible to you? (I am currently accessing it from within a university network.)

dselivanov commented 6 years ago

Thank you! I was able to download the article.


manuelbickel commented 6 years ago

Furthermore, in this context, this study by Lau, Newman, and Baldwin on NPMI-based evaluation might be interesting: "Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality".
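For orientation, the NPMI statistic used in these papers normalizes PMI into [-1, 1]: 0 means the two words are statistically independent, 1 means they always co-occur. A minimal sketch in R, assuming smoothed probability estimates from some reference corpus (the function name and `eps` are illustrative, not from any of the packages discussed):

```r
# Sketch of NPMI between two words: PMI divided by -log of the
# joint probability, which bounds the score to [-1, 1].
# p_ij is the (smoothed) joint probability of co-occurrence,
# p_i and p_j the marginal probabilities of each word.
npmi <- function(p_ij, p_i, p_j, eps = 1e-12) {
  pmi <- log((p_ij + eps) / (p_i * p_j))
  pmi / -log(p_ij + eps)
}

npmi(0.1, 0.2, 0.3)    # positive: co-occurrence above chance
npmi(0.06, 0.2, 0.3)   # ~0: p_ij equals p_i * p_j (independence)
npmi(0.2, 0.2, 0.2)    # ~1: the words always co-occur
```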

manuelbickel commented 6 years ago

Found something hopefully helpful: the Java-based Palmetto tool, which implements several coherence measures. Its documentation of the functions for calculating the various measures may support an implementation in R, which could build on the implementations linked above in the stm package or textmineR (these still need to be parallelized).

manuelbickel commented 6 years ago

I am currently trying to implement some coherence measures on the basis of the paper by Röder ("Exploring the Space..."). My current attempt is on GitHub as calc_coherence.R, including some preliminary test_results.

I am not sure if I have implemented the asymmetric wi/wj combination correctly; at least, the results differ slightly from the stm package. Furthermore, the current attempt only covers S_one_one subsets (pairs of single words), not S_one_any, etc. (word vs. word-set comparisons). Some ideas on how to approach this are included as comments in calc_coherence.

Since I am not a professional or trained programmer, it is certainly not the most elegant solution, and there may still be some errors (especially concerning the ordering of the wi/wj indices). I just wanted to provide this as a potential starting point.
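To illustrate the segmentation terminology from Röder et al. used above, here is a small sketch (names and the toy word set are purely illustrative) of the difference between S_one_one, which pairs every top word with every other single word, and S_one_any, which compares each word against the set of all remaining top words:

```r
# Toy set of top words for one topic
top_words <- c("energy", "solar", "wind", "power")

# S_one_one: all unordered pairs of single words
s_one_one <- t(combn(top_words, 2))
colnames(s_one_one) <- c("wi", "wj")

# S_one_any: each word compared against the set of all other words
s_one_any <- lapply(seq_along(top_words), function(i) {
  list(wi = top_words[i], wj_set = top_words[-i])
})
```

For N top words, S_one_one yields N * (N - 1) / 2 pairs, while S_one_any yields N word-versus-set comparisons.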

dselivanov commented 6 years ago

Thanks for sharing! Let me try to take a look (probably on the weekend) at the paper and your code. After that we can figure out the best way to proceed.

manuelbickel commented 6 years ago

Sounds good! If, at first glance, you think I should make some adaptations before you dive deeper, let me know and I will do what I can (I want to spare your time). I would understand if you completely rewrote the whole thing given your expertise, but at least it might give you an idea of how to generally approach topic coherence. I will add a short comment at the beginning of the code that explains the general logic (i.e., a summary of the paper with regard to the programming steps).

dselivanov commented 6 years ago

Hi @manuelbickel. I've started with the article, but it seems it will require more time to understand the methodology. I very much like that it tries to generalize the approaches taken before. General comments regarding the code:

  1. It makes sense to decompose the function into several smaller functions.
  2. I think it is better to allow only one type of input:
    • a co-occurrence matrix (it doesn't matter whether it is a skip-gram or document-level co-occurrence)
    • require beta to be a top-N word probability matrix
    • provide "utility" functions that produce the co-occurrence matrix and beta in the desired format from the "native" formats of other packages (text2vec, stm, tm, ...). This is what is done at the beginning of the calc_coherence function.
  3. Prefer code readability over speed - if functions are not vectorized, I suggest preferring for loops over the *apply family. They are easier to reason about. Not everywhere, but where possible. Constructs like this are good candidates; I'm sure in a couple of weeks it will be very hard to understand what is going on:
     do.call(rbind,
            sapply(2:length(idxs), function(x) {
              cbind(wi = rep(idxs[x], length(idxs[1:length(idxs[1:(x-1)])]))
                    ,wj = idxs[1:length(idxs[1:(x-1)])])
            }, USE.NAMES = FALSE))
  4. Avoid names like lgrat_UMassep.01. It would be better to use longer names and make 0.01 the default value of a smoothing parameter in a function.
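As an illustration of point 3, the construct quoted there could be rewritten with an explicit for loop. This is only a sketch of one possible equivalent; the function name `make_pairs` is hypothetical:

```r
# Enumerate all (wi, wj) index pairs from `idxs`, where wj is any
# index that precedes wi - the same result the quoted sapply
# construct produces, but written as a plain loop.
make_pairs <- function(idxs) {
  pairs <- vector("list", length(idxs) - 1)
  for (x in 2:length(idxs)) {
    pairs[[x - 1]] <- cbind(wi = rep(idxs[x], x - 1),
                            wj = idxs[1:(x - 1)])
  }
  do.call(rbind, pairs)
}

make_pairs(c(10, 20, 30))
#      wi wj
# [1,] 20 10
# [2,] 30 10
# [3,] 30 20
```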

Overall, a lot of useful comments in the code - great job! I will continue my investigation.

I hope you will take my comments as friendly suggestions, not criticism :-)

manuelbickel commented 6 years ago

Thank you for having a look! I really appreciate your comments; since I am just a "self-trained" R user, you can certainly teach me a lot ;-). As a next step, I will try to incorporate the simplifications, separations, and improvements you have proposed so far, and will update you accordingly.

manuelbickel commented 6 years ago

I have updated the calc_coherence function following your comments above. It is still not perfect, but readability is higher now, and the input formatting has been moved to separate functions (the input is now a tcm and a top-term matrix as produced by get_top_words). Basic examples of the output (code and results as comments) are contained in calc_coherence_check_second_version, which also contains a partial validation of the UMass results against those produced by the stm package. The other measures may still have to be validated differently.

dselivanov commented 6 years ago

@manuelbickel looks good! It would be great if you could create a pull request. That will make it much easier to proceed with code review, etc.

dselivanov commented 6 years ago

Closed by #252