dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Coherence (PR2) #252

Closed manuelbickel closed 6 years ago

manuelbickel commented 6 years ago

Having discussed the initial approach for coherence in the first pull request, this is the second updated pull request to restart review, etc. on a cleaner basis.

Please note that the test file is not yet complete and has not been run with testthat yet; I will do that in the next few days. However, it already includes a direct comparison of two coherence metrics against the implementations in the stm and textmineR packages, showing that the results are equal, at least for one test case; I might have to try some more parameter variations.

manuelbickel commented 6 years ago

After having discussed the current version of this PR, I can add some additional metrics, e.g., NPMI or the cosine similarity of the top terms' NPMI vectors.
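For readers unfamiliar with NPMI: it normalizes PMI into [-1, 1]. A toy sketch in base R (illustrative names and smoothing only; this is not the PR's implementation):

```r
# NPMI for a single word pair from window-based probabilities:
#   p_ij = P(w_i, w_j), p_i = P(w_i), p_j = P(w_j).
# The smoothing constant is an illustrative assumption.
npmi <- function(p_ij, p_i, p_j, smooth = 1e-12) {
  pmi <- log2((p_ij + smooth) / (p_i * p_j))
  pmi / -log2(p_ij + smooth)
}

npmi(0.5, 0.5, 0.5)   # perfectly co-occurring words -> close to 1
npmi(0.25, 0.5, 0.5)  # independent words -> close to 0
```

Values near 1 indicate words that always co-occur, near 0 independence, and negative values words that repel each other.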

manuelbickel commented 6 years ago

@dselivanov Travis highlights an error in `1. Error: idir (@test-iterators.R#53)`. I am not sure what I did wrong; maybe a mistake when creating the branch? I have not touched any original files and have only added coherence.R and coherence-test.R. Maybe you have an idea where I went wrong and can point me to a solution?

dselivanov commented 6 years ago

I don't think it is related to the PR. I will try to check later today or tomorrow. I'm traveling again (http://textworkshop18.ropensci.org) :-)

manuelbickel commented 6 years ago

Thanks for the quick feedback, and enjoy your trip. You might directly ask David Mimno (author of the UMass metric paper, i.e., the logratio metric) or Brandon Stewart (stm package implementation of this metric) about recent experience with coherence; they are certainly more up to date in the field than I am ;-)

dselivanov commented 6 years ago

I'm finally at home - will review tomorrow.

manuelbickel commented 6 years ago

I hope you had an interesting trip and gained some insights; welcome back ;-). Please note that the example is not yet finished regarding the number of sliding windows (#253); I will update the example accordingly (thanks for your explanation!), but the rest should be fine. You might also have a look at the tests showing compliance with the stm and textmineR implementations.

manuelbickel commented 6 years ago

Thank you for reviewing and for your excellent comments, as always ;-). I will incorporate your suggestions and check the results again. Due to my schedule, I will probably only be able to start on that this weekend...

mmantyla commented 6 years ago

This is a very interesting looking contribution. Coherence is based on Wikipedia word co-occurrence, if I understand correctly, and will work well for the general case. Do you know whether this approach would be reliable when using topic modeling on a highly domain-specific corpus, for example scientific abstracts of machine learning papers? I would imagine that the word "learning" frequently co-occurs in Wikipedia with "teacher", "class", and "student", but in a machine learning context such co-occurrences would not make much sense.

I am trying to understand the boundaries of this solution and whether I could use it on some of my domain-specific corpora.

Keep up the good work!

manuelbickel commented 6 years ago

...first, I want to apologize: I could not keep my promise to update the code; busy times after moving house ;-).

Regarding the coherence metrics: I have a corpus of 30,000 scientific abstracts in the field of "sustainable energy", and at first sight some metrics seem to propose a reasonable number of topics that allows selecting suitable LDA models (I have not finished testing yet). Coherence metrics are certainly not perfect yet, but they are a start at least. You might have a look at the paper by Röder et al. and the Palmetto tool (Java); see the discussion of the first (closed) PR. I hope that helps...
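For context on the intrinsic UMass/logratio metric mentioned earlier, the core idea can be sketched in base R (toy document-frequency counts; names are illustrative, not the PR's code):

```r
# UMass-style logratio score for an ordered vector of top words:
#   sum over pairs of log((D(w_i, w_j) + 1) / D(w_j)),
# where doc_counts holds per-word document frequencies and cooc_counts a
# word-by-word co-document-frequency matrix. Illustrative sketch only.
umass_logratio <- function(top_words, doc_counts, cooc_counts) {
  score <- 0
  for (i in 2:length(top_words)) {
    for (j in 1:(i - 1)) {
      wi <- top_words[i]
      wj <- top_words[j]
      score <- score + log((cooc_counts[wi, wj] + 1) / doc_counts[wj])
    }
  }
  score
}
```

Less negative scores indicate top words that tend to appear in the same documents, which is why this metric needs no external reference corpus.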


manuelbickel commented 6 years ago

I have incorporated all your proposals; thanks for the lesson in matrix algebra in R. I have compared the results to my previous version "by hand" via a mini-example and also via the values generated with the test data used in the testthat file; they are all.equal.

I have currently kept the workaround to calculate the number of skip-gram windows in the roxygen example; I hope my understanding is now correct based on your answers in #253. This can be removed as soon as an automatic counter is introduced in create_tcm.
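The window-counting workaround can be sketched as follows (a hypothetical helper; the exact count depends on how create_tcm defines a sliding window, so treat this as an assumption rather than the package's behavior):

```r
# Count sliding windows of a given size across documents, assuming one
# window per position where a full window fits, and a single window for
# documents shorter than the window size. Illustrative assumption only.
n_skip_gram_windows <- function(doc_lengths, window_size) {
  sum(pmax(doc_lengths - window_size + 1, 1))
}

n_skip_gram_windows(c(10, 3), 5)  # 6 windows + 1 short-document window = 7
```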

Furthermore, I noticed that several additional files appeared in my commit (.cpp, .o, .h, etc.). I guess these are simply updates from the master branch, but it seemed strange to me that they appear in my commit, so I wanted to highlight this. Sorry in case I made a mistake; please tell me if any action is required from my side regarding this issue.

dselivanov commented 6 years ago

@manuelbickel thanks for the update. Yes, those files from the src* dir are not needed. Please remove them and I will merge the PR. (See here for an example: https://stackoverflow.com/a/38744242/1069256)

manuelbickel commented 6 years ago

@dselivanov I hope my changes are fine now (sorry for the several reverts; I accidentally did one too many and had to re-revert...). Please let me know if there are any open issues I should resolve (in the future we might certainly update the example). In any case, thank you for your support!

I have also updated the docs, which now include all currently implemented metrics. The order of listing might be changed; that is a matter of taste. For describing the more complex metrics (e.g., the ones using cosine similarity), I did not repeat all calculation steps but referred to the basic calculations, e.g., PMI, and just explained what is calculated further on this basis. I hope this is fine for you; otherwise, we might add more details.
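For readers of the docs: the cosine-similarity step conceptually gives each top word a vector of its (N)PMI values against all top words and then compares those vectors pairwise. The comparison itself is just plain cosine similarity, sketched here in base R (illustrative, not the PR's code):

```r
# Cosine similarity of two numeric vectors, as used conceptually when
# comparing per-word (N)PMI profiles of a topic's top words.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_sim(c(1, 0), c(1, 0))  # identical profiles -> 1
cosine_sim(c(1, 0), c(0, 1))  # orthogonal profiles -> 0
```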

Since measuring coherence is still under research, we might need some more experience to understand which metric makes sense in which context. From my understanding, logratio, for example, seems to favor a small number of topics, whereas the difference metric opts for higher numbers. Hence, this PR should be seen as a starting point rather than a final solution. I will share the experience/results of my current study as soon as it is finished...

dselivanov commented 6 years ago

@manuelbickel thanks for the awesome work! Added you to the authors list: https://github.com/dselivanov/text2vec/commit/2f510553a301fc20a4c962e60f34b019057481a0

manuelbickel commented 6 years ago

@dselivanov I never imagined I would enter the authors list. I feel honoured, thank you. Of course, the final version would never have been as elegant without your support; I am grateful for your "R lessons". As soon as there is feedback and more experience with the metrics, I will try to support the process of updating coherence.