manuelbickel closed this pull request 6 years ago
After having discussed the current version of this PR, I can add some additional metrics, e.g., NPMI or the cosine similarity of top-term NPMI vectors.
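For reference, NPMI (normalized pointwise mutual information) rescales PMI into [-1, 1]. A minimal sketch of the pair-level score in Python (illustrative only; the PR itself is implemented in R, and the probabilities here are hypothetical inputs estimated from a term co-occurrence matrix):

```python
import math

def pmi(p_ij, p_i, p_j, eps=1e-12):
    # Pointwise mutual information: log( p(i,j) / (p(i) * p(j)) ).
    # eps guards against log(0) for pairs that never co-occur.
    return math.log((p_ij + eps) / (p_i * p_j))

def npmi(p_ij, p_i, p_j, eps=1e-12):
    # Normalized PMI: PMI / -log p(i,j), bounded in [-1, 1].
    # ~-1: never co-occur, ~0: independent, ~1: always co-occur.
    return pmi(p_ij, p_i, p_j, eps) / -math.log(p_ij + eps)

# Independent words score ~0; perfectly co-occurring words score ~1.
print(round(npmi(0.06, 0.2, 0.3), 6))  # -> 0.0 (0.06 == 0.2 * 0.3)
print(round(npmi(0.1, 0.1, 0.1), 6))   # -> 1.0 (pair always co-occurs)
```

A topic-level coherence score would then average such pair scores over the top-n terms of each topic.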
@dselivanov Travis highlights an error: 1. Error: idir (@test-iterators.R#53). Not sure what I did wrong, maybe a mistake when creating the branch? I have not touched any original files and only added coherence.R and coherence-test.R. Maybe you have an idea where I went wrong and can point me to a solution?
I don't think it is related to the PR. I will try to check later today or tomorrow. I'm traveling again: http://textworkshop18.ropensci.org :-)
Thanks for the quick feedback, and enjoy your trip. You might directly ask David Mimno (author of the UMass metric paper, i.e., logratio) or Brandon Stewart (stm package implementation of this metric) about recent experience with coherence - they are certainly more up to date in the field than I am ;-)
I'm finally at home - will review tomorrow.
I hope you had an interesting trip and gained some insights, welcome back ;-). Please note that the example is not yet finished with respect to the issue of the number of sliding windows (#253); I will update the example accordingly (thanks for your explanation!), but the rest should be fine. You might also have a look at the tests showing the compliance with the stm and textmineR implementations.
Thank you for reviewing and your excellent comments, as always ;-). I will incorporate your suggestions and check the results again. Due to my schedule, I will probably only be able to start on that this weekend...
This is a very interesting contribution. Coherence is based on Wikipedia word co-occurrence, if I understand correctly, and will work well for the general case. Do you know whether this approach would be reliable when using topic modelling on a highly domain-specific corpus, for example scientific abstracts of machine learning papers? I would imagine that the word "learning" would frequently co-occur in Wikipedia with "teacher", "class", and "student", but in a machine learning context such co-occurrences would not make much sense.
I am trying to understand the boundaries of this solution and whether I could use it on some of my domain-specific corpora.
Keep up the good work!
...first i want to apologize, i could not keep my promise to update the code, busy times after moving house ;-).
regarding the coherence metrics: i have a corpus of 30,000 scientific abstracts in the field of "sustainable energy", and at first sight some metrics seem to propose a reasonable number of topics that allows selecting suitable lda models (have not finished testing yet). coherence metrics are certainly not perfect yet, but a start at least. you might have a look at the paper by röder et al. and the palmetto tool (java) - see the discussion of the first (closed) PR. i hope that helps...
On 7 May 2018, 01:09:31 CEST, mmantyla (notifications@github.com) wrote the comment quoted above.
I have incorporated all your proposals, thanks for the lesson in matrix algebra in R. I have compared the results to my previous version "by hand" via a mini-example and also via the values generated with the test data used in the testthat file; they are all.equal.
I have currently kept the workaround to calculate the number of skip-gram windows in the roxygen example; I hope my understanding is now correct based on your answers in #253. This can be removed as soon as an automatic counter is introduced in create_tcm.
Furthermore, I wanted to note that several additional files appeared in my commit (.cpp, .o, .h, etc.). I guess these are simply updates from the master branch, but it seemed strange to me that they appear in my commit, so I wanted to highlight this. Sorry in case I made a mistake - please tell me if any action is required from my side regarding this issue.
@manuelbickel thanks for the update. Yes, those files from the src directory are not needed. Please remove them and I will merge the PR (see here for an example: https://stackoverflow.com/a/38744242/1069256).
@dselivanov I hope my changes are fine now (sorry for the several reverts, I accidentally did one too many and had to re-revert...). Please let me know if there are any open issues I should resolve (in the future we might certainly update the example). In any case, thank you for your support!
I have also updated the docs, which now include all currently implemented metrics. The order of listing might be changed, which is a matter of taste. For the more complex metrics (e.g., the ones using cosine similarity), I did not repeat all calculation steps but referred to the basic calculations, e.g., pmi, and just explained what is further calculated on this basis. I hope this is fine for you; otherwise, we might add more details.
Since measuring coherence is still under research, we might need some more experience to understand which metric makes sense in which context. From my understanding, e.g., logratio seems to favor a small number of topics, whereas the difference metric favors higher numbers. Hence, this PR should be seen as a starting point rather than a final solution. I will share the experience/results of my current study as soon as it is finished...
@manuelbickel thanks for awesome work! Added you to authors list - https://github.com/dselivanov/text2vec/commit/2f510553a301fc20a4c962e60f34b019057481a0
@dselivanov I never imagined I would enter the authors list. I feel honoured, thank you. Of course, the final version would never have been as elegant without your support; I was grateful for your "R lessons". As soon as there is feedback and more experience with the metrics, I will try to support the process of updating coherence.
Having discussed the initial approach for coherence in the first pull request, this is the second, updated pull request to restart the review on a cleaner basis.
Please note that the test file is not yet complete and not yet executed with testthat. Will do that in the next days... However, it already includes a direct comparison of two coherence metrics with the results from the implementations in the stm and textmineR packages, showing that the results are equal - at least for one test case; I might have to do some more parameter variations.