Closed timm closed 8 years ago
Natural Language Processing is no free lunch either
https://github.com/ds4se/chapters/blob/master/wagnerst/text-mining.md
What is the chapter's clear and approachable take away message?
Natural language processing, while appealing, will not yield satisfying results unless specific steps are followed.
Since the chapter's focus is not on very technical content, it is very accessible. The only technical terminology used is very basic (stemming, clustering, clones, etc.), which can be pretty much taken for granted.
The only issue (and I'll come back to it later) is that some concepts are touched upon quite quickly and could be expanded upon with some examples.
Is the chapter the right length?
The chapter is approximately the right length, so no issue there. Perhaps it's a bit shorter than the limit.
Should anything missing be added?
While I like the four good practices to follow, I think they are perhaps too high level. I wonder if it would be a good idea to make them more specific. E.g., in the case of stemming, the recommendation boils down to "use stemming". To make it more concrete, is there a particular stemming algorithm that should be used (or a list of, say, 3)?
Likewise, I really miss a concrete example of how clustering can help check the level of abstraction. Should this clustering be based on word frequencies, or on something else? This is important since the chapter explicitly comes back to this recommendation in the conclusion.
The same applies to the manual analysis of the data: more concrete examples would make the recommendation much more useful, and would help readers get started more easily. If anything, it seems to me this recommendation could take more space in the chapter, should space be needed, as it is the most insightful (or so it seems to me).
Basically the chapter gives good pointers to these topics, but in some cases it's not clear what to expect from the reference based on just reading the chapter text, which makes it less likely that people will read the reference.
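As a rough illustration of the kind of concrete example asked for here, the Python sketch below stems some comment text, builds word-frequency vectors, and groups similar comments together. Everything in it is an assumption made for illustration and is not taken from the chapter: the suffix list, the stop-word list, the 0.5 similarity threshold, the greedy grouping strategy, and the sample comments are all hypothetical. A real analysis would more likely use an established stemmer such as Porter's.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude choices for illustration only.
SUFFIXES = ("ing", "ed", "s")
STOP_WORDS = {"a", "an", "and", "the", "to"}


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Lowercase, tokenise on letters, drop stop words, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(texts, threshold=0.5):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    vectors = [term_frequencies(t) for t in texts]
    clusters = []  # each cluster is a list of indices into texts
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


comments = [
    "Opens the file and reads the lines",
    "Open a file, read each line",
    "Sends the request to the server",
    "Sending requests to a server",
]

print(cluster(comments))  # [[0, 1], [2, 3]]
```

With stemming, "Sending requests" and "Sends the request" end up with identical word-frequency vectors, which is the effect the stemming recommendation is after. Inspecting which texts land in the same group is one way to start the manual analysis.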
Can anything superfluous be removed (e.g. by deleting some section that does not work so well or by using less jargon, fewer formulae, fewer diagrams, fewer references)?
Not really. Perhaps the introduction could be streamlined a bit. The recommendation to use stemming is perhaps really basic, so in the event that more space is needed, it can be shortened.
What are the aspects of the chapter that authors SHOULD change?
As said above, expanding on some of the good practices by means of more concrete recommendations or examples.
We encouraged (but did not require) the chapter title to be a mantra or something cute/catchy, i.e., some slogan reflecting best practice for data science for SE. If you have a suggestion for a better title, please put it here.
The title is catchy enough. Although since the main recommendation to me is the manual analysis, perhaps putting it in the title could help.
What are the best points of the chapter that the authors should NOT change?
The good practices presented in the chapter are useful, especially the last one. And the initial example drives home the point that "just applying" NLP out of the box gives something far from the initial goal that was envisioned by the researchers.
Perhaps an additional good reference would be Dave Binkley's paper at ICPC 2014 on LDA: David Binkley, Daniel Heinz, Dawn J. Lawrie, Justin Overfelt: Understanding LDA in source code analysis. ICPC 2014: 26-36
It has good discussions of the parameters of LDA and their impact on the results.
There's a typo in the second reference: "... In In Christian Bird, ..." => "... In Christian Bird, ..."
Before filling in this review, please read our Advice to Reviewers.
(If you have confidential comments about this chapter, please email them to one of the book editors.)
Natural language processing is no free lunch either
the markdown file.
https://github.com/ds4se/chapters/blob/master/wagnerst/text-mining.md
What is the chapter's clear and approachable take away message?
NLP still requires skills
Is the chapter written for a generalist audience (no excessive use of technical terminology) with a minimum of diagrams and references? How can it be made more accessible to generalists?
Maybe not so generalist, but general enough for the developers we are talking to
Is the chapter the right length? Should anything missing be added? Can anything superfluous be removed (e.g. by deleting some section that does not work so well or by using less jargon, fewer formulae, fewer diagrams, fewer references)? What are the aspects of the chapter that authors SHOULD change?
ok
We encouraged (but did not require) the chapter title to be a mantra or something cute/catchy, i.e., some slogan reflecting best practice for data science for SE. If you have a suggestion for a better title, please put it here.
Didn't like the title. How about "Just how hard is Natural Language Processing?"
Then start with a para like:
In this era of Siri and Deep Learning and Watson, it may seem that all the technical problems of natural language processing are solved. In this chapter, we offer a brief sanity check on any such claim.
Our task here is not understanding day-to-day chatter (which is the task of Siri). Rather, we explore a harder task that is a core issue in software engineering: the documentation of complex code.
What are the best points of the chapter that the authors should NOT change?
approachable
@wagnerst Please take a look at the reviews and prepare a new version of the chapter by January 13. In particular, focus on adding more details (as suggested by @rrobbes) and consider the proposed title/intro (suggested by @timm). Even if you don't change the title, I like the idea of connecting NLP to Siri, Cortana, Watson, etc.
Thanks to all of you for the great feedback! I tried to incorporate most of it. I think (and hope) I addressed all issues of @rrobbes. I also incorporated the intro idea from @timm. But I have not changed the title. I'm not particularly happy with it, but I didn't really like @timm's idea. And I couldn't come up with something better.
Thanks @wagnerst for the revision. It looks good to me.
After review, relabel to 'reviewTwo'. After second review, relabel to 'EditorsComment'.