**meyersbs** opened this issue 7 years ago
We could use Pygments, the syntax highlighter used by GitHub. Pygments is essentially a collection of lexers, and to use a lexer to identify source code in code review messages, we would first need to deduce the programming language the snippet is written in. I don't think there is an accurate way of identifying the programming language. Furthermore, I don't know how Pygments will behave when given code sandwiched between natural language. I think using Pygments would be over-engineering for the linsights project. Unless we have a simple and straightforward way of replacing code snippets with a tag like `<<CODE>>`, we should ignore the fact that code review messages can have code snippets in them.
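For reference, Pygments does ship a `guess_lexer()` helper that scores a snippet against every registered lexer. A minimal sketch of what language guessing would look like (the example strings are lifted from the review quoted below):

```python
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

def guess_language(snippet):
    """Return Pygments' best guess at the language name, or None."""
    try:
        return guess_lexer(snippet).name
    except ClassNotFound:
        # No lexer scored above zero for this text.
        return None

print(guess_language("ldr(holder_reg, FieldMemOperand(holder_reg, HeapObject::kMapOffset));"))
print(guess_language("Restore ip in the middle seems unnecessary."))
```

In practice the guess is unreliable on short, mixed-language snippets, which is exactly the concern above.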
I think it's important that we replace code snippets with `<<CODE>>`, as the keywords in most programming languages are regular English words, which will skew TF-IDF calculations. More complicated metrics, such as syntactic complexity, rely on parsing the syntactic tree of a sentence; feeding code snippets into the tree parser will undoubtedly produce strange trees that skew results.
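To make the TF-IDF skew concrete, here is a small sketch (using scikit-learn purely as an illustration; it is not part of linsights) showing code tokens landing in the same vocabulary as English words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "if the check fails, return early",   # natural language
    "if (check) return early_exit;",      # source code
]

vectorizer = TfidfVectorizer()
vectorizer.fit(messages)

# "if", "check", and "return" tokenize identically in both messages,
# so the code line is indistinguishable from English at the term level
# and dilutes the IDF weight of those words across the corpus.
print(vectorizer.get_feature_names_out())
```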
I'm going to see if I can find any NLP papers dealing with mixed natural language and programming language text and see what they have done to solve this problem, but I have a feeling that (if such papers exist) their datasets were small enough to replace code snippets manually.
There are numerous papers (e.g. this) that use NLP (typically Naive Bayes classifiers) to determine the programming language of a text snippet, but the underlying assumption in these works seems to be that the snippet contains only code, not natural language (aside from comments). We need a way to determine which sentences in a message contain source code so we can filter them out of our analyses.
For example, we (as humans) know that lines 6-17 below contain source code (`message.id=2157654`). Line 13 has an Yngve score of 3.36, which is the Max Yngve for the message. If we could programmatically determine that lines 6-17 are source code, the Max Yngve for the natural language in the message would be 2.67 (see the sketch below the table).
Line | Sentence |
---|---|
1 | LGTM overall, |
2 | The push/pop between null check and global_context_map check can be eliminated. |
3 | http://codereview.chromium.org/8039/diff/1/4 |
4 | File src/macro-assembler-arm.cc (right): |
5 | http://codereview.chromium.org/8039/diff/1/4#newcode698 |
6 | Line 698: mov(ip, holder_reg); // Restore ip. |
7 | Restore ip in the middle seems unnecessary because Check failure halts |
8 | execution. Something simpler: |
9 | push(holder_reg); |
10 | mov(hodler_reg, ip); |
11 | cmp(holder_reg, ... null_value() ... ); |
12 | Check |
13 | ldr(holder_reg, FieldMemOperand(holder_reg, HeapObject::kMapOffset); |
14 | cmp(holder_reg, ... global_context_map() ...); |
15 | Check(...); |
16 | pop(holder_reg); |
17 | ldr(ip, FieldMemOperand(holder_reg, ...)); |
18 | what do you think? |
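A minimal sketch of that filtering step, assuming per-line Yngve scores have already been computed elsewhere (the function and variable names here are hypothetical):

```python
def max_yngve(yngve_by_line, code_lines):
    """Max Yngve score over natural-language lines only.

    yngve_by_line: dict mapping line number -> Yngve score
    code_lines:    set of line numbers identified as source code
    """
    return max(score for line, score in yngve_by_line.items()
               if line not in code_lines)

# For message.id=2157654, filtering lines 6-17 would drop the 3.36 on
# line 13 and leave 2.67 as the Max Yngve for the natural language.
```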
We could adapt Pygments to detect which programming language is present in a message, but that only tells us that there is code, not where it is. We could try to use Pygments to detect the language of a single sentence, but that will likely produce false positives and false negatives.
We could develop a set of regexes that determine whether or not a sentence contains source code -- except that would be nearly impossible as Chromium uses many programming languages, many of which (arguably) have irregular grammars. Additionally, keywords are almost always valid English words.
We could manually tag sentences that contain source code. However, besides taking forever, that would require post-processing to remove those sentences' tokens from the DB.
We could manually tag a sample of sentences with labels (`natural` and `code`) and train a classifier, but how many sentences do we need for that? Is it worth our time? And I'm sure you'll agree that you're just as sick of classifiers as I am.
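If we did go down that road, the classifier itself would be the easy part; a minimal sketch with scikit-learn (illustrative only; the training sentences are pulled from the table above, and the real question is how many labels we would need):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled sample, taken from message.id=2157654.
sentences = [
    "LGTM overall,",
    "what do you think?",
    "push(holder_reg);",
    "cmp(holder_reg, ... null_value() ... );",
]
labels = ["natural", "natural", "code", "code"]

# Whitespace tokenization so punctuation-heavy code tokens survive.
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), MultinomialNB())
clf.fit(sentences, labels)

print(clf.predict(["pop(holder_reg);", "Something simpler:"]))
```

The labeling effort, not the model, is the cost.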
This is the easy solution: don't. Write it off as a threat to validity. I don't like this solution.
By manually inspecting 600 sentences, I found 51 noteworthy instances where the tokens within the sentence were a mix of natural language and source code (see examples below). To be clear, I only documented 51 of the interesting instances; there are more than that, and there are also a number of sentences consisting of only source code.
ID | Text |
---|---|
13 | I hate FileVersionInfo. |
16 | if (item->isSupportedFormat() && (foundSVGFont || (m_document && m_document->settings()->remoteFontEnabled()))) { |
37 | But we need that to set the got_revision, at least. |
47 | self.directory_parent = os.path.dirname(self.directory) |
50 | Maybe change this function name to GetCommandId() or similar? |
67 | I've updated the patch slightly, in order to fix this not working on Youtube (which was caused by the cache on `WebCore::Page*` objects, which remain the same after a refresh). |
83 | can't you /10 instead of *10 ? |
88 | can you create a new variable called "ten_percent" or something like that. |
94 | here you are checking if the `2.5*Default` is bigger than 1%. |
95 | If it's not, you return `1.5*Default`. |
99 | `*size += kDefaultCacheSize * 5 / 2;`. |
100 | Otherwise maybe you meant 5/2 when you computer "size", not 3/2. |
104 | I couldn't really find a coherent sorting to theme_resources.grd, so I just added them at the bottom. |
109 | The reason for reverting is: The HeapStatistics::total_available_size returned by v8 API doesn't return the correct value right now. |
128 | I had to merge debugger_remote_service.cc because it would not pass "gcl try" - someone has probably obsoleted WebContents in favor of TabContents, so this patchset involves this merge, too. |
181 | Why not take some of your longer conditions here (like command==kDebuggerCommand and command == kEvaluateJavascript) and extract them to helper functions? |
393 | I don't have a good name, but how about MockKeyboardDriverFoo (Win, Mac and Linux)? |
51/600 is 8.5%, which is significant; we need to filter out this noise. Nuthan and I discussed a potential process of feeding each sentence to Pygments (identifying which language(s) are likely to be present in a sentence based on a filetype-to-review mapping) to determine which sequences of tokens need to be replaced with some unique identifier, such as `<<CODE>>`. We're going to think on this.
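A rough sketch of that process, assuming the filetype-to-review mapping gives us candidate lexers per review (the threshold and helper names are guesses, not a worked-out design):

```python
from pygments import lex
from pygments.lexers import get_lexer_for_filename
from pygments.token import Keyword, Operator, Punctuation

def looks_like_code(sentence, lexer, threshold=0.35):
    """Heuristic: lex the sentence and measure how much of it the
    lexer recognizes as code-ish tokens (keywords, operators, ...)."""
    tokens = list(lex(sentence, lexer))
    if not tokens:
        return False
    code_ish = sum(1 for ttype, _ in tokens
                   if ttype in Keyword or ttype in Operator
                   or ttype in Punctuation)
    return code_ish / len(tokens) >= threshold

def mask_code(sentences, filenames):
    """Replace probable code sentences with <<CODE>>, trying the lexer
    for each file touched by the review (the filetype-review mapping)."""
    lexers = [get_lexer_for_filename(f) for f in filenames]
    return ["<<CODE>>" if any(looks_like_code(s, lx) for lx in lexers) else s
            for s in sentences]

masked = mask_code(["Something simpler:", "push(holder_reg);"],
                   ["src/macro-assembler-arm.cc"])
```

The threshold would need tuning; short imperative English ("Check failure halts execution.") will still fool a C++ lexer sometimes.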
### Problem
There are a lot of newlines (`\n`) and carriage returns (`\r`) within the `message.text` field. My first instinct was to strip these out; however, in doing so, I realized that there are blocks of code in the `message.text` field. (Duh!) Stripping out the newlines and carriage returns results in code blocks losing their structure and lots of `r'\s+'`.

We should remove code blocks from `message.text` (or at least ignore them during analysis), but how do we do that programmatically?

### Example
*(The original issue showed the raw `message.text` of an example message and the same text stripped with `r'\r\n'`; the stripped version loses the code block's structure.)*
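To illustrate with a made-up message (hypothetical, echoing the review quoted earlier), stripping `\r\n` flattens the code block into one unreadable line:

```python
import re

raw = (
    "Something simpler:\r\n"
    "push(holder_reg);\r\n"
    "mov(holder_reg, ip);\r\n"
    "what do you think?"
)

# Naive stripping: every newline becomes a space, so the code block's
# line structure is gone and only runs of r'\s+' hint that it was there.
stripped = re.sub(r"\r\n", " ", raw)
print(stripped)
```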
### Proposed Solution
My gut reaction in situations like this is to employ a regex, but unless someone has gone through the pain of creating a regex for every programming language (or at least the most common ones), I don't see how that's possible.
If we could detect code blocks within the natural language, we could just replace them with a special token (something like `<<CODE>>`) to make them easier to ignore when parsing.
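Assuming some detector (Pygments-based or otherwise) can flag the code lines, the replacement itself is trivial; a sketch with a hypothetical `is_code_line()` predicate:

```python
def mask_code_blocks(text, is_code_line):
    """Replace each run of detected code lines with one <<CODE>> token."""
    out = []
    for line in text.splitlines():
        if is_code_line(line):
            # Collapse consecutive code lines into a single token.
            if not out or out[-1] != "<<CODE>>":
                out.append("<<CODE>>")
        else:
            out.append(line)
    return "\n".join(out)
```

All the difficulty lives in `is_code_line`; this just makes downstream parsing ignore whatever it flags.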