**meyersbs** opened this issue 7 years ago
We could use Pygments, the syntax highlighter used by GitHub. Pygments is essentially a collection of lexers, and to use a lexer to identify source code in code review messages, we would first need to deduce the programming language the snippet is written in. I don't think there is an accurate way of identifying the programming language. Furthermore, I don't know how Pygments will behave when given code sandwiched between natural language. I think using Pygments would be over-engineering for the linsights project. Unless we have a simple and straightforward way of replacing code snippets with a tag like `<<CODE>>`, we should ignore the fact that code review messages can have code snippets in them.
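For reference, Pygments does ship a `guess_lexer()` helper that scores a snippet against every registered lexer. A minimal sketch of what language guessing would look like (the example strings are lifted from the review quoted below):

```python
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

def guess_language(snippet):
    """Return Pygments' best guess at the language name, or None."""
    try:
        return guess_lexer(snippet).name
    except ClassNotFound:
        # No lexer scored above zero for this text.
        return None

print(guess_language("ldr(holder_reg, FieldMemOperand(holder_reg, HeapObject::kMapOffset));"))
print(guess_language("Restore ip in the middle seems unnecessary."))
```

In practice the guess is unreliable on short, mixed-language snippets, which is exactly the concern above.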
I think it's important that we replace code snippets with `<<CODE>>`, as the keywords in most programming languages are regular English words, which will skew TF-IDF calculations. More complicated metrics, such as syntactic complexity, rely on parsing the syntactic tree of a sentence; feeding code snippets into the tree parser will undoubtedly produce strange trees that skew results.
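To make the TF-IDF skew concrete, here is a small sketch (using scikit-learn purely as an illustration; it is not part of linsights) showing code tokens landing in the same vocabulary as English words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "if the check fails, return early",   # natural language
    "if (check) return early_exit;",      # source code
]

vectorizer = TfidfVectorizer()
vectorizer.fit(messages)

# "if", "check", and "return" tokenize identically in both messages,
# so the code line is indistinguishable from English at the term level
# and dilutes the IDF weight of those words across the corpus.
print(vectorizer.get_feature_names_out())
```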
I'm going to see if I can find any NLP papers dealing with mixed natural language and programming language text and see what they have done to solve this problem, but I have a feeling that (if such papers exist) their datasets were small enough to replace code snippets manually.
There are numerous papers (e.g. this) that use NLP (typically Naive Bayes classifiers) to determine the programming language of a text snippet, but the underlying assumption in these works seems to be that the snippet contains only code, not natural language (aside from comments). We need a way to determine which sentences in a message contain source code so we can filter them out of our analyses.
For example, we (as humans) know that lines 6-17 below contain source code (`message.id=2157654`). Line 13 has an Yngve score of 3.36, which is the Max Yngve for the message. If we could programmatically determine that lines 6-17 are source code, the Max Yngve for the natural language in the message would be 2.67 (see the sketch below the table).
Line | Sentence |
---|---|
1 | LGTM overall, |
2 | The push/pop between null check and global_context_map check can be eliminated. |
3 | http://codereview.chromium.org/8039/diff/1/4 |
4 | File src/macro-assembler-arm.cc (right): |
5 | http://codereview.chromium.org/8039/diff/1/4#newcode698 |
6 | Line 698: mov(ip, holder_reg); // Restore ip. |
7 | Restore ip in the middle seems unnecessary because Check failure halts |
8 | execution. Something simpler: |
9 | push(holder_reg); |
10 | mov(hodler_reg, ip); |
11 | cmp(holder_reg, ... null_value() ... ); |
12 | Check |
13 | ldr(holder_reg, FieldMemOperand(holder_reg, HeapObject::kMapOffset); |
14 | cmp(holder_reg, ... global_context_map() ...); |
15 | Check(...); |
16 | pop(holder_reg); |
17 | ldr(ip, FieldMemOperand(holder_reg, ...)); |
18 | what do you think? |
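A minimal sketch of that filtering step, assuming per-line Yngve scores have already been computed elsewhere (the function and variable names here are hypothetical):

```python
def max_yngve(yngve_by_line, code_lines):
    """Max Yngve score over natural-language lines only.

    yngve_by_line: dict mapping line number -> Yngve score
    code_lines:    set of line numbers identified as source code
    """
    return max(score for line, score in yngve_by_line.items()
               if line not in code_lines)

# For message.id=2157654, filtering lines 6-17 would drop the 3.36 on
# line 13 and leave 2.67 as the Max Yngve for the natural language.
```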
We could adapt Pygments to detect which programming language is present in a message, but that only tells us that there is code, not where it is. We could try to use Pygments to detect the language of a single sentence, but that will likely produce false positives and false negatives.
We could develop a set of regexes that determine whether or not a sentence contains source code -- except that would be nearly impossible as Chromium uses many programming languages, many of which (arguably) have irregular grammars. Additionally, keywords are almost always valid English words.
We could manually tag sentences that contain source code. However, besides taking forever, that would require post-processing to remove those sentences' tokens from the DB.
We could manually tag a sample of sentences with labels (`natural` and `code`) and train a classifier, but how many sentences do we need for that? Is it worth our time? And I'm sure you'll agree that you're just as sick of classifiers as I am.
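If we did go down that road, the classifier itself would be the easy part; a minimal sketch with scikit-learn (illustrative only; the training sentences are pulled from the table above, and the real question is how many labels we would need):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled sample, taken from message.id=2157654.
sentences = [
    "LGTM overall,",
    "what do you think?",
    "push(holder_reg);",
    "cmp(holder_reg, ... null_value() ... );",
]
labels = ["natural", "natural", "code", "code"]

# Whitespace tokenization so punctuation-heavy code tokens survive.
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), MultinomialNB())
clf.fit(sentences, labels)

print(clf.predict(["pop(holder_reg);", "Something simpler:"]))
```

The labeling effort, not the model, is the cost.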
This is the easy solution: don't. Write it off as a threat to validity. I don't like this solution.
By manually inspecting 600 sentences, I found 51 noteworthy instances where the tokens within the sentence were a mix of natural language and source code (see examples below). To be clear, I only documented 51 of the interesting instances; there are more than that, and there are also a number of sentences consisting of only source code.
ID | Text |
---|---|
13 | I hate FileVersionInfo. |
16 | if (item->isSupportedFormat() && (foundSVGFont || (m_document && m_document->settings()->remoteFontEnabled()))) { |
37 | But we need that to set the got_revision, at least. |
47 | self.directory_parent = os.path.dirname(self.directory) |
50 | Maybe change this function name to GetCommandId() or similar? |
67 | I've updated the patch slightly, in order to fix this not working on Youtube (which was caused by the cache on `WebCore::Page*` objects, which remain the same after a refresh). |
83 | can't you /10 instead of *10 ? |
88 | can you create a new variable called "ten_percent" or something like that. |
94 | here you are checking if the `2.5*Default` is bigger than 1%. |
95 | If it's not, you return `1.5*Default`. |
99 | `*size += kDefaultCacheSize * 5 / 2;`. |
100 | Otherwise maybe you meant 5/2 when you computer "size", not 3/2. |
104 | I couldn't really find a coherent sorting to theme_resources.grd, so I just added them at the bottom. |
109 | The reason for reverting is: The HeapStatistics::total_available_size returned by v8 API doesn't return the correct value right now. |
128 | I had to merge debugger_remote_service.cc because it would not pass "gcl try" - someone has probably obsoleted WebContents in favor of TabContents, so this patchset involves this merge, too. |
181 | Why not take some of your longer conditions here (like command==kDebuggerCommand and command == kEvaluateJavascript) and extract them to helper functions? |
393 | I don't have a good name, but how about MockKeyboardDriverFoo (Win, Mac and Linux)? |
51/600 is 8.5%, which is significant; we need to filter out this noise. Nuthan and I discussed a potential process of feeding each sentence to Pygments (identifying which language(s) are likely to be present in a sentence based on a filetype-to-review mapping) to determine which sequences of tokens need to be replaced with some unique identifier, such as `<<CODE>>`. We're going to think on this.
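A rough sketch of that process, assuming the filetype-to-review mapping gives us candidate lexers per review (the threshold and helper names are guesses, not a worked-out design):

```python
from pygments import lex
from pygments.lexers import get_lexer_for_filename
from pygments.token import Keyword, Operator, Punctuation

def looks_like_code(sentence, lexer, threshold=0.35):
    """Heuristic: lex the sentence and measure how much of it the
    lexer recognizes as code-ish tokens (keywords, operators, ...)."""
    tokens = list(lex(sentence, lexer))
    if not tokens:
        return False
    code_ish = sum(1 for ttype, _ in tokens
                   if ttype in Keyword or ttype in Operator
                   or ttype in Punctuation)
    return code_ish / len(tokens) >= threshold

def mask_code(sentences, filenames):
    """Replace probable code sentences with <<CODE>>, trying the lexer
    for each file touched by the review (the filetype-review mapping)."""
    lexers = [get_lexer_for_filename(f) for f in filenames]
    return ["<<CODE>>" if any(looks_like_code(s, lx) for lx in lexers) else s
            for s in sentences]

masked = mask_code(["Something simpler:", "push(holder_reg);"],
                   ["src/macro-assembler-arm.cc"])
```

The threshold would need tuning; short imperative English ("Check failure halts execution.") will still fool a C++ lexer sometimes.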
### Problem
There are a lot of newlines (`\n`) and carriage returns (`\r`) within the `message.text` field. My first instinct was to strip these out; however, in doing so, I realized that there are blocks of code in the `message.text` field. (Duh!) Stripping out the newlines and carriage returns results in code blocks losing their structure and lots of `r'\s+'`.

We should remove code blocks from `message.text` (or at least ignore them during analysis), but how do we do that programmatically?

### Example
*(The original issue showed the raw `message.text` of an example message and the same text stripped with `r'\r\n'`; the stripped version loses the code block's structure.)*
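To illustrate with a made-up message (hypothetical, echoing the review quoted earlier), stripping `\r\n` flattens the code block into one unreadable line:

```python
import re

raw = (
    "Something simpler:\r\n"
    "push(holder_reg);\r\n"
    "mov(holder_reg, ip);\r\n"
    "what do you think?"
)

# Naive stripping: every newline becomes a space, so the code block's
# line structure is gone and only runs of r'\s+' hint that it was there.
stripped = re.sub(r"\r\n", " ", raw)
print(stripped)
```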
### Proposed Solution
My gut reaction in situations like this is to employ a regex, but unless someone has gone through the pain of creating a regex for every programming language (or at least the most common ones), I don't see how that's possible.
If we could detect code blocks within the natural language, we could just replace them with a special token (something like `<<CODE>>`) to make them easier to ignore when parsing.
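Assuming some detector (Pygments-based or otherwise) can flag the code lines, the replacement itself is trivial; a sketch with a hypothetical `is_code_line()` predicate:

```python
def mask_code_blocks(text, is_code_line):
    """Replace each run of detected code lines with one <<CODE>> token."""
    out = []
    for line in text.splitlines():
        if is_code_line(line):
            # Collapse consecutive code lines into a single token.
            if not out or out[-1] != "<<CODE>>":
                out.append("<<CODE>>")
        else:
            out.append(line)
    return "\n".join(out)
```

All the difficulty lives in `is_code_line`; this just makes downstream parsing ignore whatever it flags.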