devjeetr / DeepTC-Enhancer-Improving-the-Readability-of-Automatically-Generated-Tests

Replication package for our ASE 2020 Paper

About the Identifier Renaming training process #3

Open ShangwenWang opened 3 years ago

ShangwenWang commented 3 years ago

Dear authors, thanks for your interesting work! After reading the paper, I am still unsure about the Identifier Renaming training process. I understand that, given a test, you mask all variable names. What happens next? Do you use the code2seq encoder-decoder to generate the masked variable names (including the test name) as a sequence? I could not work this out from the paper.

I would appreciate it if you could help me better understand this. Thanks a lot.

devjeetr commented 3 years ago

Hi!

What we do is pretty simple! As you said, the first step is to mask the variable names. Each variable gets a different mask, with a special mask for the variable whose name needs to be predicted. Suppose we have two variables in a test method, string0 and string1, and we want to suggest a name for string0. We rename string0 to TARGET_VAR and string1 to VAR1, so code2seq knows it needs to give us a name for string0. Architecturally, we use code2seq as-is, without any modifications. Then, to generate a name for string1, we swap the masks: string0 becomes VAR1 and string1 becomes TARGET_VAR. We are basically running the decoding step once per variable, as the sketch below illustrates. A joint prediction approach could also be effective here (à la JSNice, but with RNN encoders as feature embeddings), but we didn't explore it.
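For concreteness, here is a minimal sketch of that per-variable masking. The helper name `mask_variables` and the string-level renaming are illustrative only; the actual pipeline operates on the parsed AST rather than on raw source text:

```python
import re

def mask_variables(test_source: str, variables: list[str], target: str) -> str:
    """Return a copy of test_source where `target` becomes TARGET_VAR and
    every other variable becomes VAR1, VAR2, ..."""
    renames = {target: "TARGET_VAR"}
    others = [v for v in variables if v != target]
    renames.update({v: f"VAR{i}" for i, v in enumerate(others, start=1)})
    for old, new in renames.items():
        # Whole-word replacement so e.g. string0 does not clobber string01.
        test_source = re.sub(rf"\b{re.escape(old)}\b", new, test_source)
    return test_source

test = "String string0 = foo(); String string1 = bar(string0);"
variables = ["string0", "string1"]
# One decoding pass per variable: each iteration produces the masked input
# from which the model predicts a name for that variable.
for target in variables:
    print(target, "->", mask_variables(test, variables, target))
```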

In addition, our model for generating method names is actually a separate model (essentially the vanilla code2seq implementation). The other thing you have to be careful about is the sampling of the AST paths: you want to ensure that a certain percentage of the AST paths you use contain the target variable. For our dataset, we found it best, in terms of both input size and overall performance, to include only paths that contain the target variable (see the sketch after this paragraph).
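A hedged sketch of that filtering step, assuming a code2seq-style textual format where each path-context is `leftTerminal,path,rightTerminal` (the exact serialization may differ); we keep only contexts whose left or right terminal is the masked target variable:

```python
TARGET_MASK = "TARGET_VAR"

def filter_contexts(contexts: list[str], mask: str = TARGET_MASK) -> list[str]:
    """Keep only AST path-contexts that touch the target variable."""
    kept = []
    for ctx in contexts:
        left, _path, right = ctx.split(",")
        if left == mask or right == mask:
            kept.append(ctx)
    return kept

contexts = [
    "TARGET_VAR,Name|Assign|Call,bar",
    "VAR1,Name|Assign|Name,foo",        # dropped: does not touch the target
    "TARGET_VAR,Name|Decl|Type,String",
]
print(filter_contexts(contexts))
```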

Note that for both variable name and method name prediction, we mask all variable names and method names in the input, as we assume that the names in the automatically generated tests are not meaningful.

I will add that, at this time, the literature suggests that a single model for both variable name and method name prediction would be more effective than the approach we took for this paper. You could look into pretrained approaches such as CodeBERT and T5. At the time of writing the paper, we tried a transformer-based approach, but we were missing a key piece (using Shaw et al.'s relative positional encodings instead of the absolute encodings originally proposed), which was later brought up in Ahmad et al.'s paper on transformers for code summarization.
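For reference, a minimal single-head sketch (in PyTorch; the dimensions and clipping distance are illustrative, not what any of the cited papers used) of Shaw et al.'s idea: a learned per-relative-distance term is added to the attention logits, instead of adding absolute position embeddings to the input:

```python
import torch
import torch.nn as nn

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 32):
        super().__init__()
        self.d_model = d_model
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative distance in [-max, +max].
        self.rel_key = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Clipped relative distance between every pair of positions.
        pos = torch.arange(seq_len, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        a_k = self.rel_key(rel + self.max_rel_dist)  # (seq, seq, d_model)

        # Content term plus relative-position term on the logits.
        logits = q @ k.transpose(-2, -1)                  # (batch, seq, seq)
        logits = logits + torch.einsum("bid,ijd->bij", q, a_k)
        logits = logits / self.d_model ** 0.5

        return logits.softmax(dim=-1) @ v
```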

Let me know if there's anything else that remains unclear.

ShangwenWang commented 3 years ago

Hi, thanks a lot for your quick answer. I'd like to know: in your example, does VAR1 always denote a special mask? In my opinion, if we have already predicted the name of string0, then we could use it when predicting the name of string1. Thanks