JohnlNguyen / semantic_code_search

Semantic Code Search building on a fork from tensor2tensor
Apache License 2.0

Training Intent to Code using Transformer #1

Closed JohnlNguyen closed 5 years ago

JohnlNguyen commented 5 years ago

Metadata about the problem

Rewritten Intent to Code 6 Layers

Test BLEU .30

approx_vocab_size = 2**13 ~ 8k

Hyperparameters

Full hyperparameter set (a registration sketch follows the list):

 'batch_shuffle_size': 512,
 'batch_size': 1024,
 'hidden_size': 512,
 'learning_rate': 0.4,
 'learning_rate_decay_steps': 5000,
 'learning_rate_warmup_steps': 400,
 'length_bucket_step': 1.1,
 'max_length': 256,
 'max_relative_position': 0,
 'max_target_seq_length': 0,
 'min_length_bucket': 8,
 'num_heads': 8,
 'num_hidden_layers': 6,
 'self_attention_type': 'dot_product',
 'train_steps': 10000000,
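
As a reference point, here is a minimal sketch (assuming tensor2tensor's hparams registry; not the exact script used for these runs) of registering the values above as a named hparams set:

```python
# A minimal sketch, assuming tensor2tensor's hparams registry; not the exact
# training script used for the runs reported above.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_conala():
    hparams = transformer.transformer_base()   # 6 layers, 8 heads by default
    hparams.batch_size = 1024
    hparams.hidden_size = 512
    hparams.learning_rate = 0.4
    hparams.learning_rate_warmup_steps = 400
    hparams.max_length = 256
    hparams.num_heads = 8
    hparams.num_hidden_layers = 6              # set to 2 for the smaller run below
    return hparams
```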

Training loss: [screenshot attached]

Rewritten Intent to Code 2 Layers

Same hyperparameters as above, except num_hidden_layers = 2. [screenshot attached]

Code to Rewritten Intent

[screenshot attached]

VHellendoorn commented 5 years ago

Hi, not sure if this is work in progress, but it'd be nice to see a few more details. I think it'd be good to know:

JohnlNguyen commented 5 years ago

Sorry, I was still working on the initial comment; would you mind taking a second look?

I didn't specify the typical sequence length. @VHellendoorn

The next step I would like to take is to train with the GitHub docstring-function data, which has around 1 million pairs and is already a built-in problem in tensor2tensor.
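
As a hedged sketch, that built-in problem could be located via the tensor2tensor registry; the registered name used below is an assumption and should be verified against the registry listing:

```python
# A hedged sketch of locating the built-in GitHub docstring/function problem
# in the tensor2tensor registry; "github_function_docstring" is an assumed
# registered name and should be verified via list_problems().
from tensor2tensor.utils import registry

print([name for name in registry.list_problems() if "docstring" in name])
problem = registry.problem("github_function_docstring")  # assumed name
```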

GitHub Problem

Interestingly, adding the GitHub data doesn't help boost accuracy on the CoNaLa test set. This may be due to the fact that the GitHub data is not preprocessed the same way as the CoNaLa data. [screenshot attached]

VHellendoorn commented 5 years ago

Very interesting, thanks for adding an abundance of details. So it looks to me like (rewritten) intent can be generated pretty well from code. Not so much vice versa, which is quite interesting because the training curve looks healthy. One thing we should maybe look into first is ensuring that it's not an issue with the evaluation and/or data. Could you post some samples of code it generates, e.g. when the training loss gets down around 2? It may also be good to quickly hack together a baseline (2-layer bi-dir RNN + attention should do it); I can share some starter code for that, or I bet there are lots of examples online.
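
To make that baseline concrete, here is a rough sketch in PyTorch (not the starter code mentioned above) of a 2-layer bidirectional GRU encoder with a dot-product-attention decoder; dimensions and vocabulary sizes are illustrative:

```python
# A rough baseline sketch (PyTorch): 2-layer bidirectional GRU encoder plus a
# dot-product attention decoder. Not the starter code mentioned above.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, _ = self.rnn(self.emb(src))     # (batch, src_len, 2 * hidden)
        return outputs


class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, 2 * hidden, batch_first=True)
        self.out = nn.Linear(4 * hidden, vocab_size)

    def forward(self, tgt, enc_outputs):         # tgt: (batch, tgt_len)
        dec_out, _ = self.rnn(self.emb(tgt))                       # (batch, tgt_len, 2h)
        scores = torch.bmm(dec_out, enc_outputs.transpose(1, 2))   # dot-product attention
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_outputs)
        return self.out(torch.cat([dec_out, context], dim=-1))     # per-position logits


# e.g. logits = AttnDecoder(8192)(tgt_ids, Encoder(8192)(src_ids))
```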

It's also interesting that the GitHub data didn't help. I wonder if it's because the intents there are a bit more elaborate; perhaps it'd be worth training with just the part up to a DCNL tag (if any). For instance, see this example, where the second half of the description (although useful) is completely different from anything in the CoNaLa dataset.
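
A hedged sketch of that truncation, assuming the corpus marks line breaks with a literal DCNL token:

```python
# A hedged sketch: keep only the part of a GitHub docstring before the first
# DCNL marker, assuming "DCNL" is the literal line-separator token used there.
def truncate_intent(docstring):
    return docstring.split("DCNL", 1)[0].strip()


# e.g. truncate_intent("Return the user id . DCNL Raises KeyError if missing .")
# -> "Return the user id ."
```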

VHellendoorn commented 5 years ago

P.S.: It looks like the challenge distinguishes between training only on annotated data (and only using rewritten_intents from that one) and training on all (using rewritten where available). It might be good to also run a model on just the annotated data; it should be cleaner and much smaller (i.e. good for prototyping), and we know for a fact that we should be able to get over 10 BLEU to be in the race.

JohnlNguyen commented 5 years ago

@VHellendoorn Actually, there was a problem with the way I tokenized the code. I fixed it by using the same tokenizer and preprocessing script as the baseline model. After this, I got a test BLEU of 0.30 (generating code from rewritten intent), as you can see in the Rewritten Intent to Code 6 Layers section. It significantly outperforms the baseline model.

Vice versa, generating intent from code, I got a validation BLEU of around 0.20, as you can see in the Code to Rewritten Intent section.
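
For reference, the code-side tokenization looks roughly like this (a minimal sketch with Python's tokenize module; the baseline's actual preprocessing script may differ in details):

```python
# A minimal sketch of code-side tokenization using Python's own tokenize
# module; the CoNaLa baseline's preprocessing script may differ in details.
import io
import tokenize


def tokenize_code(snippet):
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(snippet).readline):
        if tok.string.strip():                 # drop NEWLINE/ENDMARKER artifacts
            tokens.append(tok.string)
    return tokens


# e.g. tokenize_code("os.kill(os.getpid(), signal.SIGUSR1)")
# -> ['os', '.', 'kill', '(', 'os', '.', 'getpid', '(', ')', ',',
#     'signal', '.', 'SIGUSR1', ')']
```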

Here are some samples of generated code:

Input: send a signal signal.SIGUSR1 to the current process
Output: def send_signal self signal self signal signal
Ref: os.kill(os.getpid(), signal.SIGUSR1)

Input: decode a hex string '4a4b4c' to UTF-8
Output: def b58decode_chk v return b58decode v
Ref: bytes.fromhex('4a4b4c').decode('utf-8')

Input: concatenate a list of strings ['a', 'b', 'c']
Output: def join_string_list s return join map str s
Ref: """""".join(['a', 'b', 'c'])

VHellendoorn commented 5 years ago

Interesting, thanks. From the outputs you show, should I gather that the model doesn't work very well, or is there some specific tokenization that is being used to make it look like this?

Was this trained with the full corpus, or rewritten only? Looks like the state-of-the-art is about 35 BLEU on intent -> code (which I believe is the only part evaluated in the challenge), so 20 is a good start, but I bet we can push it further. Are you experiencing any remaining issues with training? Otherwise we can focus on task-specific optimizations.
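
For reference, corpus-level BLEU over code tokens can be computed roughly like this (a minimal NLTK sketch; the official CoNaLa scorer may tokenize and smooth differently, so numbers won't be directly comparable):

```python
# A minimal NLTK sketch of corpus-level BLEU over code tokens; the official
# CoNaLa scorer may tokenize and smooth differently. Token lists are
# illustrative, taken from the samples above.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[['os', '.', 'kill', '(', 'os', '.', 'getpid', '(', ')', ',',
                'signal', '.', 'SIGUSR1', ')']]]   # one reference per example
hypotheses = [['def', 'send_signal', 'self', 'signal', 'self', 'signal', 'signal']]
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method3)
print(round(100 * score, 1))   # reported on a 0-100 scale, e.g. "35 BLEU"
```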

JohnlNguyen commented 5 years ago

This was trained with the full corpus.

I don't have any remaining issues with training, and I think we can focus on task-specific optimizations. Would you be free to meet next week to talk about this?

VHellendoorn commented 5 years ago

Yeah, definitely. Only my Monday is full, but I'm available Tuesday, e.g. after class.