JohnlNguyen / semantic_code_search

Semantic Code Search building on a fork from tensor2tensor

Second task #3

Closed: VHellendoorn closed this issue 5 years ago

VHellendoorn commented 5 years ago

It might be worthwhile to run our models on this code captioning task, presented in this paper. A recent ICLR paper actually evaluated Transformer models on that task (Sec. 4.2), so it would be interesting to see whether we can do better with e.g. AST info or cyclic translation.

JohnlNguyen commented 5 years ago

Would this be the same as doing code -> intent translation?

Also do you think it makes more sense to embed the inputs as tokens or subwords?

VHellendoorn commented 5 years ago

I'd say it's quite similar, though the paper does mention that they sometimes have several descriptions; not sure if those are actually used in the benchmark.

I bet you could get quite some mileage out of BPE encoding, perhaps even encoding the code and the descriptions with a shared vocabulary -- this often boosts machine translation when the languages share a character space (and it hasn't been done for code). Here's a BPE encoder I've used before; it should be pretty easy to get going. Tentatively assigning @uzillion, because I bet you are plenty busy with the tree stuff already.
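
Concretely, learning one set of merges over both sides would look something like this (a rough sketch using the subword-nmt Python API; file names and the merge count are placeholders):

```python
from subword_nmt.learn_bpe import learn_bpe

# Assumption: train.code / train.intent are the tokenized, space-separated
# source and target files. Concatenating them and learning a single merge
# table gives a shared subword vocabulary for code and natural language.
with open("train.joint", "w") as joint:
    for path in ("train.code", "train.intent"):
        with open(path) as f:
            joint.writelines(f)

with open("train.joint") as infile, open("bpe.codes", "w") as codes:
    learn_bpe(infile, codes, 10000)  # 10k merge operations, just a guess
```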

uzillion commented 5 years ago

Due to what was probably a silly mistake on my part, my combined vocab file ended up with a lot of tokens missing, and as a result I ended up with a lot of UNKs. Before I retrain the model, I wanted to make sure my preprocessing steps are right.

I am removing the whitespace between subwords and also removing the end-of-word special token </w> from the learned tokens list. For example, KE Y_THAT_MIGHT_EXIST</w> is turned into KEY_THAT_MIGHT_EXIST. Also, in the actual code and intent files, the subwords are marked with the @@ symbol; I eliminate these symbols as well, so c . decode ( '@@ unicode_@@ escape' ) becomes c . decode ( ' unicode_ escape' ). Are these steps correct?

The actual spaces are encoded beforehand with the #SPACE# token.
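
Concretely, for the data files the cleanup looks roughly like this (the helper name is just for illustration):

```python
def strip_bpe_markers(line):
    # Drop the </w> end-of-word marker that appears in the learned tokens,
    # e.g. KEY_THAT_MIGHT_EXIST</w> -> KEY_THAT_MIGHT_EXIST.
    line = line.replace("</w>", "")
    # Drop the @@ symbols that apply-bpe inserts between subwords.
    return line.replace("@@", "")

print(strip_bpe_markers("c . decode ( '@@ unicode_@@ escape' )"))
# -> c . decode ( ' unicode_ escape' )
```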

VHellendoorn commented 5 years ago

Hi, I'm probably gonna need a bit more context for this. Typically to use BPE encoding, you just run the subword-nmt (or SentencePiece) tool on a directory containing your data (first learn_bpe, then apply_bpe). If your data is space-separated, it should pretty much work out of the box; just check the vocabulary it creates and the files after applying BPE.

No need to integrate this into the broader training/testing process; just create a separate BPE pre-processed dataset and throw that at the model to see what it does. Which part of this is giving you problems?
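
For concreteness, the workflow I mean is roughly the following (a sketch with the subword-nmt Python API; paths and the merge count are placeholders):

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Step 1: learn the merge operations from the space-separated training data.
with open("train.code") as infile, open("bpe.codes", "w") as codes_out:
    learn_bpe(infile, codes_out, 10000)

# Step 2: apply them; rare tokens get split into subwords joined by "@@ ".
with open("bpe.codes") as codes_file:
    bpe = BPE(codes_file)

with open("train.code") as infile, open("train.code.bpe", "w") as outfile:
    for line in infile:
        outfile.write(bpe.process_line(line))
```

After that, eyeballing bpe.codes and train.code.bpe should tell you whether the segmentation looks sensible.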

uzillion commented 5 years ago

Yes, I did all of the above. I wasn't sure whether or not to remove the spaces between the subwords, because a GitHub T2T issue said that each token in the vocab file needs to be newline-separated, and I know that subword-nmt generates the merge candidates without actually merging them. I was also a bit confused because the vocab file generated by the learn-bpe command contains space-separated subword tokens, whereas in the actual data files generated by apply-bpe the subwords are separated by @@ (two at signs and a whitespace).

In any case, T2T has a built-in BPE vocab generator. We were not aware of this before, but we ended up using it.
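
For reference, this is roughly how we built and used it (a sketch with T2T's SubwordTextEncoder; the file names and target vocab size are placeholders):

```python
from tensor2tensor.data_generators import text_encoder

def line_generator(path):
    # Yield one tokenized example per line.
    with open(path) as f:
        for line in f:
            yield line.strip()

# Build a subword vocabulary of roughly 2**13 entries from the training code.
encoder = text_encoder.SubwordTextEncoder.build_from_generator(
    line_generator("train.code"), 2**13)
encoder.store_to_file("vocab.code.subwords")

# Encode a line to subword ids and decode it back.
ids = encoder.encode("c . decode ( 'unicode_escape' )")
print(encoder.decode(ids))
```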

As you had predicted, we saw a significant improvement in intent-to-code output, at around 0.23 BLEU. Thanks a lot for your help.