Data Science for Software Engineering (ds4se) is an academic initiative to perform exploratory and causal-inference analysis on software engineering artifacts and metadata. It covers data management, analysis, and benchmarking for deep learning (DL) and traceability.
Description
Code embeddings are abstract representations of source code employed in multiple software engineering automation tasks such as clone detection, traceability, and code generation. This abstract representation is a mathematical entity known as a tensor. Code tensors allow us to manipulate snippets of code in semantic vector spaces instead of complex data structures like call graphs. Initial attempts focused on identifying deep learning strategies to compress code into lower-dimensional vectors (code2vec). Unfortunately, these approaches do not consider autoencoder architectures to represent code. The purpose of this project is to combine a structural language model of code with autoencoder architectures to compress source code snippets into lower-dimensional tensors. The lower-dimensional tensors must be evaluated in terms of semantics (clone detection).
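As a rough illustration of the compression step, the sketch below uses a linear autoencoder (PCA via SVD) over toy vectors standing in for code embeddings. The dimensions, the random "embeddings", and the SVD-based encoder/decoder are illustrative assumptions, not the project's structural language model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "code embeddings": 100 snippets, each a 64-dimensional vector.
# (In the project, these would come from a structural language model.)
X = rng.normal(size=(100, 64))
mean = X.mean(axis=0)

# A linear autoencoder via SVD: project onto the top-k principal
# directions (encoder) and map back (decoder). This is the optimal
# *linear* compressor, a stand-in for a learned encoder/decoder.
k = 8
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
encode = lambda x: (x - mean) @ Vt[:k].T   # 64-d -> 8-d tensor
decode = lambda z: z @ Vt[:k] + mean       # 8-d -> 64-d reconstruction

Z = encode(X)                  # lower-dimensional code tensors
X_hat = decode(Z)
err = np.mean((X - X_hat) ** 2)  # reconstruction loss
print(Z.shape, X_hat.shape)
```

A trained (non-linear) autoencoder replaces `encode`/`decode` with neural networks and minimizes the same kind of reconstruction loss.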
Disentanglement of Source Code Data with Variational Autoencoder
The performance of deep learning approaches for software engineering generally depends on how source code data is represented. Bengio et al. show that different representations can entangle the explanatory factors of variation behind the data. We hypothesize that source code data contains explanatory factors useful for automating many software engineering tasks (e.g., clone detection, traceability, feature location, and code generation). Although some deep learning architectures in SE are able to extract abstract representations for downstream tasks, we are not able to verify such features since the underlying data is entangled. The objective of code generative models is to capture the underlying generative factors of the data. A disentangled representation would allow us to manipulate a single latent unit that is sensitive to a single generative factor. Separate representational units are useful to explain why deep learning models are able to classify or generate source code without posterior knowledge (or labels). This project aims to identify single representational units from source code data. We will use the CodeSearchNet datasets and Variational Autoencoders to implement the approach.
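To make the VAE ingredients concrete, here is a minimal NumPy sketch of the reparameterization trick and the KL-divergence term, with a beta weight (as in beta-VAE) that pressures individual latent units toward disentanglement. The batch size, latent width, and "encoder outputs" are made-up placeholders, not values from the project:

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, so sampling stays differentiable w.r.t. mu, log_var
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent units
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a batch of 4 code snippets, 8 latent units
mu = rng.normal(scale=0.1, size=(4, 8))
log_var = rng.normal(scale=0.1, size=(4, 8))

beta = 4.0  # beta > 1 (beta-VAE) encourages disentangled latent units
z = reparameterize(mu, log_var)
loss_kl = beta * kl_divergence(mu, log_var).mean()
print(z.shape)
```

In training, `loss_kl` is added to the reconstruction loss; sweeping beta trades reconstruction quality against how cleanly each latent unit tracks a single generative factor.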
Project Goals
[ ] Review and analyze the literature on structural models in deep learning
[ ] Implement a vanilla version of an autoencoder where the encoder is a structural language model and the decoder is a sequence-based architecture.
[ ] Evaluate the lower-dimensional tensors in terms of semantics (clone detection problem)
[ ] Implement an interpretability module to test edge cases of the autoencoder
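For the semantic evaluation goal, one common setup (an assumption here, not a prescribed design) is to flag two snippets as clones when the cosine similarity of their lower-dimensional tensors exceeds a threshold:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two tensors in the semantic vector space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical lower-dimensional tensors for three snippets: snippet_a and
# snippet_b are near-duplicates (clones), snippet_c is unrelated code.
snippet_a = np.array([0.9, 0.1, 0.4, 0.0])
snippet_b = np.array([0.8, 0.2, 0.5, 0.1])
snippet_c = np.array([-0.3, 0.9, -0.7, 0.2])

threshold = 0.8  # tunable decision boundary for declaring a clone pair
is_clone_ab = cosine_similarity(snippet_a, snippet_b) > threshold
is_clone_ac = cosine_similarity(snippet_a, snippet_c) > threshold
print(is_clone_ab, is_clone_ac)
```

Sweeping the threshold over a labeled clone benchmark yields precision/recall curves for judging how much semantics the compressed tensors preserve.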
Project Requirements
Required Knowledge: Python, Git, and Statistics
Preferred Knowledge: Deep Learning, TensorFlow, and DVC
Recommended Readings