RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code #79

Open mehilshah opened 1 month ago

mehilshah commented 1 month ago

Publisher

arXiv (Submitted to TOSEM)

Link to The Paper

https://arxiv.org/pdf/2403.07506

Name of The Authors

Zhou Yang, Zhensu Sun, Terry Yue Zhuo, Premkumar Devanbu, David Lo

Year of Publication

2024

Summary

This paper presents a comprehensive survey of 146 studies on non-functional properties of large language models for code (LLM4Code) beyond accuracy, including robustness, security, privacy, explainability, efficiency, and usability. The survey identifies the current state-of-the-art techniques for evaluating and enhancing these properties, discusses the challenges and potential research opportunities, and proposes three perspectives on how to incorporate these properties when developing LLM4Code. The findings highlight the need for more effective evaluation methods, techniques to mitigate issues such as low robustness and privacy leaks, and a better understanding of the trade-offs between different properties.

Contributions of The Paper

  1. Identification of six important non-functional properties of LLM4Code beyond accuracy: The paper systematically reviews the literature and identifies robustness, security, privacy, explainability, efficiency, and usability as crucial properties to consider when developing and evaluating LLM4Code.
  2. Comprehensive review of state-of-the-art techniques: The paper provides a thorough analysis of the current techniques used to evaluate and enhance each of the identified properties, highlighting their strengths and limitations.
  3. Discussion of challenges and research opportunities: The authors discuss the existing challenges in studying and improving these non-functional properties and propose potential research directions to address these challenges.

Comments

Section 5.1, Point 1 highlights the need to identify high-quality training data and advocates for developing tools and techniques for identifying and removing low-quality data points. Section 5.2, Point 4 highlights the need for automated benchmarking of usability; while usability is not our primary focus, it supports the idea of using AI to simulate human behaviour for LLM evaluation (i.e., mimicking developer interactions with coding tools).

To facilitate dynamic benchmarking, we can use this paper to support our arguments and build on the following ideas.

  1. Highlight limitations of static benchmarks:

    • Use examples from the paper to show how LLM4Code can be non-robust or insecure, leak private information, lack explainability, or suffer from usability issues even while performing well on static accuracy metrics.
    • Argue that static benchmarks do not probe the non-functional properties that are critical for real-world LLM4Code deployment and trustworthiness.
    • Illustrate how static benchmarks can be "gamed" by fine-tuning LLM4Code on the benchmark data without improving actual reasoning abilities.
  2. Propose a framework for dynamic benchmark generation:

    • Define the space of desired non-functional properties to test, e.g., robustness to input perturbations, security against adversarial attacks, preservation of data privacy, generation of human-interpretable explanations, efficiency on constrained hardware, and usability for developers with different backgrounds.
    • For each property, identify techniques to automatically generate test cases, e.g., applying common code transformations to create robustness test cases, inserting security vulnerabilities to assess detection, including sensitive information to check for privacy leaks, and collecting real user interaction traces to replay for usability.
    • Devise algorithms to efficiently explore the input space and generate diverse, informative test cases that stress-test the target properties, incorporating ideas like fuzzing, symbolic execution, and evolutionary optimization (a minimal robustness sketch follows this list).
  3. Demonstrate value of dynamic evaluation:

    • Apply your framework to generate dynamic benchmarks for popular LLM4Code and contrast their performance with static benchmark results, highlighting the regressions (see the comparison sketch after this list).
    • Showcase the framework's ability to find important bugs and deficiencies that were previously unknown.
    • Iteratively improve an LLM4Code using the insights from dynamic evaluation, demonstrating actual gains in reasoning capability, not just higher static benchmark scores.
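
For the robustness slice of point 2, here is a minimal sketch of property-targeted test-case generation. It is only an illustration under stated assumptions: the single transformation shown is uniform identifier renaming (semantics-preserving, so a robust LLM4Code should treat both variants alike), and the model-querying and output-comparison steps are left to whatever harness consumes the (original, variant) pairs.

```python
# Minimal sketch: generating semantics-preserving robustness test cases.
# The transformation (uniform identifier renaming) does not change program
# behaviour, so a robust LLM4Code should treat both variants alike.
import ast


class IdentifierRenamer(ast.NodeTransformer):
    """Rename function arguments and locally bound names to fresh names."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        return self.mapping.setdefault(name, f"var_{len(self.mapping)}")

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names we bind (Store) and any later uses of them (Load).
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node


def perturb(source: str) -> str:
    """Return a semantics-preserving variant of `source` (Python 3.9+)."""
    return ast.unparse(IdentifierRenamer().visit(ast.parse(source)))


def robustness_cases(seed_programs):
    """Yield (original, variant) pairs to replay against the model under test."""
    for src in seed_programs:
        yield src, perturb(src)


if __name__ == "__main__":
    seed = (
        "def add(total, increment):\n"
        "    result = total + increment\n"
        "    return result\n"
    )
    for original, variant in robustness_cases([seed]):
        print(variant)  # def add(var_0, var_1): ...
```

The same scaffold would extend to the other properties, e.g., dead-code insertion for robustness, seeded vulnerable snippets for security detection, or canary strings for privacy probes.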
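
For point 3, here is a rough sketch of the static-vs-dynamic comparison. It assumes a hypothetical `passes(model, prompt, tests)` oracle (e.g., running the model's completion against the task's unit tests); none of these names come from the paper, they are placeholders for whichever evaluation harness we end up using.

```python
# Rough sketch of the static-vs-dynamic comparison. `passes` is a hypothetical
# oracle (run the model's completion against the task's unit tests); the
# perturbations dict maps a name to a prompt transformation such as `perturb`.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str
    tests: str  # e.g. a unit-test suite executed against the completion


def pass_rate(model, tasks: List[Task], passes: Callable,
              transform: Callable[[str], str] = lambda p: p) -> float:
    solved = sum(passes(model, transform(t.prompt), t.tests) for t in tasks)
    return solved / len(tasks)


def report_regressions(model, tasks: List[Task], passes: Callable,
                       perturbations: Dict[str, Callable[[str], str]]) -> None:
    static_score = pass_rate(model, tasks, passes)
    for name, transform in perturbations.items():
        dynamic_score = pass_rate(model, tasks, passes, transform)
        print(f"{name}: static {static_score:.2%} -> dynamic {dynamic_score:.2%} "
              f"(regression {static_score - dynamic_score:+.2%})")
```

Calling `report_regressions(model, tasks, passes, {"identifier renaming": perturb})` would then produce the per-perturbation regression table we can contrast with static leaderboard numbers.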