OREL-group / Project-Management-SP23


Improving code translation and evaluation of code generation #270

Open kjain25 opened 1 year ago

kjain25 commented 1 year ago

Please describe the issue. In prior projects, generating code was not the hard part; evaluating how well it was generated was, especially for less popular languages. For example, we were trying to build a GPT-3 model that can auto-complete Solidity smart contracts, but there was no non-manual way to evaluate the generated Solidity that balances logical and semantic correctness. BLEU score exists, but it is currently only easy to integrate for Java and Python, and most people do not have the time to hand-tune the BLEU metric for Solidity code. So we want to create an AI model that adapts the BLEU score to a new language based on prompts.
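To make the gap concrete, here is a minimal sketch of how plain BLEU is typically applied to generated code. This is an illustration, not part of the issue: it assumes NLTK and a naive regex tokenizer, which is exactly the kind of setup that misses the syntactic and semantic structure we care about for Solidity.

```python
# Minimal sketch (assumption: NLTK is available; the issue names no library).
# The naive tokenizer below is one reason vanilla BLEU struggles to reflect
# syntactic/semantic correctness for code in languages like Solidity.
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def tokenize(code: str) -> list[str]:
    # Split into identifiers, numbers, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

reference = "function transfer(address to, uint256 amount) public returns (bool)"
candidate = "function transfer(address recipient, uint256 value) public returns (bool)"

score = sentence_bleu(
    [tokenize(reference)],
    tokenize(candidate),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")  # High n-gram overlap even though parameters were renamed
```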

How is this issue actionable? This would require people with Solidity and NLP expertise to come on board, preferably people who are familiar with the BLEU score code and have a deep understanding of how it is implemented for Python and Java, so that a more generalizable model can be created for new languages. People who understand language structures and how to measure which languages are similar to one another would also be useful for this project.

Additional context We may use language templates for different kinds of languages, so that structurally similar languages share similar BLEU-score models. We might use K-Nearest Neighbors or some clustering algorithm to measure this similarity; a rough sketch of that idea follows.
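The sketch below is illustrative only: it represents each language with hand-picked structural features (placeholder values, not measured data) and clusters them with scikit-learn's KMeans, so that languages in the same cluster could share a BLEU-style template. Feature choices, values, and the number of clusters are all assumptions.

```python
# Illustrative sketch of the clustering idea (feature values are placeholders).
import numpy as np
from sklearn.cluster import KMeans

languages = ["Python", "Java", "Solidity", "JavaScript", "Rust"]
# Hypothetical structural features per language:
# [static typing, curly-brace blocks, class/contract constructs, explicit references/pointers]
features = np.array([
    [0, 0, 1, 0],  # Python
    [1, 1, 1, 0],  # Java
    [1, 1, 1, 0],  # Solidity
    [0, 1, 1, 0],  # JavaScript
    [1, 1, 0, 1],  # Rust
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for lang, label in zip(languages, kmeans.labels_):
    # Languages assigned to the same cluster would reuse the same BLEU-score template.
    print(f"{lang}: template cluster {label}")
```

A real version would need features derived from grammars or ASTs rather than hand-coded flags, but the grouping step itself could stay this simple.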