
Development of a simple Solidity benchmark #2947

Open danielbrdz opened 1 month ago

danielbrdz commented 1 month ago

Development of a simple benchmark to evaluate the performance of different models with Solidity code. This benchmark will be used to measure the impact of fine-tuning on the models.

danielbrdz commented 1 month ago

Developed a simple Solidity benchmark to measure LLM performance on this language in Google Colab. This development was essential to make it easier and faster to measure and test the performance of new models.

Script: https://github.com/EveripediaNetwork/iq-code-evmind/tree/master/Barcenas%20Benchmark

Explanation:

Installation of dependencies: The script begins by installing the necessary libraries, such as transformers, torch, and pandas.
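
In a Colab notebook this step is typically a single install cell; the exact package list below is an assumption (pyarrow is included for the Parquet loading described later):

```python
# Assumed Colab install cell; the script's actual dependency list may differ.
# The "!" prefix runs a shell command inside the notebook.
!pip install -q transformers torch pandas pyarrow matplotlib
```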

Imports: All the required libraries and modules for the script's functionality are imported.

Configuration: Logging is configured and the device (GPU or CPU) for execution is set.
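
A minimal sketch of that setup, assuming standard logging plus a CUDA availability check (names are illustrative, not the script's exact code):

```python
# Minimal sketch of the logging/device configuration described above.
import logging

import torch

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("solidity-benchmark")

# Prefer the GPU when Colab provides one, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info("Running on %s", device)
```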

Loading the model and tokenizer: The function load_model_and_tokenizer loads the language model and its associated tokenizer.
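
A hedged sketch of what such a loader usually looks like with transformers; the signature and the float16 choice are assumptions rather than the script's exact code:

```python
# Hypothetical version of load_model_and_tokenizer; the real function may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_name: str, device: torch.device):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Half precision is an assumption, chosen to fit typical Colab GPU memory.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model.to(device)
    model.eval()  # inference only; no gradient updates during benchmarking
    return model, tokenizer
```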

Generation of Solidity code: generate_solidity_code uses the model to generate Solidity code based on a given prompt.
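
An illustrative version of that function; the sampling parameters (max_new_tokens, temperature) are placeholders, not the script's actual settings:

```python
# Sketch of generate_solidity_code; generation parameters are assumptions.
import torch

def generate_solidity_code(model, tokenizer, prompt: str, device, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens so only the newly generated code remains.
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)
```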

Loading examples: load_solidity_examples loads Solidity code examples from a Parquet file.
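
A minimal sketch of that loader; the file path is a placeholder and the optional row limit is an assumed convenience:

```python
# Sketch of load_solidity_examples; the path and any column layout are assumptions.
import pandas as pd

def load_solidity_examples(path: str, n_examples: int | None = None) -> pd.DataFrame:
    df = pd.read_parquet(path)  # requires pyarrow or fastparquet
    if n_examples is not None:
        df = df.head(n_examples)
    return df
```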

Evaluation of code quality: evaluate_code_quality analyzes the generated code and assigns a score based on various criteria such as structure, use of Solidity elements, security patterns, etc.
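
The script's concrete checks and weights aren't reproduced here; the sketch below just shows the general pattern of regex-based criteria summed into a score:

```python
# Illustrative heuristic scorer in the spirit of evaluate_code_quality;
# the criteria and weights are assumptions, not the script's.
import re

def evaluate_code_quality(code: str) -> float:
    score = 0.0
    if re.search(r"pragma\s+solidity", code):
        score += 0.2  # declares a compiler version
    if re.search(r"\bcontract\s+\w+", code):
        score += 0.2  # defines at least one contract
    if re.search(r"\bfunction\s+\w+", code):
        score += 0.2  # contains functions
    if re.search(r"\b(event|emit)\b", code):
        score += 0.1  # uses events
    if re.search(r"\brequire\s*\(", code):
        score += 0.2  # basic input/state validation, a common security pattern
    if "// SPDX-License-Identifier" in code:
        score += 0.1  # license header, a standard Solidity convention
    return min(score, 1.0)
```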

Evaluation of functional similarity: evaluate_functional_similarity compares the generated code with reference code, based on function and event names.
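
One simple way to implement that comparison is a Jaccard overlap of extracted names; this is an assumption about the method, not a copy of the script's logic:

```python
# Sketch of evaluate_functional_similarity as name-set overlap.
import re

def _extract_names(code: str) -> set[str]:
    # Pull out function and event identifiers from the Solidity source.
    return set(re.findall(r"\b(?:function|event)\s+(\w+)", code))

def evaluate_functional_similarity(generated: str, reference: str) -> float:
    gen_names = _extract_names(generated)
    ref_names = _extract_names(reference)
    if not gen_names and not ref_names:
        return 0.0
    return len(gen_names & ref_names) / len(gen_names | ref_names)
```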

Overall code evaluation: evaluate_code combines the quality and functional similarity scores to provide a final score.
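
Reusing the two sketch functions above, the combination could look like this (the 50/50 weighting is an assumption):

```python
# Illustrative combination of the two scores; the weighting is assumed.
def evaluate_code(generated: str, reference: str) -> dict:
    quality = evaluate_code_quality(generated)
    similarity = evaluate_functional_similarity(generated, reference)
    return {
        "quality": quality,
        "similarity": similarity,
        "final_score": 0.5 * quality + 0.5 * similarity,
    }
```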

Execution of a single task: run_single_task generates code for a given prompt and evaluates it.

Execution of the complete benchmark: run_benchmark executes the generation and evaluation process for all selected examples.
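
Putting those two steps together, a sketch of the loop (timing and result field names are illustrative, and it reuses the earlier sketch functions):

```python
# Sketch of run_single_task and run_benchmark; column names are assumptions.
import time

import pandas as pd

def run_single_task(model, tokenizer, prompt: str, reference: str, device) -> dict:
    start = time.time()
    generated = generate_solidity_code(model, tokenizer, prompt, device)
    elapsed = time.time() - start
    result = evaluate_code(generated, reference)
    result["generation_time"] = elapsed
    return result

def run_benchmark(model, tokenizer, examples: pd.DataFrame, device) -> pd.DataFrame:
    rows = [
        run_single_task(model, tokenizer, row["prompt"], row["code"], device)
        for _, row in examples.iterrows()
    ]
    return pd.DataFrame(rows)
```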

Visualization of results: visualize_results creates graphs to show the distribution of scores and generation times.
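
A minimal matplotlib take on those two plots; the chart types and layout in the actual script may differ:

```python
# Sketch of visualize_results: score and generation-time histograms.
import matplotlib.pyplot as plt
import pandas as pd

def visualize_results(results: pd.DataFrame) -> None:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(results["final_score"], bins=10)
    ax1.set_title("Score distribution")
    ax1.set_xlabel("final score")
    ax2.hist(results["generation_time"], bins=10)
    ax2.set_title("Generation time")
    ax2.set_xlabel("seconds")
    fig.tight_layout()
    plt.show()
```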

Main function: main orchestrates the entire process. It loads the examples, runs the benchmark, saves the results, and generates the visualizations.
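
How the sketches above could be wired together; the file paths and model name are placeholders, not the script's actual values:

```python
# Hypothetical main, reusing the sketch functions and device defined above.
def main() -> None:
    examples = load_solidity_examples("solidity_examples.parquet", n_examples=50)
    model, tokenizer = load_model_and_tokenizer("your-org/your-model", device)
    results = run_benchmark(model, tokenizer, examples, device)
    results.to_csv("benchmark_results.csv", index=False)  # persist raw scores
    visualize_results(results)

if __name__ == "__main__":
    main()
```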

Execution: The script runs by calling the main function.

In summary, the script loads a language model, generates Solidity code based on prompts, evaluates the quality and functional similarity of the generated code, and produces a detailed report with scores and visualizations. This process is repeated for several examples, providing a comprehensive assessment of the model's performance in generating Solidity code.