UnitTestBot / UTBotJava

Automated unit test generation and precise code analysis for Java
Apache License 2.0

[Umbrella Ticket] Make ML path selection better: improve accuracy, speed up the inference and reduce the size of jar #703

Open amandelpie opened 2 years ago

amandelpie commented 2 years ago

This is an umbrella ticket for many sub-tasks.

Description

The existing ML path selection is implemented in the utbot-analytics module. It suffers from a few problems:

  1. It uses external ML libraries for model inference, which makes the jar large.
  2. It uses the Smile library for inference (it would be better to use scikit-learn and provide a model importer).
  3. The Smile wrapper for BLAS is used for matrix multiplication.
  4. The Kotlin implementation without an external runtime is too slow (we need our own native implementation of 1-3 operations such as matrix multiplication); Multik could probably help.
  5. DJL inference is too slow.
  6. The imported library is in JSON/txt format.
  7. We measure the metrics on the contest data.
  8. The utbot-analytics module is de facto not used.
  9. There are a lot of ML-related settings mixed together with other settings in UtSettings.
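
For a sense of scale, the matrix multiplication mentioned in items 3 and 4 can be written without BLAS or any other external library. The following is only a minimal sketch, assuming a small dense layer with row-major weights. The class and method names (`DenseOps`, `affine`, `relu`) are hypothetical and are not taken from the utbot-analytics code.

```java
/** Hypothetical helper: dependency-free dense-layer math for model inference. */
final class DenseOps {
    private DenseOps() {}

    /** Computes y = W * x + b, where W is given as an array of rows. */
    static double[] affine(double[][] w, double[] x, double[] b) {
        if (w.length != b.length) {
            throw new IllegalArgumentException("bias length must match row count");
        }
        double[] y = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double[] row = w[i];
            if (row.length != x.length) {
                throw new IllegalArgumentException("row " + i + " does not match input length");
            }
            // Accumulate the dot product of one weight row with the input vector.
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += row[j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    /** Applies ReLU element-wise and returns a new array. */
    static double[] relu(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = Math.max(0.0, v[i]);
        }
        return out;
    }
}
```

Whether a plain loop like this is fast enough for path selection would have to be measured. The ticket itself reports that the current implementation without an external runtime is too slow.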

Expected behavior

  1. The utbot-analytics module and its inheritors should be easy to enable/disable from the intellij/cli modules.
  2. Scripts for training should be structured and isolated.
  3. Deployed ML models should be a part of the jar.
  4. No external libraries in the utbot-analytics module.
  5. External settings should be extracted to UtMLSettings.
  6. Models are located in resources and packed with the plugin.
  7. Models are not larger than 100 KB (zipped or saved in an alternative binary format, not JSON or txt).
  8. The utbot-analytics module contains only interfaces and pure Kotlin implementations.
  9. utbot contains separate modules for model inference for custom inference implementations (like DJL).
  10. Different path selectors can be easily compared, and the results can be displayed as a report.
  11. New metrics of path selection are created.
  12. We reach significantly better numbers in the metrics.
  13. Obtained models are ranked and well described.
  14. The training process and hyperparameter tuning are well described and published.
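
Items 6 and 7 ask for compact models stored in a binary format rather than JSON or txt. One possible shape for that, sketched here under assumptions, is a gzip-compressed stream holding the matrix dimensions followed by raw `float` weights. The `BinaryModelCodec` name and the layout are hypothetical, not an existing UTBot format. In a plugin, the input stream would presumably come from a classpath resource.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical codec: gzip-compressed [rows][cols][floats...] layout. */
final class BinaryModelCodec {
    private BinaryModelCodec() {}

    /** Writes a rectangular weight matrix as a gzip-compressed binary stream. */
    static void write(float[][] weights, OutputStream out) throws IOException {
        try (DataOutputStream data = new DataOutputStream(new GZIPOutputStream(out))) {
            int rows = weights.length;
            int cols = rows == 0 ? 0 : weights[0].length;
            data.writeInt(rows);
            data.writeInt(cols);
            for (float[] row : weights) {
                if (row.length != cols) {
                    throw new IllegalArgumentException("matrix must be rectangular");
                }
                for (float v : row) {
                    data.writeFloat(v);
                }
            }
        }
    }

    /** Reads a matrix previously produced by write(). */
    static float[][] read(InputStream in) throws IOException {
        try (DataInputStream data = new DataInputStream(new GZIPInputStream(in))) {
            int rows = data.readInt();
            int cols = data.readInt();
            float[][] weights = new float[rows][cols];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    weights[i][j] = data.readFloat();
                }
            }
            return weights;
        }
    }
}
```

Each weight takes 4 bytes before compression, so a 100 KB budget corresponds to roughly 25,000 uncompressed `float` parameters. How much gzip saves on top of that depends on the weights.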

amandelpie commented 1 year ago

So, it was a cool idea and a piece of research that requires a lot of time. Unfortunately, the time was not found, and it also depends on the success of custom path selectors.