Hey,
great initiative to track local LLMs!
Would you be open to talking about how the scores are created?
I generated some GPT-4 scores in a past project and found them not reliable enough: they fluctuated across input sentences with the same meaning, the scores felt somewhat arbitrary, and the same input could get different scores on different days. At the very least you should pin the GPT-4 model version, so you keep control over when OpenAI rolls out updates to gpt-4.
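For concreteness, pinning would look roughly like this (a minimal sketch using the OpenAI Python client; the snapshot name here is just an example of a dated release):

```python
# Minimal sketch: pin a dated GPT-4 snapshot instead of the floating "gpt-4"
# alias, so scores don't silently change when the alias is updated.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-0613",  # example pinned snapshot, not the moving "gpt-4" alias
    temperature=0,       # reduces (but doesn't eliminate) run-to-run fluctuation
    messages=[{"role": "user", "content": "Score this answer from 1-10: ..."}],
)
print(response.choices[0].message.content)
```

Temperature 0 won't make the scoring deterministic, but it does cut down some of the day-to-day variance.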
For code, one could additionally add unit tests that check the generated functions, e.g. something like the sketch below.
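A hypothetical pytest example; `generated.fib` is a made-up module and function standing in for whatever the model produced:

```python
# Hypothetical test for an LLM-generated function. The module `generated` and
# the function `fib` are placeholders, not part of any real project.
import pytest
from generated import fib  # hypothetical: the function the model wrote

# Check the generated function against a few known reference values.
@pytest.mark.parametrize("n, expected", [(0, 0), (1, 1), (2, 1), (10, 55)])
def test_fib_known_values(n, expected):
    assert fib(n) == expected
```

That way a model's "codegen score" could be the fraction of tests passed rather than another LLM's opinion.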