Open bbartling opened 20 hours ago
One approach is to use an LLM agent as the judge. We can make API calls to GPT-4o-mini (as it's cheaper) and ask it to evaluate the similarity between the given answer and the expected response. I have access to free GPT-4o API calls through the lab, so I can test this when I have more time!
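A rough sketch of what that judge step could look like, to make the idea concrete. Everything here is an assumption rather than anything settled in this thread: the prompt wording, the 1 to 5 scale, the JSON reply format, and the `call_llm` callable, which is a hypothetical stand-in for whatever client actually sends the prompt to GPT-4o-mini and returns its text reply.

```python
import json
import re
from typing import Callable

# Assumed prompt wording and 1-5 scale; adjust to whatever the evaluation needs.
JUDGE_PROMPT = """You are grading an HVAC question-answering system.
Question: {question}
Expected answer: {expected}
Model answer: {answer}

Rate how well the model answer matches the expected answer on technical
content, from 1 (wrong or unrelated) to 5 (equivalent).
Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""


def build_judge_prompt(question: str, expected: str, answer: str) -> str:
    """Fill the judge template with one question/answer pair."""
    return JUDGE_PROMPT.format(question=question, expected=expected, answer=answer)


def parse_judge_reply(reply: str) -> dict:
    """Pull the JSON object out of the reply and validate the score range."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge reply: {reply!r}")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": str(data.get("reason", ""))}


def judge(question: str, expected: str, answer: str,
          call_llm: Callable[[str], str]) -> dict:
    """Ask the judge model to grade one answer.

    call_llm is a hypothetical callable: prompt text in, reply text out.
    """
    return parse_judge_reply(call_llm(build_judge_prompt(question, expected, answer)))
```

Keeping the API call behind `call_llm` means the prompt and parsing can be tested with a fake reply before spending any real API calls.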
What should we use in HVAC evaluations? @ozanbarism
For example, in `scripted_compare_models.py`, what should we be asking the LLM, and how should we rank the results? https://github.com/bbartling/HvacGPT/blob/develop/scripted_compare_models.py

Currently the `from sklearn.feature_extraction.text import TfidfVectorizer` comparison is just something fun to start with, but in reality it doesn't work very well.

It would be interesting to ask an actual engineering community what questions and answers we should expect. What's there now was thrown together in 5 seconds, without enough thought.
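On the question of how to rank results, one simple option is to average per-question scores for each model and sort by the mean. This is only a sketch: the `(model, question_id, score)` tuple layout and the `rank_models` name are hypothetical, and the scores could come from any scorer (an LLM judge, TF-IDF similarity, or human grading).

```python
from collections import defaultdict
from statistics import mean


def rank_models(results):
    """Rank models by mean score, best first.

    results: iterable of (model_name, question_id, score) tuples.
    Returns a list of (model_name, mean_score, num_questions) tuples.
    """
    scores = defaultdict(list)
    for model, _question_id, score in results:
        scores[model].append(score)
    summary = [(model, mean(vals), len(vals)) for model, vals in scores.items()]
    # Highest mean score first.
    return sorted(summary, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for two placeholder models, purely for illustration.
    demo = [
        ("model_a", "q1", 4), ("model_a", "q2", 2),
        ("model_b", "q1", 5), ("model_b", "q2", 3),
    ]
    for name, avg, n in rank_models(demo):
        print(f"{name}: mean={avg:.2f} over {n} questions")
```

A plain mean hides which questions a model fails on, so it may be worth keeping the per-question scores around as well.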