eye-on-surveillance / sawt

https://sawt.eyeonsurveillance.org/
MIT License
16 stars, 10 forks

DeepEval Evaluation Functionality [REBASED] #242

Closed outlawhayden closed 6 months ago

outlawhayden commented 6 months ago

Now rebased for a 1-commit PR.

Hey all,

We found a super sweet library called DeepEval that could fill a lot of the need for a model evaluation framework. Not only does it look at response quality, it also evaluates the retrieved context against the answer, and it uses some pretty cutting-edge math for cool metrics including bias, relevancy, toxicity, and more.

To run the current setup, navigate to the evaluate folder and run deepeval test run test_model_cached.py, where the last argument is any of the Python files in that folder whose name starts with 'test'. One of them looks at more of the advanced metrics, one reads a list of prompts from a .csv file and turns them into tests, and one repurposes the original CLI in getanswer/main.py to evaluate a query you input live against some metric (answer relevancy is hardcoded in there for now, but it can be changed).
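For anyone who wants a feel for the CSV-driven variant, here is a rough sketch of the idea, not the code in this PR. It assumes a hypothetical CSV with a `prompt` column; `load_prompts` and `check_answer_relevancy` are illustrative names. The DeepEval calls (`LLMTestCase`, `AnswerRelevancyMetric`, `assert_test`) reflect the library's API as we understand it and may differ between versions.

```python
import csv
import io
from typing import Iterable, List


def load_prompts(lines: Iterable[str], column: str = "prompt") -> List[str]:
    """Return the non-empty values of `column` from CSV text lines.

    The column name "prompt" is an assumption for this sketch.
    """
    reader = csv.DictReader(lines)
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def check_answer_relevancy(prompt: str, answer: str, threshold: float = 0.7) -> None:
    """Assert one prompt/answer pair against DeepEval's answer-relevancy metric."""
    # Imported lazily so the CSV helper above works without deepeval installed.
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(input=prompt, actual_output=answer)
    assert_test(test_case, [AnswerRelevancyMetric(threshold=threshold)])


# Stand-in for a real prompts file; an open file handle works the same way.
sample = io.StringIO("prompt\nWhat did the council say about cameras?\n")
prompts = load_prompts(sample)
```

In a real test file, each loaded prompt would be sent through the model and the answer passed to `check_answer_relevancy`, for example from a parametrized pytest function that `deepeval test run` picks up.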

This is still very much a work in progress with some wrinkles to smooth out, but we wanted to share our progress and create a space for brainstorming where to go in the future!

Underneath, it really is just more external API calls to other models to do the more advanced evaluation. The good news is that it uses OpenAI as well, so you might just have to set your API key as an environment variable before calling the library (reading from the global env file is still not working, one of the aforementioned wrinkles). Other than that, it should work out of the box.
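Until the env-file loading is fixed, one possible workaround is to copy the key into the process environment by hand before DeepEval is imported. The sketch below is only an illustration under assumptions: `parse_env_text` and `apply_env` are hypothetical helpers, the file is assumed to use simple `KEY=value` lines, and `OPENAI_API_KEY` is the variable name the OpenAI client conventionally reads.

```python
import os
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks, comments, and malformed lines."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: Dict[str, str], override: bool = False) -> None:
    """Copy parsed values into os.environ, keeping existing variables by default."""
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value


# Placeholder content; a real run would read the project's env file instead.
example = parse_env_text('# local secrets\nOPENAI_API_KEY="sk-placeholder"\n')
```

Calling `apply_env(example)` before the evaluation code runs would make the key visible to anything that reads `OPENAI_API_KEY` from the environment, without overwriting a key that is already exported in the shell.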

vercel[bot] commented 6 months ago

@outlawhayden is attempting to deploy a commit to the Eye on Surveillance Team Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] commented 6 months ago

The latest updates on your projects.

Name Status Preview Comments Updated (UTC)
sawt ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 28, 2024 4:24am