datakind / data-recipes-ai


Feat/pf - End-to-End Github actions tests #61

Closed: dividor closed this 3 months ago

dividor commented 3 months ago

Reopened to respond to the reviewer and add GitHub Actions.

dividor commented 3 months ago

Added extras:

  1. Batch test file data.jsonl (this will change to use Jan's work), which has two simple tests for promptflow; a sketch of the row format follows this list
  2. GitHub action to build the environment, run the promptflow batch, and check the output
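
For illustration, a promptflow batch file is just one JSON object per line. Below is a hypothetical sketch of generating such a file; the field names ("query", "expected_answer") and the placeholder questions are assumptions, not the actual schema in flows/chainlit-ui-evaluation/data.jsonl.

```python
# Hypothetical sketch: write batch test rows to data.jsonl.
# Field names and example questions are assumptions, not the real schema.
import json

test_cases = [
    {
        "query": "What is the population of country X?",
        "expected_answer": "The population of country X is approximately ...",
    },
    {
        "query": "Plot a bar chart of population by admin1 for country X",
        "expected_answer": "A bar chart image of population by admin1",
    },
]

with open("data.jsonl", "w") as f:
    for case in test_cases:
        # promptflow batch runs read one JSON object per line
        f.write(json.dumps(case) + "\n")
```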
dividor commented 3 months ago

This PR finishes off the first draft of the end-to-end tests. See below for a summary from CONTRIBUTION.md.

Note: it also includes a new demo data zipfile, a script to download it, and documentation, so that assistant analysis tests can be run.

End-to-end tests

End-to-end tests have been configured in GitHub Actions. They use promptflow to call a wrapper around the chainlit UI, in order to test when memories/recipes are used as well as when the assistant does some on-the-fly analysis. To do this, the chainlit class is patched heavily, and there are limitations in how cleanly this could be done, so it isn't an exact replica of the true application, but it does capture changes to the flow as well as testing the assistant directly. The main body of integration tests will test the recipes server and the assistant independently.
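
To give a rough idea of the patching approach, here is a minimal, hypothetical illustration (not the actual wrapper in flows/chainlit-ui-evaluation; the app module name, handler name, and stubbed behaviour are assumptions): chainlit's Message is replaced with a stub that records content instead of rendering it, so the on_message handler can be driven from a promptflow node.

```python
# Hypothetical sketch of driving a chainlit handler outside the UI.
# The real wrapper patches much more (sessions, steps, context, etc.).
import asyncio
from unittest.mock import patch

captured = []

class FakeMessage:
    """Stand-in for chainlit.Message that records content instead of sending it to the UI."""
    def __init__(self, content="", **kwargs):
        self.content = content

    async def send(self):
        captured.append(self.content)

def call_assistant(user_input: str) -> str:
    """Run the app's on_message handler with chainlit patched and return what it 'sent'."""
    captured.clear()
    with patch("chainlit.Message", FakeMessage):
        from app import main  # app module and handler name are assumptions
        asyncio.run(main(FakeMessage(content=user_input)))
    return "\n".join(captured)
```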

Additionally, there were some limitations when implementing this in GitHub Actions, so workarounds were put in place until a later date: promptflow is run on the GitHub Actions host rather than in Docker, and the promptflow wrapper that calls chainlit has to run as a script which is killed based on a STDOUT string. These should be fixed in future.
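
The "kill on a STDOUT string" workaround looks roughly like the sketch below; the wrapper script name and the sentinel string are assumptions, not the actual values used in the workflow.

```python
# Sketch of the workaround: run the wrapper as a script, echo its output into
# the GitHub Actions log, and terminate it once a sentinel line appears.
import subprocess

SENTINEL = "ALL TESTS DONE"  # hypothetical marker printed by the wrapper

proc = subprocess.Popen(
    ["python", "call_assistant.py"],  # assumed wrapper script name
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

for line in proc.stdout:
    print(line, end="")      # surface output in the Actions log
    if SENTINEL in line:
        proc.terminate()     # the wrapper never exits cleanly on its own
        break

proc.wait(timeout=30)
```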

Code for the e2e tests can be found in flows/chainlit-ui-evaluation, as run by .github/workflows/e2e_tests.yml.

The tests work using promptflow evaluation and a call to an LLM to gauge groundedness, because LLM assistants can produce slightly different results when they are not providing answers from memory/recipes. The promptflow evaluation test data can be found in flows/chainlit-ui-evaluation/data.jsonl.
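
The groundedness idea is to ask an LLM how well the assistant's answer is supported by the expected answer, rather than requiring an exact string match. The sketch below illustrates this; it is not the actual promptflow evaluation flow, and the model name and prompt wording are assumptions.

```python
# Illustrative groundedness check: rate how well ANSWER is supported by CONTEXT.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def groundedness_score(answer: str, context: str) -> int:
    """Return a 1-5 rating of how grounded `answer` is in `context`."""
    prompt = (
        "Rate from 1 to 5 how well the ANSWER is supported by the CONTEXT.\n"
        "Reply with a single integer only.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```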

A useful way to test a new scenario and to get the 'expected' output for data.jsonl is to add it to call_assistant_debug.py.
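
In practice that amounts to something like the following (the `call_assistant` import and function name are assumptions used here for illustration): run the new question once through the wrapper, then copy the printed output into data.jsonl as the expected answer.

```python
# Hypothetical addition to call_assistant_debug.py for a new scenario.
from call_assistant import call_assistant  # assumed wrapper function

new_query = "Example question for the new scenario"
result = call_assistant(new_query)
print(result)  # paste this into data.jsonl as the 'expected' output
```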

TODO, future work: