flowaicom / flow-judge

Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafted for accuracy, speed, and customization.
Apache License 2.0

Feat/baseten integration #18

Closed sariola closed 1 month ago

sariola commented 1 month ago

Flow Judge Baseten Model

Summary

This PR introduces FlowJudge Model instantiation using Baseten. It also deploys the model on Baseten on first use.

Key Changes

  1. Created a Baseten class in the models folder:

    • Encapsulates the 'engine' functionality to initialize and run the model
    • Attaches to the Baseten API adapters
    • Refactored according to the changes in the main branch
  2. Created Baseten Adapters:

    • Handling of Webhook requests using /async_predict route from Baseten
    • Sync requests using the OpenAI standard
  3. Baseten Deployment:

    • Authentication with Baseten and setting of API key
    • Model deployment using Truss for the defined model config in model/adapters/baseten/config
    • Production deployment set as the default.
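
For reviewers unfamiliar with the async route, here is a minimal sketch of what a webhook-based request to the production deployment might look like. It only builds the request and does not send it. The URL shape, the `Api-Key` header, and the `model_input` / `webhook_endpoint` field names reflect my understanding of Baseten's async inference API and should be checked against their docs. `build_async_predict_request`, the model ID, and the `prompt` input are hypothetical and not taken from this PR.

```python
from typing import Any, Dict, Tuple


def build_async_predict_request(
    model_id: str,
    api_key: str,
    model_input: Dict[str, Any],
    webhook_url: str,
) -> Tuple[str, Dict[str, str], Dict[str, Any]]:
    """Build (url, headers, json_body) for an async predict call.

    Assumption: the endpoint and field names follow Baseten's async
    inference API as I understand it; verify before relying on this.
    """
    # The PR sets the production deployment as the default target.
    url = f"https://model-{model_id}.api.baseten.co/production/async_predict"
    headers = {"Authorization": f"Api-Key {api_key}"}
    body = {
        "model_input": model_input,       # payload forwarded to the model
        "webhook_endpoint": webhook_url,  # where the result gets posted
    }
    return url, headers, body


# Hypothetical values, for illustration only.
url, headers, body = build_async_predict_request(
    model_id="abc123",
    api_key="YOUR_API_KEY",
    model_input={"prompt": "..."},
    webhook_url="https://proxy.flowrite.com/webhook",
)
print(url)
```

The returned triple can be handed to any HTTP client (for example an aiohttp session) as the URL, headers, and JSON body of a POST request.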

Testing

Manually tested:

Breaking changes

  1. pyproject.toml
    • Introduced an optional dependency for Baseten: baseten = ["truss>=0.9.42"]

sariola commented 1 month ago

What is the reason behind these errors?

@alexwegrzyn Failed to send async predict results for request 35643965633151487de2db4ef135dd15 to webhook endpoint https://proxy.flowrite.com//webhook. Status code: 301, response:

Does the double slash play a role? It's error behavior from baseten logs from last night and from today.

@minaamshahid vLLM has gone into an unhealthy state due to error: , restarting service now... To get more information on this edit the model.py engine flowaicom/baseten/blob/main/model/helper.py and re-deploy.

ghost commented 1 month ago

What is the reason behind these errors?

@alexwegrzyn Failed to send async predict results for request 35643965633151487de2db4ef135dd15 to webhook endpoint https://proxy.flowrite.com//webhook. Status code: 301, response:

Does the double slash play a role? It's error behavior from baseten logs from last night and from today.

Yes, it's because of the double slash:

# curl https://proxy.flowrite.com//webhook
<a href="/webhook">Moved Permanently</a>.

The double slash was probably introduced in the client, in the webhook_url parameter. self.webhook_proxy_url probably already has a trailing slash, and another one is added with +"/webhook" without being normalized before it is sent to Baseten.
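
A minimal sketch of one way to normalize the join so that a trailing slash on the proxy URL no longer matters. `build_webhook_url` is a hypothetical helper, not the function used in the client.

```python
def build_webhook_url(proxy_url: str, path: str = "/webhook") -> str:
    """Join a base URL and a path with exactly one slash between them."""
    # Strip slashes on both sides of the seam, then add a single one back.
    return proxy_url.rstrip("/") + "/" + path.lstrip("/")


# Both spellings of the proxy URL produce the same webhook endpoint.
print(build_webhook_url("https://proxy.flowrite.com/"))
print(build_webhook_url("https://proxy.flowrite.com"))
```

With this in place, the 301 redirect from the proxy should not be triggered regardless of how the proxy URL is entered.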

sariola commented 1 month ago

The double slash was probably introduced in the client, in the webhook_url parameter. self.webhook_proxy_url probably already has a trailing slash, and another one is added with +"/webhook" without being normalized before it is sent to Baseten.

Got it, looks like it might be due to manually inserting the inputs, is it? @minaamshahid

ghost commented 1 month ago

Got it, looks like it might be due to manually inserting the inputs, is it? @minaamshahid

I was able to confirm earlier with Minaam that this is indeed the case. The tests were using the URL with a trailing slash, and they have already been fixed.

minaamshahid commented 1 month ago

The double slash was probably introduced in the client, in the webhook_url parameter. self.webhook_proxy_url probably already has a trailing slash, and another one is added with +"/webhook" without being normalized before it is sent to Baseten.

Got it, looks like it might be due to manually inserting the inputs, is it? @minaamshahid

Yes!

minaamshahid commented 1 month ago

@sariola I've pushed an update with the suggested changes (aiohttp, openai client). There is one change to pyproject.toml for the 'dev' deps: pytest-asyncio is now pinned to >=0.23.6,<0.24.0 instead of >=0.24.0, because the previous constraint conflicted with truss.

@sariola There is a conflicting dependency with pytest-asyncio when downloading optional dependencies for 'dev' and 'baseten':

flow-judge[baseten,dev,hf,llamafile,vllm] 0.1.0 depends on pytest-asyncio>=0.24.0; extra == "dev"
truss 0.9.43 depends on pytest-asyncio<0.24.0 and >=0.23.6

Anything against locking version to <0.24.0 and >=0.23.6 in the repo? From the changelog at least, I don't see changes that would affect our usage in the repo.
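
To make the conflict concrete, here is a small stdlib-only sketch that checks a few sample version strings against both constraints. The helper names and the candidate list are illustrative assumptions, and the tuple-based parsing is a simplification of real PEP 440 version handling.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, ...]:
    """Turn '0.23.6' into (0, 23, 6). Simplified: plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_old_dev(version: str) -> bool:
    # Original 'dev' extra: pytest-asyncio>=0.24.0
    return parse(version) >= parse("0.24.0")


def satisfies_truss(version: str) -> bool:
    # truss 0.9.43: pytest-asyncio>=0.23.6,<0.24.0
    return parse("0.23.6") <= parse(version) < parse("0.24.0")


# Sample version strings, chosen only to probe the boundaries.
candidates = ["0.23.5", "0.23.6", "0.23.8", "0.24.0"]

# No version can satisfy both constraints, so the resolver fails.
both = [v for v in candidates if satisfies_old_dev(v) and satisfies_truss(v)]
print(both)

# Adopting the truss range for 'dev' leaves a non-empty set.
relaxed = [v for v in candidates if satisfies_truss(v)]
print(relaxed)
```

The two ranges are disjoint by construction (one starts where the other ends), which is why narrowing the 'dev' pin to the truss range resolves the clash.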

sariola commented 1 month ago

@sariola I've pushed an update with the suggested changes (aiohttp, openai client). There is one change to pyproject.toml for the 'dev' deps: pytest-asyncio is now pinned to >=0.23.6,<0.24.0 instead of >=0.24.0, because the previous constraint conflicted with truss.

@sariola There is a conflicting dependency with pytest-asyncio when downloading optional dependencies for 'dev' and 'baseten':

flow-judge[baseten,dev,hf,llamafile,vllm] 0.1.0 depends on pytest-asyncio>=0.24.0; extra == "dev"
truss 0.9.43 depends on pytest-asyncio<0.24.0 and >=0.23.6

Anything against locking version to <0.24.0 and >=0.23.6 in the repo? From the changelog at least, I don't see changes that would affect our usage in the repo.

Great! Thank you M.

Looks neat! I'll test it e2e tonight with a new account :muscle:

PS Yeah the clash doesn't seem significant, good call.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 56.21118% with 141 lines in your changes missing coverage. Please review.

:white_check_mark: All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flow_judge/models/baseten.py | 0.00% | 76 Missing :warning: |
| tests/unit/models/test_baseten.py | 90.04% | 20 Missing :warning: |
| flow_judge/models/common.py | 0.00% | 19 Missing :warning: |
| flow_judge/flow_judge.py | 0.00% | 16 Missing :warning: |
| flow_judge/__init__.py | 0.00% | 7 Missing :warning: |
| flow_judge/metrics/presets.py | 0.00% | 2 Missing :warning: |
| flow_judge/metrics/__init__.py | 0.00% | 1 Missing :warning: |

| Files with missing lines | Coverage Δ |
|---|---|
| flow_judge/metrics/metric.py | 0.00% <ø> (ø) |
| ...sts/e2e-local/integrations/test_llama_index_e2e.py | 91.66% <ø> (ø) |
| tests/e2e-local/models/test_llamafile_e2e.py | 86.86% <ø> (ø) |
| tests/unit/models/test_llamafile_unit.py | 100.00% <ø> (ø) |
| tests/unit/test_flow_judge.py | 98.09% <ø> (ø) |
| tests/unit/test_metrics.py | 100.00% <ø> (ø) |
| tests/unit/test_utils.py | 100.00% <ø> (ø) |
| flow_judge/metrics/__init__.py | 0.00% <0.00%> (ø) |
| flow_judge/metrics/presets.py | 0.00% <0.00%> (ø) |
| flow_judge/__init__.py | 0.00% <0.00%> (ø) |
... and 4 more

... and 1 file with indirect coverage changes
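
As a sanity check on the report's headline numbers, the per-file missing counts can be summed and compared with the stated patch coverage. The per-file counts come from the report; the total of 322 patch lines is not stated anywhere in it and is inferred here as the value consistent with 141 missing lines at 56.21118% coverage.

```python
# Missing lines per file, copied from the Codecov report.
missing_by_file = {
    "flow_judge/models/baseten.py": 76,
    "tests/unit/models/test_baseten.py": 20,
    "flow_judge/models/common.py": 19,
    "flow_judge/flow_judge.py": 16,
    "flow_judge/__init__.py": 7,
    "flow_judge/metrics/presets.py": 2,
    "flow_judge/metrics/__init__.py": 1,
}

total_missing = sum(missing_by_file.values())

# Assumption: inferred total number of lines in the patch (not in the report).
patch_lines = 322

covered = patch_lines - total_missing
patch_coverage = covered / patch_lines * 100

print(total_missing)
print(round(patch_coverage, 5))
```

Most of the gap comes from flow_judge/models/baseten.py, which accounts for 76 of the 141 missing lines.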