biocypher / biochatter

Backend library for conversational AI in biomedicine
http://biochatter.org/
MIT License

[Bug] Not all benchmarks are being run with --run-all flag #137

Open WagnerJon opened 3 months ago

WagnerJon commented 3 months ago

Some benchmarks are skipped even when the "--run-all" flag is included.

To Reproduce
Run "pytest benchmark --run-all".

Stack trace / output
collected 106 items

benchmark/test_vectorstore_semantic_search.py ss [ 1%]
benchmark/test_biocypher_query_generation.py ..s......s................................................ [ 56%]
................... [ 74%]
benchmark/test_rag_interpretation.py .s.s.ss.s..s.s.s [ 89%]
benchmark/test_biocypher_query_generation.py sssssssssss [100%]

================================= 83 passed, 23 skipped in 212.76s (0:03:32) ==================================

Expected behavior
All benchmarks should be run.

Desktop
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0


nilskre commented 3 months ago

When the --run-all flag is set, the content of the result files should be deleted (done here), so they should only contain the header without any further content:

model_name,subtask,score,iterations,md5_hash,datetime

Can you verify that the content of the files is deleted?
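
For illustration only, here is a minimal sketch of the mechanism described above; the benchmark/results path, file layout, and helper names are assumptions, not the actual biochatter code:

# Hypothetical sketch, not the biochatter implementation: reset each result
# CSV to its header line so that previously recorded scores cannot cause
# benchmarks to be skipped on the next run.
from pathlib import Path

RESULT_HEADER = "model_name,subtask,score,iterations,md5_hash,datetime\n"

def reset_result_files(result_dir: str = "benchmark/results") -> None:
    """Truncate every result CSV to the header only (assumed layout)."""
    for csv_file in Path(result_dir).glob("*.csv"):
        csv_file.write_text(RESULT_HEADER)

def is_reset(csv_file: Path) -> bool:
    """Return True if the file contains nothing but the header."""
    return csv_file.read_text().strip() == RESULT_HEADER.strip()

if __name__ == "__main__":
    # Quick check, as asked above: list any result file that still has rows.
    for f in Path("benchmark/results").glob("*.csv"):
        if not is_reset(f):
            print(f"{f} still contains previous results")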

slobentanzer commented 3 months ago

Could you please debug the tests instead of running them with --run-all? If this is just about making sure that the benchmark runs, running one test should be enough. If you want to figure out why a test is failing or skipped, debugging will let you check that as well. The same goes for file deletion, etc.

Just to clarify: the expected behaviour is not that all tests are run, because some contain skip conditions (you can see them easily in the test code). It does not make sense to run "implicit" RAG evaluation with an "explicit" prompt, for instance: https://github.com/biocypher/biochatter/blob/1d27e3214fc96cef0833422bbb77b627970aaa45/benchmark/test_rag_interpretation.py#L73
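
As a rough illustration of such a skip condition (the parameter names and values below are invented for the example, not the actual benchmark code):

import pytest

# Invented example of a skip condition inside a parametrized benchmark test:
# combinations that do not make sense are skipped on purpose, so they show
# up as 's' in the pytest output even when --run-all is passed.
@pytest.mark.parametrize("evaluation_mode", ["implicit", "explicit"])
@pytest.mark.parametrize("prompt_style", ["implicit", "explicit"])
def test_rag_interpretation_example(evaluation_mode, prompt_style):
    if evaluation_mode == "implicit" and prompt_style == "explicit":
        pytest.skip("implicit evaluation with an explicit prompt is not meaningful")
    assert True  # placeholder for the actual benchmark logic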

nilskre commented 3 months ago

benchmark/test_biocypher_query_generation.py ..s......s................................................ [ 56%] -> the first skipped test is the test case where the expected relationships are empty here.
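
A minimal, invented sketch of that kind of skip (not the real test code):

import pytest

def score_query_generation(expected_relationships):
    # Invented helper: a case with no expected relationships has nothing to
    # score, so it is skipped intentionally rather than reported as failed.
    if not expected_relationships:
        pytest.skip("no expected relationships defined for this test case")
    ...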

benchmark/test_biocypher_query_generation.py sssssssssss [100%] -> these skips are probably coming from here