[Eval] Hybrid search evaluation

Following issue #76 (Elasticsearch integration), we should examine how the integration affects evaluation results.

Test set: ta1

Comparisons:

Takeaways:

In most cases, hybrid search performs well, yielding identical or similar results in (3) and (4).
In 10-20% of cases, term pre-filtering is too strict and returns no results.
Questions that contain key terms remain challenging, for example:
- SV2AIR3 model formula
- What is the SIDARTHE-V model?
- Differences between the original SIDARTHE and SIDARTHE-V
Further work is needed on evaluation questions without key terms. Current performance is neither better nor worse than XDD V2. Improved metrics are needed to quantify results.
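
As a starting point for such metrics, here is a minimal sketch (not what the repo currently does) of two simple numbers that would put figures on the takeaways above: the share of questions for which a system returns nothing, and the top-k overlap between two systems' ranked results. The function names, question IDs, and paragraph IDs are all hypothetical, and Jaccard overlap is just one possible choice of similarity.

```python
from typing import Dict, List


def empty_rate(results: Dict[str, List[str]]) -> float:
    """Fraction of questions for which a system returned no results."""
    if not results:
        return 0.0
    return sum(1 for docs in results.values() if not docs) / len(results)


def overlap_at_k(a: List[str], b: List[str], k: int = 5) -> float:
    """Jaccard overlap between the top-k result IDs of two systems."""
    top_a, top_b = set(a[:k]), set(b[:k])
    if not top_a and not top_b:
        return 1.0  # both empty: treat as identical
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical data: question ID -> ranked paragraph IDs.
hybrid = {
    "q1": ["p1", "p2", "p3"],
    "q2": [],  # pre-filter removed every candidate
    "q3": ["p7", "p8"],
    "q4": ["p4", "p5"],
}
baseline = {
    "q1": ["p1", "p2", "p3"],
    "q2": ["p9"],
    "q3": ["p8", "p6"],
    "q4": ["p4", "p5"],
}

print(f"empty rate (hybrid): {empty_rate(hybrid):.2f}")
for qid in hybrid:
    print(qid, round(overlap_at_k(hybrid[qid], baseline[qid]), 2))
```

Neither number measures answer quality on its own; they only make "identical or similar" and "returns no results" countable, so relevance labels would still be needed for a real comparison.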

