This PR adds a test that compares LintDB's results against ColBERT.
We compare the scores of the top 10 results within a tolerance of 0.1.
This can be improved. At one point we had exact parity with ColBERT on CPU, and not much has changed since. Now we're comparing LintDB on CPU with ColBERT on GPU.
It's possible the results have diverged somewhat. I need to re-extract a ColBERT branch that can output intermediate steps for comparison.
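
For reference, the top-10 comparison described above can be sketched roughly as follows. This is a minimal illustration, not the test added in this PR: the function name `top_k_scores_match` and its parameters are hypothetical, and it assumes both score lists are already sorted by rank.

```python
from typing import Sequence


def top_k_scores_match(
    expected: Sequence[float],
    actual: Sequence[float],
    k: int = 10,
    tolerance: float = 0.1,
) -> bool:
    """Return True if the first k scores agree pairwise within `tolerance`.

    Hypothetical helper: `expected` would hold the ColBERT scores and
    `actual` the LintDB scores, both ordered from best to worst result.
    """
    # Fewer than k results on either side counts as a mismatch.
    if len(expected) < k or len(actual) < k:
        return False
    # Compare scores rank by rank using an absolute tolerance.
    return all(
        abs(e - a) <= tolerance
        for e, a in zip(expected[:k], actual[:k])
    )
```

An absolute tolerance is used here because the text states a fixed 0.1 bound. Once intermediate outputs from ColBERT are available, the same check could be applied per step with a smaller `tolerance`.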