librarian: benchmark for librarian language models

Yoshiki is working on a benchmark program. The example task is to extract the year (or time span) and a description of each event or artifact mentioned in a talk transcript. To see how well a language model performs on the task, we need a judge program that checks that the expected items are in the result and that no unexpected items are. The judging comes down to comparing two entries, [{year, description}, {year, description}].
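A minimal sketch of that check, under assumptions: `judgeResult` and `sameEntry` are hypothetical names, the sample entries are illustrative and not taken from the benchmark data, and the pairwise comparison here is a plain string match standing in for the language-model call that the real judge would make.

```javascript
// Hypothetical stand-in for the model-backed comparison: two entries match
// when the years are equal and the descriptions are equal, ignoring case
// and surrounding whitespace. The real judge would ask a language model.
function sameEntry(a, b) {
  const norm = (s) => s.trim().toLowerCase();
  return String(a.year) === String(b.year) &&
    norm(a.description) === norm(b.description);
}

// Split a model's output into expected items that were found, expected
// items that are missing, and output items that match nothing expected.
function judgeResult(expected, actual, same = sameEntry) {
  const found = expected.filter((e) => actual.some((r) => same(e, r)));
  const missing = expected.filter((e) => !actual.some((r) => same(e, r)));
  const unexpected = actual.filter((r) => !expected.some((e) => same(e, r)));
  return { found, missing, unexpected };
}

// Illustrative sample data (an assumption, not the benchmark's own).
const expected = [
  { year: "1968", description: "Demo of the oN-Line System" },
  { year: "1973", description: "Xerox Alto" },
];
const actual = [
  { year: "1968", description: "demo of the on-line system" },
  { year: "1984", description: "Macintosh" },
];

const report = judgeResult(expected, actual);
console.log(report);
```

Passing the comparison in as a function keeps the bookkeeping separate from the part a model has to do, so the same report can be produced with whichever pairwise judge turns out to work.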

Currently I am working on the judge part. I lifted some code from the llama.cpp front-end JS code and turned it into Node.js code. My test data consists of pairs of such entries, each with an expected result of yes or no. The LLaVA model in llamafile 1.5 is not good at this, and as of writing I have found some timing issues in my JS code. The next steps are finding a model that can do a good job and fixing the timing issues in my code.
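One way the yes/no test data could be scored is sketched below. The shape of a test case (`a`, `b`, `expected`), the function name `scoreJudge`, and the sample cases and answers are all assumptions made for illustration; the answers would in practice come from the model being evaluated as a judge.

```javascript
// Compare a judge's yes/no answers against the expected answers and
// count correct answers, wrong "yes" answers, and wrong "no" answers.
function scoreJudge(cases, answers) {
  if (cases.length !== answers.length) {
    throw new Error("one answer per test case is required");
  }
  const counts = { correct: 0, falseYes: 0, falseNo: 0 };
  cases.forEach((c, i) => {
    if (answers[i] === c.expected) counts.correct += 1;
    else if (answers[i] === "yes") counts.falseYes += 1;
    else counts.falseNo += 1;
  });
  const accuracy = cases.length ? counts.correct / cases.length : 0;
  return { ...counts, accuracy };
}

// Hypothetical test cases: a pair of entries plus the expected verdict.
const cases = [
  { a: { year: "1968", description: "Demo of the oN-Line System" },
    b: { year: "1968", description: "Engelbart's NLS demo" },
    expected: "yes" },
  { a: { year: "1973", description: "Xerox Alto" },
    b: { year: "1984", description: "Macintosh" },
    expected: "no" },
  { a: { year: "1973", description: "Xerox Alto" },
    b: { year: "1973", description: "The Alto computer at Xerox PARC" },
    expected: "yes" },
  { a: { year: "1984", description: "Macintosh" },
    b: { year: "1984", description: "Xerox Alto" },
    expected: "no" },
];

// Hypothetical answers from a judge that wrongly accepts the second pair.
const answers = ["yes", "yes", "yes", "no"];

const score = scoreJudge(cases, answers);
console.log(score);
```

Counting wrong "yes" and wrong "no" answers separately shows whether a candidate model tends to accept mismatched entries or reject matching ones, which a single accuracy number would hide.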

ajbouh / substrate

librarian: benchmark for librarian language models #26