-
This issue reports the work done to evaluate the existing models.
The existing models are the following (see the invocation sketch after the list):
- sct_deepseg_lesion
- sct_deepseg -t seg_sc_ms_lesion_stir_psir
- sct_deepseg -t seg…
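For reference, a minimal sketch of how these models might be invoked, assuming Spinal Cord Toolbox is installed and on the PATH; the input image is a placeholder, the contrast flag for `sct_deepseg_lesion` is an assumption, and the exact flag spelling for `sct_deepseg` (`-t` vs. `-task`) may depend on the SCT version:

```python
import subprocess

# Placeholder input image; replace with a real NIfTI file.
image = "sub-01_T2w.nii.gz"

# Legacy deep-learning lesion segmentation (contrast flag is an assumption; see `sct_deepseg_lesion -h`).
subprocess.run(["sct_deepseg_lesion", "-i", image, "-c", "t2"], check=True)

# Task-based model from the list above (flag copied from the list; newer releases may spell it `-task`).
subprocess.run(["sct_deepseg", "-i", image, "-t", "seg_sc_ms_lesion_stir_psir"], check=True)
```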
-
Thank you for the great work. It's been really helpful.
When I tried to run the benchmark, I found no umls_2023ab_SMM4H-17_cid2name.json for en_norm and no umls_2023ab_MedDRA_cid2name.json for en_k…
-
There are some use cases where it might be useful for gavel to ingest code or entire components and use an AI model to analyze them.
Recent post on the Martin Fowler Blog: https://martinfowler.com…
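As a rough illustration of that flow (this is not gavel's API; the client, model name, and prompt are placeholders), the analysis step could be as simple as handing the ingested source to a chat-completion endpoint:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any comparable provider would do

def analyze_component(source_code: str) -> str:
    """Ask an AI model to review an ingested component (prompt is illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a code reviewer. Summarize risks and questionable patterns."},
            {"role": "user", "content": source_code},
        ],
    )
    return response.choices[0].message.content
```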
-
Dear authors,
Thank you for the great work on long-context multi-model evaluation. In the code base, I only saw the code for Azure, OpenAI, Gemini, and Anthropic; could you also provide the evalu…
-
### Feature request
I would like to make an example notebook for evaluating a PEFT model on reproducible tasks and metrics using the lm-eval harness, if possible.
Library here - https://github.…
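For context, here is a minimal sketch of what such a notebook could run, assuming the harness's `hf` model type accepts a PEFT adapter path via `model_args` (argument names should be checked against the lm-eval docs; the base model, adapter path, and task below are placeholders):

```python
import lm_eval

# Placeholder base model, adapter path, and task; swap in the real ones.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,peft=./my-peft-adapter",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"])
```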
-
Title.
https://huggingface.co/spaces/mteb/leaderboard
Can use llamafile for local generation: https://future.mozilla.org/builders/news_insights/llamafiles-for-embeddings-in-local-rag-applications/
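If it helps, a minimal sketch of pulling embeddings from a locally running llamafile, assuming it exposes the usual OpenAI-compatible endpoint on the default port (both of those are assumptions and depend on how the llamafile was started):

```python
import requests

# Assumes an embedding-capable llamafile is already serving on localhost:8080.
resp = requests.post(
    "http://localhost:8080/v1/embeddings",  # endpoint path and port are assumptions
    json={"model": "local", "input": "an example sentence to embed"},  # "model" is typically ignored locally
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))
```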
-
Hi,
Great work!
Did you use the same prompt for all models evaluated on DREAM-1k?
If not, what prompts did you use for different models?
-
Hey @sjahangard
1) I assume this function is used to draw frames from the video and feed them to the image-based model. Is that correct?
https://github.com/JRDB-dataset/JRDB-Social/blob/b9b5ee…
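To make the assumption in 1) concrete, this is roughly the behavior I have in mind (an OpenCV-based guess, not the repo's actual code):

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames from a video so each can be fed to an image-based model."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        raise ValueError(f"Could not read frames from {video_path}")
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```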
-
## Detailed Description
It would be great to evaluate the live models.
## Context
- good to test the different models
## Possible Implementation
- use data from [here](https://huggingface.co/d…
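As a starting point, a rough sketch of the evaluation loop, assuming the data is a Hugging Face dataset with `prompt`/`answer`-style columns and that `query_live_model` is whatever client already talks to the live model (the dataset name, column names, and client are all placeholders, since the link above is truncated):

```python
from datasets import load_dataset

def evaluate_live_model(query_live_model, dataset_name: str, split: str = "test") -> float:
    """Score a live model on a Hugging Face dataset; column names are assumptions."""
    data = load_dataset(dataset_name, split=split)
    correct = 0
    for example in data:
        prediction = query_live_model(example["prompt"])  # hypothetical client call
        correct += int(prediction.strip() == example["answer"].strip())
    return correct / len(data)
```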
-
I was wondering how the trained models are intended to be evaluated. I don't believe that the paper states how many samples were used to compute the metrics. The code appears to give some indication b…