Open laugustyniak opened 1 year ago
[ ] MUST READ first -> https://github.com/beir-cellar/beir/wiki/Load-your-custom-dataset (it helps with understanding the BEIR data formats)
[ ] create a mapping from https://huggingface.co/datasets/squad_v2 to the BEIR format -> idea: name it `load_hf_to_beir`; use the `validation` set, or even load only part of the data, for testing
[ ] run a test in Colab using https://colab.research.google.com/drive/1HfutiEhHMJLXiWGT8pcipxT5L2TpYEdt?usp=sharing#scrollTo=tC2L6VWtAS5J (Colab example from the BEIR repo)
[ ] Based on the results obtained from BEIR, create a simple sample submission for LEPISZCZE https://embeddingsclarinpl.netlify.app/submission/ - it should contain only the metrics, dataset name, model name, and task; see https://github.com/CLARIN-PL/embeddings/blob/043977c852dc87ea01b8bc7f3383e0ebdf6912f8/webpage/data/results/msmarco_bm_25.json
[ ] Add the generated JSON file to LEPISZCZE https://embeddingsclarinpl.netlify.app/submission/
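The submission step in the list above could look roughly like the sketch below. The key names (`dataset_name`, `model_name`, `task`, `metrics`) and the `build_submission` helper are hypothetical; the real schema should be copied from the linked `msmarco_bm_25.json`. The metric values are placeholders, not results.

```python
import json
from typing import Dict


def build_submission(
    dataset_name: str,
    model_name: str,
    task: str,
    *metric_groups: Dict[str, float],
) -> dict:
    """Merge metric dicts (e.g. NDCG@k, MAP@k) into one submission record.

    All key names here are assumptions; align them with the example file
    in the LEPISZCZE repository before submitting.
    """
    metrics: Dict[str, float] = {}
    for group in metric_groups:
        metrics.update(group)
    return {
        "dataset_name": dataset_name,
        "model_name": model_name,
        "task": task,
        "metrics": metrics,
    }


# Placeholder numbers only, to show the shape of the record.
ndcg = {"NDCG@10": 0.0}
recall = {"Recall@100": 0.0}
record = build_submission("squad_v2", "bm25", "information_retrieval", ndcg, recall)
submission_json = json.dumps(record, indent=2)
print(submission_json)
```

Writing `submission_json` to a file then gives the JSON to add to the submission page.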
some similar ideas:

    from beir.datasets.data_loader_hf import HFDataLoader

    # `dataset` and `split` are expected to be defined beforehand
    corpus, queries, qrels = HFDataLoader(
        hf_repo=f"clarin-knext/{dataset}", streaming=False, keep_in_memory=False
    ).load(split=split)

    # Conversion from Dataset objects to plain dicts
    queries = {query['id']: {'text': query['text']} for query in queries}
    corpus = {doc['id']: {'title': doc['title'], 'text': doc['text']} for doc in corpus}

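In the same spirit, the `load_hf_to_beir` mapping from the task list could be sketched as below. This is only a rough idea, not the final implementation. It assumes the squad_v2 record fields `id`, `title`, `context`, `question`, and `answers`. It targets the dict layout described on the BEIR wiki page: corpus as `{doc_id: {"title", "text"}}`, queries as `{query_id: text}`, qrels as `{query_id: {doc_id: score}}`. The `doc{n}` ids, the `limit` parameter, and the choice to skip unanswerable questions are assumptions made for illustration.

```python
from typing import Dict, Iterable, Mapping, Optional, Tuple

Corpus = Dict[str, Dict[str, str]]
Queries = Dict[str, str]
Qrels = Dict[str, Dict[str, int]]


def load_hf_to_beir(
    records: Iterable[Mapping], limit: Optional[int] = None
) -> Tuple[Corpus, Queries, Qrels]:
    """Map SQuAD-style records to BEIR-style corpus, queries, and qrels."""
    corpus: Corpus = {}
    queries: Queries = {}
    qrels: Qrels = {}
    doc_ids: Dict[str, str] = {}  # context text -> generated document id

    for i, rec in enumerate(records):
        if limit is not None and i >= limit:
            break  # load only part of the data for quick tests
        if not rec["answers"]["text"]:
            continue  # unanswerable question: no relevant passage to link

        context = rec["context"]
        doc_id = doc_ids.get(context)
        if doc_id is None:
            # Many questions share one context, so store each passage once.
            doc_id = f"doc{len(doc_ids)}"
            doc_ids[context] = doc_id
            corpus[doc_id] = {"title": rec.get("title", ""), "text": context}

        queries[rec["id"]] = rec["question"]
        qrels[rec["id"]] = {doc_id: 1}

    return corpus, queries, qrels


# Tiny hand-written records in the squad_v2 shape (not real dataset rows).
sample = [
    {"id": "q1", "title": "T", "context": "A", "question": "Q1?",
     "answers": {"text": ["a"], "answer_start": [0]}},
    {"id": "q2", "title": "T", "context": "A", "question": "Q2?",
     "answers": {"text": ["a"], "answer_start": [0]}},
    {"id": "q3", "title": "T", "context": "B", "question": "Q3?",
     "answers": {"text": [], "answer_start": []}},
    {"id": "q4", "title": "T", "context": "B", "question": "Q4?",
     "answers": {"text": ["b"], "answer_start": [0]}},
]
corpus, queries, qrels = load_hf_to_beir(sample)
```

With the `datasets` library installed, the same function should accept `load_dataset("squad_v2", split="validation")` in place of `sample`, and `limit` keeps a Colab test run small.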
@mkossakowski19 can you link the branch for it?