Open Samoed opened 6 days ago
Hi @Samoed, thanks for all the effort you put in! I'll run what's missing.
@Samoed question: do you know why we don't have the number of qrels for retrieval datasets? E.g., for MIRACLRetrievalHardNegatives, we have num_queries and num_documents but not the number of qrels. We could maybe infer an estimate from average_relevant_docs_per_query, but any idea why we don't have the exact number?
{
'number_of_characters': 983901912,
'num_samples': 2460458,
'num_queries': 11076,
'num_documents': 2449382,
'min_document_length': 5,
'average_document_length': 0.1694358005407078,
'max_document_length': 176,
'unique_documents': 2449382,
'min_query_length': 1,
'average_query_length': 88794.41124954858,
'max_query_length': 48538,
'unique_queries': 11076,
'min_relevant_docs_per_query': 1,
'average_relevant_docs_per_query': 2.3643011917659806,
'max_relevant_docs_per_query': 20,
'unique_relevant_docs': 98836,
'num_instructions': None,
'min_instruction_length': None,
'average_instruction_length': None,
'max_instruction_length': None,
'unique_instructions': None,
'min_top_ranked_per_query': None,
...
}
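For reference, the inference mentioned above can be sketched as follows. This is only an estimate under the assumption that average_relevant_docs_per_query was computed as total qrels divided by num_queries; the variable name estimated_num_qrels is made up for illustration and is not a field in the stats.

```python
# Values copied from the MIRACLRetrievalHardNegatives stats quoted above.
num_queries = 11076
average_relevant_docs_per_query = 2.3643011917659806

# Assumption: average = total_qrels / num_queries,
# so total_qrels = average * num_queries.
# round() absorbs floating-point error in the stored average.
estimated_num_qrels = round(num_queries * average_relevant_docs_per_query)
print(estimated_num_qrels)
```

If the average was computed differently (for example, skipping some qrel entries), the estimate will not match the exact count, which is why storing the exact number is preferable.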
I forgot to add them. Can you add the missing fields? I can recompute the rest of the tasks.
@imenelydiaker how is this one going? do you need any help with that?
I launched it an hour ago and it's still running. I'm at MSMarcov2; the datasets are quite huge. Maybe we can split? Do you think you can do the datasets starting from Neuclir*?
@dokato if you still want to run some datasets, you'll need the code I added here to get the number of qrels.
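The linked code is not reproduced in this thread. As a rough idea of what counting qrels can look like, here is a minimal sketch, assuming the qrels are available as a nested mapping from query ID to a mapping of document ID to relevance score. The function name count_qrels and the toy data are hypothetical, not taken from the branch.

```python
def count_qrels(relevant_docs: dict[str, dict[str, int]]) -> int:
    """Count every (query, document) judgment in a nested qrels mapping."""
    return sum(len(docs) for docs in relevant_docs.values())


# Toy example (hypothetical data): two queries, three judgments in total.
toy_qrels = {
    "q1": {"d1": 1, "d2": 1},
    "q2": {"d3": 1},
}
print(count_qrels(toy_qrels))
```

Note that this counts all judged pairs, including any with a score of 0; whether those should be counted is a design choice the sketch does not settle.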
So just check out the branch above and run the script above, starting from Neuclir? Sure, I can do that!
Yes, better if you create a new branch from it so we won't have conflicts (because I'm working on it). For the PR, open it directly on the branch v2.0.0, thank you!
@imenelydiaker @Samoed
So in this PR I added 7/10 of the datasets from my part (starting from NeuCLIR2022Retrieval). Sadly, the datasets below failed repeatedly due to lack of memory:
Currently, I only have access to a D8s_v3 VM with 32 GB of RAM, 8 CPUs, and a 120 GB SSD, so I'd say not too bad. So maybe some optimizations are needed, or an even beefier machine. I'll be OOO for the next 4 days, but I can take another look at those if you have some pointers.
I can try to run them on my work machine.
@dokato thank you so much! @Samoed let me know if you can't run them.
I found a bug in the statistics calculation (I accidentally swapped docs and queries when calculating lengths) and will recalculate the incorrect stats.
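A simple consistency check can flag this kind of mix-up: an average length has to lie between the corresponding min and max. The sketch below applies that check to the length fields from the MIRACLRetrievalHardNegatives stats quoted earlier in the thread. The helper name find_inconsistent_lengths is hypothetical and not part of the actual script.

```python
def find_inconsistent_lengths(stats: dict, kinds=("document", "query")) -> list[str]:
    """Return the kinds whose average length falls outside [min, max]."""
    bad = []
    for kind in kinds:
        lo = stats[f"min_{kind}_length"]
        avg = stats[f"average_{kind}_length"]
        hi = stats[f"max_{kind}_length"]
        if not lo <= avg <= hi:
            bad.append(kind)
    return bad


# Length fields copied from the stats quoted earlier in this thread.
stats = {
    "min_document_length": 5,
    "average_document_length": 0.1694358005407078,
    "max_document_length": 176,
    "min_query_length": 1,
    "average_query_length": 88794.41124954858,
    "max_query_length": 48538,
}
print(find_inconsistent_lengths(stats))
```

Both averages fall outside their min/max range here, which is consistent with something being mixed up in the length calculation, though the check alone does not say what.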
I've added descriptive statistics to almost all tasks, but I need help running some of them. To do this, you can run the script from the v2.0.0 branch. cc @imenelydiaker