allenai / scirepeval

SciRepEval benchmark training and evaluation scripts
Apache License 2.0

Huggingface Dataset Config Mismatch #60

Closed tgebhart closed 10 months ago

tgebhart commented 10 months ago

I have run into an issue when downloading the HuggingFace (HF) evaluation datasets: errors are thrown when following the evaluation/benchmarking instructions given in the repository.

I believe I have localized this issue to the config on HF itself. For example, running

import datasets
data = datasets.load_dataset('allenai/scirepeval', 'scidocs_view_cite_read', split="evaluation")

as is done in the eval_datasets.py file, results in a ValueError, ultimately thrown by HF:

"name": "ValueError",
"message": "Couldn't cast
query: struct<doc_id: string, title: string, abstract: string, sha: string, corpus_id: uint64>
  child 0, doc_id: string
  child 1, title: string
  child 2, abstract: string
  child 3, sha: string
  child 4, corpus_id: uint64
pos: struct<doc_id: string, title: string, abstract: string, sha: string, corpus_id: uint64>
  child 0, doc_id: string
  child 1, title: string
  child 2, abstract: string
  child 3, sha: string
  child 4, corpus_id: uint64
neg: struct<doc_id: string, title: string, abstract: string, sha: string, corpus_id: uint64>
  child 0, doc_id: string
  child 1, title: string
  child 2, abstract: string
  child 3, sha: string
  child 4, corpus_id: uint64
-- schema metadata --
huggingface: '{"info": {"features": {"query": {"doc_id": {"dtype": "strin' + 732
to
{'doc_id': Value(dtype='string', id=None), 'corpus_id': Value(dtype='uint64', id=None), 'title': Value(dtype='string', id=None), 'abstract': Value(dtype='string', id=None), 'venue': Value(dtype='string', id=None), 'n_citations': Value(dtype='int32', id=None), 'log_citations': Value(dtype='float32', id=None)}
because column names don't match"

I believe this error is a result of the recent commits made to the HF repo over the past two weeks. I can verify this by pinning the load_dataset call to the initial version on HF via:

import datasets
data = datasets.load_dataset('allenai/scirepeval', 'scidocs_view_cite_read', split="evaluation", revision='bd4180cb5a4db6823e86dccd3c317f65dfe980ee')

which functions as expected.

Unfortunately, I am not familiar with HF dataset upload/config, so I cannot diagnose this issue further or submit a fix on HF myself.

amanpreet692 commented 10 months ago

#61 should resolve this.

Thanks for reporting this! The recent changes to the HF repo that you refer to were made to enable the dataset viewer on the Hub. Unfortunately, those changes were incompatible with the older version of the datasets library used by SciRepEval. The datasets version pinned in requirements.txt has now been updated, so pulling the latest changes and reinstalling with pip install -r requirements.txt should fix things.
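Concretely, the update amounts to refreshing the local checkout and reinstalling the pinned dependencies; a minimal sketch, assuming you installed from a local clone of the scirepeval repository:

```shell
# From the root of a local scirepeval checkout:
git pull                          # pick up the commit that updates requirements.txt
pip install -r requirements.txt   # reinstall, including the newer datasets pin
```

After reinstalling, the load_dataset call from the original report should work without pinning a revision.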

tgebhart commented 10 months ago

Thanks!