coastalcph / lex-glue

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Number of Target Fields in the SCOTUS dataset on HuggingFace #37

Closed AmanPriyanshu closed 1 year ago

AmanPriyanshu commented 1 year ago

The documentation for the SCOTUS dataset, part of the LexGLUE corpus, mentions 14 classes. However, when we check the SCOTUS dataset on HuggingFace using the method below, we only get 13 classes.

from datasets import load_dataset  # pip install datasets
import numpy as np

# Load the SCOTUS task once and reuse it for both splits.
scotus = load_dataset('lex_glue', 'scotus')

# Unique label ids in the training split.
labels = list(scotus['train']['label'])
classes = np.unique(labels)
print(classes, len(classes))

# Unique label ids in the test split.
labels = list(scotus['test']['label'])
classes = np.unique(labels)
print(classes, len(classes))

The results display only 13 unique classes instead of 14, as shown below.

image

Is there an issue with how we're extracting the data? If so, we'd greatly appreciate any help.

iliaschalkidis commented 1 year ago

Hi @AmanPriyanshu, no, you're right. There are 14 issue areas based on the SCDB documentation (http://scdb.wustl.edu/documentation.php?var=issueArea), but only 13 of those are represented at least once in our SCOTUS dataset.
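
To see which declared class never occurs, something like the following sketch could help. It is not part of the LexGLUE code base: `missing_classes` and `check_scotus` are hypothetical helpers, and the assumption that the 14 issue areas would be encoded as ids 0..13 is only a guess that may not match the actual label encoding on HuggingFace.

```python
from collections import Counter
from typing import Iterable, List


def missing_classes(labels: Iterable[int], num_declared: int) -> List[int]:
    """Return the class ids in 0..num_declared-1 that never occur in labels."""
    observed = Counter(labels)
    return [c for c in range(num_declared) if c not in observed]


# Toy example: 5 declared classes, class 3 never appears.
toy_labels = [0, 1, 1, 2, 4, 4, 0]
print(missing_classes(toy_labels, num_declared=5))  # [3]


def check_scotus(num_declared: int = 14) -> None:
    """Hypothetical helper; needs network access and the `datasets` package."""
    from datasets import load_dataset

    scotus = load_dataset("lex_glue", "scotus")
    # Pool the labels from every available split before counting.
    all_labels = [y for split in scotus.values() for y in split["label"]]
    # Assumption: issue areas are encoded as 0..num_declared-1.
    print(missing_classes(all_labels, num_declared=num_declared))
```

Calling `check_scotus()` pools all splits, so a class that is absent from one split but present in another is not reported as missing.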

AmanPriyanshu commented 1 year ago

I see, we were confused by the mention of 14 classes in the HuggingFace documentation. Thank you so much for replying and clarifying my doubt!!