Supreme Court Oral Arguments Corpus: Update Years

kakeith commented 2 years ago

For a recent project I'm working on, we're using ConvoKit's implementation of the Supreme Court Oral Argument Corpus. However, we'd really like to include data from after 2019.

How difficult would it be to run scripts to update the dataset for cases after 2019?

Thanks, Katie

cristiandnm commented 2 years ago

Hi Katie,

Happy to hear you are finding this data useful in your project! @tisjune developed this corpus, so she might be able to chime in and help with updating it. I don't really know how hard that would be (e.g., whether it involves any manual fixes), though, or whether she has the time at the moment.

Cristian

tisjune commented 2 years ago

Hi Katie -- Unfortunately I don't have a script that pulls or updates the dataset (or rather, I forgot the password to the machine that stores the collection of files that more or less document what I did), and there is some manual tinkering involved. In short, if you want to get started:

- The data from Oyez is quite well-formatted, especially for more recent years, so a lot of what I say might only apply to older cases, but it is nonetheless worth keeping in mind.
- A lot of metadata from Oyez can be found in the HTML source. I used this metadata and heavily supplemented it with info from SCDB. (I don't think Oyez has a neat database of metadata beyond whatever generated the HTML source.)
- Sometimes it's not actually clear which side a speaker is on, and Oyez doesn't consistently provide vote info from justices. I know the ConvoKit documentation says, rather annoyingly, "documentation forthcoming" on the procedure for inferring speaker side... but basically: (1) rely on the order in which advocates make their case; (2) merge with/check against info from SCDB.
- Some cases are heard over multiple "conversations": there is usually one main "conversation" and some precursors/followups (where I guess justices verbally decide to postpone the hearing or something?).
- IIRC there are some inconsistencies in case IDs between SCDB and Oyez. There was a database somewhere containing justice opinions that I used to match the few cases where the case IDs did not totally correspond. SCDB contains richer information about case outcomes than Oyez, so I think that even for more recent years I'd rely on it to provide that information.
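
To make the order-based heuristic above a bit more concrete, here is a rough Python sketch. It is not the code that built the corpus. The function names, the side labels, and the `reference` table (which would have to be assembled from SCDB or by hand) are all assumptions made for illustration, and so is the idea that the first advocate to speak argues for the petitioner.

```python
# Hypothetical side labels; the real corpus may encode sides differently.
PETITIONER = "petitioner"
RESPONDENT = "respondent"


def sides_from_speaking_order(advocate_turns):
    """Guess each advocate's side from the order of first appearance.

    Assumption: the first advocate to speak argues for the petitioner and
    the second for the respondent. Any later advocates (e.g. amici) are
    mapped to None so they can be reviewed manually.
    """
    first_seen = []
    for speaker in advocate_turns:
        # A rebuttal by an earlier speaker does not change the ordering.
        if speaker not in first_seen:
            first_seen.append(speaker)

    guesses = {}
    for position, speaker in enumerate(first_seen):
        if position == 0:
            guesses[speaker] = PETITIONER
        elif position == 1:
            guesses[speaker] = RESPONDENT
        else:
            guesses[speaker] = None
    return guesses


def disagreements(guesses, reference):
    """List advocates whose guessed side conflicts with a reference table.

    `reference` is a hypothetical dict of advocate name -> side, built
    from an external source; advocates missing from it are skipped.
    """
    return sorted(
        name
        for name, side in guesses.items()
        if name in reference and reference[name] != side
    )


if __name__ == "__main__":
    turns = ["advocate_a", "advocate_b", "advocate_c", "advocate_a"]
    guessed = sides_from_speaking_order(turns)
    print(guessed)
    print(disagreements(guessed, {"advocate_b": PETITIONER}))
```

Anything flagged by `disagreements`, or left as `None`, would still need the manual tinkering described above.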

kakeith commented 2 years ago

@tisjune @cristiandnm thanks for replying so quickly!

I'll pass this info on to my collaborators and see if there's interest in trying to update the corpus. If so, would you be interested in us contributing scripts to ConvoKit to make sure this corpus can continue to be updated in the future?

Thanks and best, Katie

cristiandnm commented 2 years ago

Thanks Katie,

Yes, we would definitely be interested in updating the dataset and having scripts ready for future updates. Let us know if we can help along the way.

biaoyanf commented 2 years ago

Hi @kakeith, I'm also interested in using this data with more recent years. How far have you gotten? If you do have the updated data, will it be made publicly available? Thanks!

CornellNLP / ConvoKit

Supreme Court Oral Arguments Corpus: Update Years #168