EleutherAI / the-pile

FreeLaw Project #33

Closed by thoppe 3 years ago

thoppe commented 3 years ago

Looks similar to #27:

URL https://www.courtlistener.com/api/bulk-info/

Free Law Project seeks to provide free access to primary legal materials, develop legal research tools, and support academic research on legal corpora. We work diligently with volunteers to expand our efforts at building an open source, open access, legal research ecosystem. Currently Free Law Project sponsors the development of CourtListener, Juriscraper, and RECAP. We currently have 423 courts that can be accessed with our APIs.

For each court there appears to be a file collecting all information on each case heard. A sample download of "Court of Appeals for the First Circuit" with 35K entries is about 500MB. Each entry seems to carry its content in a "text" or "html" field, the latter of which can be reduced to plain text with pandoc. There is definitely overlap with #27, though it's unclear how much. An example:

https://www.courtlistener.com/opinion/4242578/sandquist-v-lebo-automotive-inc/?q=Sandquist%20v.%20Lebo%20Automotive&type=o&order_by=score%20desc&stat_Precedential=on&court=cal

https://cite.case.law/cal-5th/1/233/

According to their site, it looks like they might be a strict subset. By the numbers it looks to be about half the size, "3,676,348: Number of precedential opinions in CourtListener." vs the claimed 6M for case.law. The upshot is that this service is free to access for all states w/o an account, so we can begin parsing immediately.
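
To make the "text" vs "html" handling concrete, here is a minimal sketch of how one entry could be reduced to plain text. It assumes each entry has already been loaded as a dict with optional "text" and "html" keys; the field names come from the description above, but the exact record layout is an assumption. It also uses Python's standard-library `html.parser` as a stand-in for pandoc so the snippet runs without external tools, and the names `html_to_text` and `record_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace (a rough stand-in for pandoc)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space can split a word that is only partly wrapped
    # in a tag; acceptable for a sketch, not for a production pipeline.
    return " ".join(" ".join(parser.parts).split())


def record_to_text(record):
    """Prefer the plain "text" field; fall back to converting "html"."""
    text = (record.get("text") or "").strip()
    if text:
        return text
    return html_to_text(record.get("html") or "")


if __name__ == "__main__":
    # Assumed record shape: only the "html" field is populated.
    sample = {"text": "", "html": "<p>Opinion of the <b>court</b>.</p>"}
    print(record_to_text(sample))
```

Preferring "text" and only falling back to "html" avoids an unnecessary conversion for entries that already ship plain text.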

thoppe commented 3 years ago

Closing as it looks like #27 is larger and already obtained.

thoppe commented 3 years ago

Reopening. I pulled this anyway and it's much larger than I thought once processed. Much of the raw data was formatted HTML, and even after converting that to text the dataset is over 50GB. I suggest we use this instead of #27 or, if we find a way to deduplicate, use both.
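
As a starting point for the deduplication idea, here is a minimal sketch of exact-match deduplication on normalized text. It is not something settled on in this thread: it assumes the documents from both sources are available as plain strings, and the `fingerprint` and `deduplicate` helpers are hypothetical names. It only catches copies that are identical after lowercasing and stripping punctuation and whitespace; the same opinion rendered differently by the two sources would likely need a fuzzy method such as MinHash.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing non-word characters."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    docs = ["Sandquist v. Lebo Automotive", "sandquist  v  lebo automotive", "Another opinion"]
    print(deduplicate(docs))
```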

StellaAthena commented 3 years ago

Agreed! Let's go with this.

thoppe commented 3 years ago

@StellaAthena when you get a chance, please assign me to this issue

thoppe commented 3 years ago

The data is pulled and the upload is complete. Replication code can be found at:

https://github.com/thoppe/The-Pile-FreeLaw

and a tmp link to the dataset can be found at:

https://drive.google.com/file/d/1L-x3g3V888gHNUVHQWDkJBJHs5N02Kjz/view?usp=sharing

Once it has been incorporated into The-Pile, I'll close the issue.