freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

How to get Georgia case information #1238

Closed BBC-Esq closed 1 week ago

BBC-Esq commented 1 week ago

I noticed that your program is somehow limited in getting Georgia law, is that correct? I have a Python script that gets it, if that would be helpful. Currently, I'm pulling over 117,000 cases from the Georgia Supreme Court and Court of Appeals combined, just those two.

Let me know if this is still an issue, and whether a pull request or just submitting a script in the issues would be better. Thanks.
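
For illustration, a minimal sketch of the kind of scrape described above. The URL, pagination scheme, and HTML selectors are hypothetical placeholders, not the actual Georgia court site structure:

```python
# Hypothetical sketch of a court-opinion metadata scrape. The URL,
# pagination parameter, and CSS selectors below are placeholders, not
# the real Georgia court site structure.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://courts.example.gov/opinions"  # placeholder URL

def fetch_opinion_rows(page: int) -> list[dict]:
    """Fetch one page of an opinion list and extract basic metadata."""
    resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for tr in soup.select("table.opinions tr"):  # placeholder selector
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"date": cells[0], "docket": cells[1], "name": cells[2]})
    return rows

if __name__ == "__main__":
    records = []
    for page in range(1, 6):  # first few pages, as a demo
        records.extend(fetch_opinion_rows(page))
    print(f"Collected {len(records)} opinion records")
```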

flooie commented 1 week ago

Hi @BBC-Esq - I'm not quite sure I understand what you are asking. Do you mind asking it again in a different way?

flooie commented 1 week ago

Meanwhile, CourtListener.com, the main repository for scraped opinions and other court data, has ~190,000 opinions from the Georgia Supreme Court and Court of Appeals.

https://www.courtlistener.com/?q=&type=o&order_by=score%20desc&stat_Published=on&court=ga%20gactapp
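
For illustration, the same query could presumably be run against CourtListener's REST API. This sketch assumes the `/api/rest/v3/search/` endpoint accepts the same parameters as the front-end URL above; check the API documentation to confirm:

```python
# Sketch of running the search above through CourtListener's REST API.
# Assumes the search endpoint mirrors the front-end query parameters;
# consult the API docs for the authoritative parameter names.
import requests

params = {
    "q": "",
    "type": "o",              # opinions
    "order_by": "score desc",
    "stat_Published": "on",
    "court": "ga gactapp",    # Georgia Supreme Court + Court of Appeals
}
resp = requests.get(
    "https://www.courtlistener.com/api/rest/v3/search/",
    params=params,
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("count"), "matching opinions")
```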

We are, of course, always open to volunteer pull requests if you have courts that are broken or that you want to add to the constellation of courts juriscraper can collect from.

BBC-Esq commented 1 week ago

My comment was specifically in response to this portion of your readme:

[screenshot of the referenced README section]

Since you have ~190,000... that's more than I have scraped. My offer was to help with scraping, but it seems that your README.md is just lagging behind...

With that being said, let me ask, good sir: what's the best way to download all the case info for Georgia? I'd love to index it and hook it up to my vector database program for a hybrid Lucene-based and vector-DB kind of search... entirely local.

The farthest I got was finally finding the data... I'm assuming there's a way to download only the necessary portions, and I could probably figure it out myself... but why not ask you, since you're such a nice guy and responded so promptly to my initial query? ;-)
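
For illustration, a toy sketch of the hybrid lexical + vector retrieval idea mentioned above, blending normalized BM25 scores with cosine similarity. The hash-based "embedding" is a stand-in so the example runs without a model; a real system would use an actual embedding model:

```python
# Toy sketch of hybrid lexical + vector retrieval. The hash-based
# "embedding" below is a deterministic placeholder, not a real model.
import hashlib
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = [
    "Supreme Court of Georgia opinion on contract damages",
    "Georgia Court of Appeals ruling on premises liability",
    "Order dismissing appeal for lack of jurisdiction",
]

def embed(text: str) -> np.ndarray:
    """Deterministic fake embedding (placeholder for a real model)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

bm25 = BM25Okapi([d.lower().split() for d in docs])
doc_vecs = np.stack([embed(d) for d in docs])

def hybrid_search(query: str, alpha: float = 0.5) -> list[tuple[float, str]]:
    """Blend normalized BM25 scores with cosine similarity."""
    lex = np.array(bm25.get_scores(query.lower().split()))
    lex = lex / (lex.max() or 1.0)      # avoid divide-by-zero on no hits
    sem = doc_vecs @ embed(query)       # cosine sim (vectors are unit norm)
    score = alpha * lex + (1 - alpha) * sem
    return sorted(zip(score, docs), reverse=True)

for s, d in hybrid_search("georgia appeals liability"):
    print(f"{s:.3f}  {d}")
```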

flooie commented 1 week ago

Gotcha, yes.

https://github.com/freelawproject/courtlistener/discussions?discussions_q=is%3Aopen+bulk

Go check out the discussions page on the CourtListener GitHub repo. I think you may be able to find answers to your questions about data over there.

mlissner commented 1 week ago

The bulk data documentation page is here too: https://www.courtlistener.com/help/api/bulk-data/
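
For illustration, a sketch of filtering a bulk-data export down to the Georgia courts. The filename and column names here are assumptions; the bulk-data help page linked above documents the real file layout:

```python
# Sketch of filtering a CourtListener bulk-data CSV to Georgia courts.
# The filename and the court_id column are assumptions; see the linked
# bulk-data docs for the actual schema.
import pandas as pd

GEORGIA_COURTS = {"ga", "gactapp"}  # court IDs used on CourtListener

# Bulk exports are large, so read in chunks to keep memory bounded.
chunks = pd.read_csv("opinion-clusters.csv.bz2", chunksize=100_000)
georgia = pd.concat(
    chunk[chunk["court_id"].isin(GEORGIA_COURTS)] for chunk in chunks
)
georgia.to_csv("georgia-clusters.csv", index=False)
print(f"Extracted {len(georgia)} Georgia rows")
```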

mlissner commented 1 week ago

Oh, and we're also working on semantic/vector search. It'll be a bit, but we're working on it!

BBC-Esq commented 1 week ago

Why don't you let me spearhead this one? I prefer TileDB because it's more robust, but I'm already familiar with ChromaDB if need be... In my experience, ChromaDB broke down a little when ingesting massive amounts of documents, although to be fair that was 6+ months ago, not too long after they switched to SQLite instead of ClickHouse or whatnot... it's been a while.

Anyway, TileDB has been darn reliable and robust ever since.
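
For illustration, a minimal sketch of the kind of local ingestion being discussed, using ChromaDB (TileDB's vector API differs; this just shows the shape of the workflow, with illustrative collection name, IDs, and documents):

```python
# Minimal local vector-DB ingestion sketch using ChromaDB. The
# collection name, IDs, and documents are illustrative; Chroma's
# default embedding function is used for simplicity.
import chromadb

client = chromadb.PersistentClient(path="./case_index")
collection = client.get_or_create_collection("georgia_cases")

collection.add(
    ids=["ga-0001", "gactapp-0001"],
    documents=[
        "Supreme Court of Georgia opinion on contract damages",
        "Georgia Court of Appeals ruling on premises liability",
    ],
    metadatas=[{"court": "ga"}, {"court": "gactapp"}],
)

results = collection.query(query_texts=["premises liability appeal"], n_results=2)
print(results["ids"])
```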

If I'm getting the benefit of massive amounts of case law from you guys' hard work, I should probably reciprocate, so... just point me in the right direction or to the people to talk with.

mlissner commented 1 week ago

Thanks for the offer! I don't think we're looking for volunteer help building semantic search (it's too big a project), but if you want to take a look at our volunteer backlog, there's lots of stuff on there, or if you want to just poke around in our bug trackers, there are endless things we need to do. :)

I'd suggest you start small though, learn how we work, let us get to know you and your style, and then scale up to bigger things like semantic search. We operate at a really big scale (data, users, API), so everything we do has to be done correctly the first time. We can't just hack together a search engine and release it. We have to move carefully and methodically.

Anyway, see what you can find and thanks!

BBC-Esq commented 1 week ago

Vector DBs are relatively easy for me at this point, but I understand your concern. Will poke around... and thanks for the data!