Closed (BBC-Esq closed 1 week ago)
Hi @BBC-Esq - I'm not quite sure I understand what you are asking? Do you mind asking it again in a different way?
Meanwhile, Courtlistener.com, the main repository for scraped opinions and other court data, has ~190,000 opinions from the Supreme Court and Court of Appeals of Georgia.
https://www.courtlistener.com/?q=&type=o&order_by=score%20desc&stat_Published=on&court=ga%20gactapp
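For anyone who wants to adapt the search link above to other courts, here is a minimal sketch of how its query string is put together. The parameter names and values are copied from that link; the helper name `build_search_url` is hypothetical, and nothing here is an official CourtListener client.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

BASE_URL = "https://www.courtlistener.com/"


def build_search_url(court_ids, query=""):
    """Rebuild the opinion-search link from its query parameters.

    court_ids: list of court identifiers as they appear in the link,
    e.g. ["ga", "gactapp"]. Parameter names are taken from the link itself.
    """
    params = {
        "q": query,                    # empty string means "match everything"
        "type": "o",                   # "o" is the value used in the link for opinions
        "order_by": "score desc",
        "stat_Published": "on",
        "court": " ".join(court_ids),  # court IDs are space-separated
    }
    # quote_via=quote encodes spaces as %20, matching the original link
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url(["ga", "gactapp"])
print(url)

# Going the other way: pull the court IDs back out of a search link
parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(parsed["court"][0].split())
```

Swapping in a different list of court IDs is the only change needed to point the same search at other courts, assuming the IDs are valid on the site.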
We are of course always open to volunteer pull requests if you have courts that are broken or that you want to include in the constellation of courts Juriscraper can collect from.
My comment was specifically in response to this portion of your readme:
Since you have ~190,000...that's more than I have scraped. My offer was to help with scraping, but it seems that your readme.md is just lagging...
With that being said, let me ask good sir, what's the best way to download all case info for Georgia? I'd love to index it and hook it up to my vector database program for a hybrid lucene-based and vectordb kind of search...entirely local.
The furthest I got was finally finding the data...I'm assuming there's a way to download only the necessary portions and I could probably figure it out myself...but why not ask you, since you're such a nice guy and responded so promptly to my initial query? ;-)
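To make the "hybrid lucene-based and vectordb" idea above concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion (RRF). This is an illustration only, not something described in this thread or used by CourtListener; the document IDs are made up, and `k=60` is just the conventional default for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: iterable of lists, each ordered best-first.
    Each document scores 1 / (k + rank) per list it appears in;
    the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two separate indexes
keyword_hits = ["case_a", "case_b", "case_c"]  # e.g. from a Lucene-style index
vector_hits = ["case_b", "case_d", "case_a"]   # e.g. from a vector database

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only needs the rank positions from each index, so it works without having to normalize keyword scores against vector distances.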
Gotcha, yes.
https://github.com/freelawproject/courtlistener/discussions?discussions_q=is%3Aopen+bulk
Go check out the discussions page of the courtlistener GitHub repository. I think you may be able to find answers to your questions about data over there.
The bulk data documentation page is here too: https://www.courtlistener.com/help/api/bulk-data/
Oh, and we're also working on semantic/vector search. It'll be a bit, but we're working on it!
Why don't you let me spearhead this one? I prefer TileDB because it's more robust, but I'm already familiar with ChromaDB if need be...In my experience, ChromaDB broke down a little when ingesting massive amounts of documents, although to be fair this was 6+ months ago, not too long after they switched to sqlite3 instead of "clickhouse" and whatnot...it's been a while.
Anyways, TileDB has been darn reliable and robust ever since.
If I get the benefit of massive amounts of case law from your guys' hard work I should probably reciprocate so...just point me in the right direction or the peoples to talk with.
Thanks for the offer! I don't think we're looking for volunteer help building semantic search (it's too big a project), but if you want to take a look at our volunteer backlog, there's lots of stuff on there, or if you want to just poke around in our bug trackers, there are endless things we need to do. :)
I'd suggest you start small though, learn how we work, let us get to know you and your style, and then scale up to bigger things like semantic search. We operate at a really big scale (data; users; API), so everything we do has to be done correctly the first time. We can't just hack together a search engine and release it. We have to move carefully and methodically.
Anyway, see what you can find and thanks!
Vector DBs are relatively easy for me at this point, but I understand your concern. Will poke around...and thanks for the data!
I noticed that your program is somehow limited in getting Georgia law, is that correct? I have a Python script that gets it, if that would be helpful. Currently, I'm getting over 117,000 cases from the combined Georgia Supreme Court and Court of Appeals, just those two.
Let me know if this is still an issue and whether a pull request or just submitting a script in the issues might be better. Thanks.